We Track Every AI Token. Here's Why That Matters More Than You Think.
TL;DR: We built a cost tracking system that captures every AI API call in our website generation pipeline—LLMs, image generation, web search—and writes a cost record to DynamoDB. We know exactly what each website costs, broken down by step, provider, and model. The system is fire-and-forget (never blocks generation), supports time-versioned pricing, and auto-expires dev records after 90 days.
Why Track AI Costs at the Token Level?
When you’re making dozens of AI calls to generate a single website, costs add up in unexpected places. Without granular tracking, you’re flying blind:
- Is the strategy step or the content step more expensive?
- Did switching from one model to another actually save money?
- Which businesses cost the most to generate? (Hint: restaurants with long menus.)
- Is our image generation spend growing faster than our LLM spend?
We needed answers at the per-call, per-step, per-business level. Helicone gives us provider-level observability, but we wanted cost data inside our own database—queryable, aggregatable, and tied to our business logic.
The Architecture
Fire-and-Forget Logger
The core design principle: cost logging never blocks website generation. If DynamoDB is slow or down, generation continues. The logger catches errors silently.
```typescript
export function logCost(entry: CostLogEntry): void {
  _writeRecord(entry).catch((err) => {
    console.warn('⚠️ [COST] Failed to log cost:', err?.message ?? err);
  });
}
```
No await. No error propagation. The caller moves on immediately.
What Gets Logged
Every cost record captures:
| Field | Example |
|---|---|
| `businessId` | `"joes-pizza-brooklyn"` |
| `stepId` | `"strategy"`, `"webSearch"`, `"google-image-search"` |
| `provider` | `"openai"`, `"anthropic"`, `"deepseek"`, `"fal"` |
| `model` | `"gpt-4o"`, `"claude-opus-4-5"`, `"flux-2"` |
| `usage` | `{ type: "token", inputTokens: 4200, outputTokens: 1800 }` |
| `durationMs` | `3400` |
| `environment` | `"production"` |
Four Pricing Models
Not all AI calls are billed the same way. We support four pricing types in a single JSON config:
- Token pricing: LLMs (input + output tokens, priced per million)
- Megapixel pricing: Some image generators charge by pixel area
- Per-image pricing: Others charge a flat rate per image
- Per-query pricing: Web search APIs charge per request
Each model has its own entry in a versioned pricing table. When a provider changes their rates, we add a new entry with a future effectiveDate. Historical records stay accurate because they were calculated with the pricing active at the time.
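To make the four pricing types concrete, here is a minimal sketch of how a single cost function could dispatch over them. The type shapes and field names (`inputPerMillion`, `perMegapixel`, and so on) are illustrative assumptions, not the exact production config:

```typescript
// Usage reported by a call, discriminated by pricing type.
type Usage =
  | { type: 'token'; inputTokens: number; outputTokens: number }
  | { type: 'megapixel'; megapixels: number }
  | { type: 'image'; images: number }
  | { type: 'query'; queries: number };

// Rates from the pricing config (field names are assumptions).
type Pricing =
  | { type: 'token'; inputPerMillion: number; outputPerMillion: number }
  | { type: 'megapixel'; perMegapixel: number }
  | { type: 'image'; perImage: number }
  | { type: 'query'; perQuery: number };

function computeCostUsd(usage: Usage, pricing: Pricing): number {
  if (usage.type === 'token' && pricing.type === 'token') {
    // LLMs: input and output tokens priced per million.
    return (
      (usage.inputTokens / 1_000_000) * pricing.inputPerMillion +
      (usage.outputTokens / 1_000_000) * pricing.outputPerMillion
    );
  }
  if (usage.type === 'megapixel' && pricing.type === 'megapixel') {
    return usage.megapixels * pricing.perMegapixel;
  }
  if (usage.type === 'image' && pricing.type === 'image') {
    return usage.images * pricing.perImage;
  }
  if (usage.type === 'query' && pricing.type === 'query') {
    return usage.queries * pricing.perQuery;
  }
  throw new Error(`Usage/pricing mismatch: ${usage.type} vs ${pricing.type}`);
}
```

A discriminated union keeps all four billing schemes behind one function, so callers never branch on pricing type themselves.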
Time-Versioned Pricing
Prices change constantly in the AI space. Our lookup picks the newest entry whose `effectiveDate` is on or before the request date:
```typescript
// Returns the pricing entry active on the given date
const entries = pricing[model]
  .filter(e => e.effectiveDate <= dateString)
  .sort((a, b) => b.effectiveDate.localeCompare(a.effectiveDate));
return entries[0];
```
This means we can look back at what a website cost to generate three months ago and get accurate numbers—even if prices have changed twice since then.
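A runnable sketch of that lookup, with a hypothetical two-entry rate history (the dollar figures are invented for illustration, not our actual rates):

```typescript
interface PricingEntry {
  effectiveDate: string; // ISO date, e.g. '2026-01-15'
  inputPerMillion: number;
  outputPerMillion: number;
}

// Hypothetical rate history: one price cut in January 2026.
const pricing: Record<string, PricingEntry[]> = {
  'gpt-4o': [
    { effectiveDate: '2025-06-01', inputPerMillion: 5, outputPerMillion: 15 },
    { effectiveDate: '2026-01-15', inputPerMillion: 2.5, outputPerMillion: 10 },
  ],
};

// Newest entry whose effectiveDate is <= the request date.
function pricingOn(model: string, dateString: string): PricingEntry | undefined {
  const entries = (pricing[model] ?? [])
    .filter((e) => e.effectiveDate <= dateString)
    .sort((a, b) => b.effectiveDate.localeCompare(a.effectiveDate));
  return entries[0];
}

// A request from last November still resolves to the old rate:
pricingOn('gpt-4o', '2025-11-03')?.inputPerMillion; // 5
// A request after the change picks up the new one:
pricingOn('gpt-4o', '2026-02-01')?.inputPerMillion; // 2.5
```

Because ISO date strings sort lexicographically, plain string comparison is enough; no date parsing is needed in the hot path.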
Integration: Zero-Touch for Developers
Developers writing new pipeline steps don’t need to think about cost tracking. It’s wired into our AI abstraction layer:
```typescript
function createUsageLogger(
  provider: string,
  model: string,
  businessId?: string,
  stepId?: string,
  startTime?: number
) {
  return (usage: { inputTokens: number; outputTokens: number }) => {
    logCost({
      businessId: businessId || 'unknown',
      stepId,
      provider,
      model,
      usage: { type: 'token', ...usage },
      durationMs: startTime ? Date.now() - startTime : undefined,
    });
  };
}
```
Every call to generateContent() or generateStructuredData() automatically creates a usage callback. When the provider returns token counts, the callback fires and logs the cost. No manual instrumentation needed.
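A self-contained sketch of that wiring. The provider response and `generateContent` internals here are stand-ins (a real SDK call replaces the fake response); only the callback plumbing mirrors the abstraction layer:

```typescript
interface CostLogEntry {
  businessId: string;
  stepId?: string;
  provider: string;
  model: string;
  usage: { type: string; inputTokens: number; outputTokens: number };
  durationMs?: number;
}

// Stand-in for the fire-and-forget DynamoDB writer.
const logged: CostLogEntry[] = [];
function logCost(entry: CostLogEntry): void {
  logged.push(entry);
}

function createUsageLogger(provider: string, model: string, businessId?: string, stepId?: string, startTime?: number) {
  return (usage: { inputTokens: number; outputTokens: number }) => {
    logCost({
      businessId: businessId || 'unknown',
      stepId,
      provider,
      model,
      usage: { type: 'token', ...usage },
      durationMs: startTime ? Date.now() - startTime : undefined,
    });
  };
}

async function generateContent(prompt: string, opts: { provider: string; model: string; businessId?: string; stepId?: string }): Promise<string> {
  const onUsage = createUsageLogger(opts.provider, opts.model, opts.businessId, opts.stepId, Date.now());
  // Fake provider call returning text plus token counts.
  const response = { text: `generated: ${prompt}`, usage: { inputTokens: 42, outputTokens: 7 } };
  onUsage(response.usage); // fires as soon as the provider reports usage
  return response.text;
}

void generateContent('hero section copy', { provider: 'openai', model: 'gpt-4o', businessId: 'joes-pizza-brooklyn', stepId: 'content' });
```

The pipeline step only calls `generateContent()`; the cost record is a side effect it never sees.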
DynamoDB Schema
We use a dual-access pattern:
By Business (primary key):
`PK = "BIZ#joes-pizza-brooklyn"`, `SK = "2026-03-03T14:22:00Z#uuid"`
- Query: “Show me all AI costs for this business”
By Date (GSI):
`GSI1PK = "DATE#2026-03-03"`, `GSI1SK = "2026-03-03T14:22:00Z#uuid"`
- Query: “Show me all AI costs from today”
Non-production records get a 90-day TTL. Production records persist indefinitely.
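A sketch of how the dual-access keys and TTL attribute could be derived for one record. The key formats follow the schema above; the attribute name `expiresAt` and the exact item shape are assumptions:

```typescript
// Builds the DynamoDB key attributes for one cost record.
function buildItemKeys(
  businessId: string,
  timestampIso: string, // e.g. '2026-03-03T14:22:00Z'
  uuid: string,
  environment: string
): Record<string, string | number> {
  const day = timestampIso.slice(0, 10); // '2026-03-03'
  const item: Record<string, string | number> = {
    PK: `BIZ#${businessId}`,               // per-business access
    SK: `${timestampIso}#${uuid}`,
    GSI1PK: `DATE#${day}`,                 // per-date access via GSI
    GSI1SK: `${timestampIso}#${uuid}`,
  };
  if (environment !== 'production') {
    // DynamoDB TTL reads an epoch-seconds attribute; expire in 90 days.
    item.expiresAt = Math.floor(Date.parse(timestampIso) / 1000) + 90 * 24 * 60 * 60;
  }
  return item;
}
```

Appending the UUID to the sort key keeps records unique even when two calls land in the same second.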
What We’ve Learned
Tracking every token across thousands of generated websites has revealed patterns we never would have found otherwise:
The expensive steps aren’t what you’d guess
Before we had data, we assumed content generation (the step that writes all the page copy) would be the most expensive. It wasn’t even close. The steps that require the most intelligence—strategy and research—dominate cost, even though they produce less raw output. A single strategy call that plans your entire site structure costs more than all six content sections combined.
Model choice matters more than prompt optimization
We spent weeks trying to optimize prompts to reduce token usage. It helped marginally. Then we ran the same pipeline through different models and saw cost differences of 10–20x for comparable quality. Picking the right model for each step is the single biggest lever on unit economics.
Some businesses are 3–4x more expensive to generate than others
Restaurants with extensive menus, multi-location service companies, and businesses with complex service hierarchies require significantly more tokens. We now know this in advance and can plan accordingly.
Image generation is a bigger cost driver than expected
When you’re generating hero images, section photos, and logos for every site, image costs add up fast. This is where model routing and quality thresholds matter most—you don’t need your best image model for every section background.
Admin Dashboard
We built an admin API that aggregates costs by business, provider, and step:
```
GET /api/admin/costs?from=2026-03-01&to=2026-03-03&limit=5000
```
Response includes total spend, per-business breakdowns, per-provider splits, and per-step attribution. We review this weekly to catch cost anomalies—like when a prompt change accidentally doubled our token usage on the assembly step.
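The aggregation behind that endpoint can be sketched as a simple group-and-sum over the raw records. Field names follow the record shape shown earlier; the endpoint internals are an assumption:

```typescript
interface CostRecord {
  businessId: string;
  provider: string;
  stepId?: string;
  costUsd: number;
}

// Groups raw cost records into the breakdowns the dashboard shows.
function aggregate(records: CostRecord[]) {
  const sumBy = (key: (r: CostRecord) => string) =>
    records.reduce<Record<string, number>>((acc, r) => {
      const k = key(r);
      acc[k] = (acc[k] ?? 0) + r.costUsd;
      return acc;
    }, {});
  return {
    total: records.reduce((s, r) => s + r.costUsd, 0),
    byBusiness: sumBy((r) => r.businessId),
    byProvider: sumBy((r) => r.provider),
    byStep: sumBy((r) => r.stepId ?? 'unknown'),
  };
}
```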
What We’d Do Differently
- Track actual tokens, not estimates. Our evaluation scripts use `promptLength / 4` for token estimation. In production, we use actual counts from provider responses. The evaluation scripts should too.
- Add cost alerts. We track costs but don’t alert when a single business costs significantly more than the average. We’ve seen outliers that we only discovered during weekly reviews.
The Takeaway
If you’re building an AI product with multiple providers and multiple call types, instrument cost tracking from day one. Not tomorrow. Not after launch. Day one.
The data pays for itself immediately:
- Model selection: Hard data on cost-per-quality, not vibes
- Regression detection: Prompt changes that double costs show up instantly
- Unit economics: Know your margins at the individual-customer level
- Provider negotiation: Show your rep exactly how many tokens you’re running monthly
The specifics of our cost structure are proprietary—but the architecture to track it isn’t. Every AI product should know exactly what it costs to serve each customer. Most don’t. That’s a competitive advantage we’re not giving up.
Try It
Every website generated on WebZum has its costs tracked at the token level. The AI picks the most cost-effective provider for each step, and we invest heavily in the generation pipeline to deliver a quality result. That’s how we keep pricing at $19/month with everything included.