The AI margin problem: why scale should cut your cost per call
AI app margins get squeezed by per-request inference cost, and often worsen as you grow, which caps valuation. Here is how to invert the curve so cost per call falls as volume rises.
Blog
On the economics of running AI in production: why inference cost decides your margins, how to build cost discipline from day one, and how to route the repetitive traffic to near-free models while holding a parity guarantee. Written for founders and the engineers who own the bill.
Find product-market fit on the strongest model, install observability from your first call even on free credits, then route the repetitive traffic to a near-free model. The traces are the asset.
Support inboxes are the most repetitive LLM workload most teams run. Route the intents you can certify to a near-free classifier, defer the rest to the LLM, and cut the bill without losing accuracy.
Buyer-intent scoring runs an LLM over every reply. Score the clear-cut leads with a small classifier, defer the borderline ones, and cut the cost of qualifying the bulk of your pipeline.
Agents spend most of their tokens deciding which tool to call next, and that decision repeats. Route it through a small classifier and cut end-to-end agent cost without changing the agent's behaviour.
Route the clear-cut replies to a small classifier, defer the nuanced ones, and cut the cost of prioritizing your inbox.
Route the clear-cut companies to a small classifier, defer the ambiguous ones, and cut the cost of firmographic tagging.
Route the clear-cut items to a small classifier, defer the rest, and cut the cost of classifying a high-volume feed.
Route the common intents to a small classifier, defer the rare and ambiguous ones, and cut the per-turn cost of your bot.
Route the fintech support intents you can certify to a small classifier, defer the rest, and cut the bill while holding accuracy.
Route the clearly benign and clearly violating content to a small classifier, and keep the borderline cases on the LLM and human review.
Route the repeat e-commerce questions to a small classifier, defer the complex ones, and cut the bill at peak.
Route the predictable questions with a small classifier, defer the novel ones, and cut the cost of the routing step.
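The posts above share one pattern: let a small classifier answer the inputs it can certify and defer everything else to the LLM. Here is a minimal sketch of that pattern in Python, not the implementation behind any of the posts. The names `route`, `blended_cost`, `toy_classify`, and `toy_llm` are hypothetical, and so are the 0.9 threshold and the cost figures. They are placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple


@dataclass
class Routed:
    label: str
    source: str  # "classifier" or "llm"


def route(
    text: str,
    classify: Callable[[str], Tuple[str, float]],
    ask_llm: Callable[[str], str],
    certified: Set[str],
    threshold: float = 0.9,  # assumed cut-off; tune it against your own traces
) -> Routed:
    """Answer with the small classifier only when it is confident about a
    certified label; otherwise defer to the LLM."""
    label, confidence = classify(text)
    if label in certified and confidence >= threshold:
        return Routed(label, "classifier")
    return Routed(ask_llm(text), "llm")


def blended_cost(share_routed: float, cost_small: float, cost_llm: float) -> float:
    """Average cost per call when `share_routed` of traffic skips the LLM."""
    return share_routed * cost_small + (1 - share_routed) * cost_llm


# Toy stand-ins so the sketch runs; a real system would call actual models.
def toy_classify(text: str) -> Tuple[str, float]:
    if "refund" in text.lower():
        return ("refund", 0.97)
    return ("other", 0.40)


def toy_llm(text: str) -> str:
    return "escalate"


certified_labels = {"refund"}
easy = route("I want a refund", toy_classify, toy_llm, certified_labels)
hard = route("My account looks odd", toy_classify, toy_llm, certified_labels)
print(easy, hard)

# Illustrative units only: with half the traffic routed, the average halves.
print(blended_cost(0.5, 0.0, 2.0))
```

The average cost per call falls as the routed share rises. Under this sketch's assumptions, that is how more repetitive volume can lower cost per call: a larger share of traffic lands on intents the classifier is certified for.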