The AI margin problem: why scale should cut your cost per call
Software got its valuations from a simple fact: serving one more customer cost almost nothing, so margins widened as the business grew. AI products broke that. Every call pays a frontier model, so growth adds cost instead of diluting it. Here is the problem in plain terms, why it shows up in your valuation, and the shape of the fix.
Short answer
AI app gross margins are squeezed by per-request inference cost, and for many products they get worse as usage grows. That depresses the valuation multiple, because revenue that carries a usage-linked cost is worth less than high-margin recurring revenue. The fix is to make the cost per call fall as volume rises, which is exactly what happens when more of your repetitive traffic gets routed to a near-free model.
The margin every AI founder eventually hits
A traditional software company runs at roughly 70 to 80 percent gross margin. The cost of serving one more request rounds to zero, so each new customer makes the business more profitable per dollar of revenue. That is a large part of why software earns the multiples it does.
An AI product does not get this for free. It pays a frontier model on every call, and that inference cost is a real cost of goods sold that rises with every active user and every feature that calls the model. Gross margins land well below the software benchmark, and the gap widens with the most engaged customers, who are also the most expensive to serve.
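To make the arithmetic concrete, here is a minimal sketch with made-up numbers: a flat 50 dollar monthly plan and a hypothetical 0.01 dollar frontier call. The function name and every figure are illustrative assumptions, not benchmarks.

```python
def gross_margin(revenue_per_user: float, calls_per_user: int,
                 cost_per_call: float, other_cogs_per_user: float = 0.0) -> float:
    """Gross margin as a fraction of revenue for one user in one billing period."""
    cogs = calls_per_user * cost_per_call + other_cogs_per_user
    return (revenue_per_user - cogs) / revenue_per_user


# Hypothetical flat-price plan: $50 per user per month, $0.01 per frontier call.
light = gross_margin(50.0, 500, 0.01)    # 500 calls  -> $5 of inference COGS
heavy = gross_margin(50.0, 3000, 0.01)   # 3000 calls -> $30 of inference COGS

print(f"light user margin: {light:.0%}")
print(f"heavy user margin: {heavy:.0%}")
```

Under these assumed numbers the light user sits at a 90 percent margin and the heavy user at 40 percent, even though both pay the same price. The most engaged customer is the one who pulls the blended margin down.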
Why this shows up in your valuation
A revenue multiple is a bet on how much of each dollar eventually becomes profit, and how durable that dollar is. Recurring software revenue at high margin converts cleanly and compounds, so it earns a premium. Revenue that drags a per-request cost behind it converts to less profit and looks more exposed to model pricing and usage spikes, so the market pays less for it. Two companies with the same revenue and growth can be valued very differently once you account for the cost structure underneath.
The usual cost levers only slow the bleed
The standard moves are a smaller prompt, a cheaper general model, and a cache. Each helps. Each also leaves the fundamental shape intact: cost still climbs with every call you serve. You have lowered the slope of the line. The line still points up and to the right, in step with your traffic.
| Approach | Effect on cost per call | Effect as you scale |
|---|---|---|
| Smaller prompt | Lower by a fixed factor | Still grows with volume |
| Cheaper general model | Lower by a fixed factor | Still grows with volume |
| Cache exact repeats | Lower on identical hits | Still grows with variety |
| Certify the repetitive slice | Falls toward zero on that slice | Improves with volume |
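A minimal sketch of why the first three rows only lower the slope. Each lever is modelled as a fixed multiplier on the frontier price, and the cache is assumed to hold a constant hit rate. The price and every factor below are hypothetical placeholders; real values depend on your workload.

```python
FRONTIER_COST = 0.010  # hypothetical dollars per frontier call

# Hypothetical fixed reduction factors (assumptions for illustration only).
LEVERS = {
    "no optimisation": 1.00,
    "smaller prompt": 0.70,
    "cheaper general model": 0.40,
    "exact-match cache, 30% hit rate": 0.70,
}


def monthly_cost(calls: int, factor: float) -> float:
    """Inference bill when every call costs a fixed fraction of the frontier price."""
    return calls * FRONTIER_COST * factor


for name, factor in LEVERS.items():
    at_100k = monthly_cost(100_000, factor)
    at_1m = monthly_cost(1_000_000, factor)
    ratio = at_1m / at_100k
    print(f"{name:<32} ${at_100k:>8,.0f} -> ${at_1m:>9,.0f}  ({ratio:.0f}x)")
```

Every row shows the same ratio: ten times the traffic means ten times the bill. The levers change the starting point, not the relationship between volume and cost.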
The dynamic you actually want
You want the cost per call to fall as you grow, so margins widen with scale the way software does. That requires something counterintuitive: more traffic has to make each call cheaper, on average, rather than just adding more identical cost.
Repetition is what makes that possible. A product at scale runs the same kinds of decisions over and over. The more volume you have, the more of it is predictable, and the more of it can be answered by a small model that costs almost nothing per call. The expensive frontier model stays in place for the genuinely hard and novel requests.
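The arithmetic behind this can be sketched as a weighted average of two prices. The prices and the mapping from volume to the share of traffic a small model can handle are assumptions chosen for illustration; how fast that share actually grows depends on how repetitive your traffic is.

```python
def blended_cost_per_call(small_model_share: float, frontier_cost: float,
                          small_cost: float) -> float:
    """Average cost per call when a share of traffic goes to the small model."""
    if not 0.0 <= small_model_share <= 1.0:
        raise ValueError("small_model_share must be between 0 and 1")
    return small_model_share * small_cost + (1.0 - small_model_share) * frontier_cost


FRONTIER_COST = 0.010   # hypothetical dollars per frontier call
SMALL_COST = 0.0001     # hypothetical dollars per small-model call

# Assumed, not measured: share of calls the small model serves at each volume.
ASSUMED_SHARE_BY_VOLUME = {10_000: 0.10, 100_000: 0.40, 1_000_000: 0.70}

blended = {
    volume: blended_cost_per_call(share, FRONTIER_COST, SMALL_COST)
    for volume, share in ASSUMED_SHARE_BY_VOLUME.items()
}

for volume, cost in blended.items():
    print(f"{volume:>9,} calls/month -> ${cost:.5f} per call")
```

The average falls only because the share rises with volume. If the share stayed flat, this would behave like any other fixed-factor reduction.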
How TRACER inverts the curve
TRACER learns from your own traces which slices of traffic are safe to serve from a near-zero-cost classifier, and certifies each slice against your teacher model with a calibrated parity gate. As your volume grows, the certified share grows with it, so the blended cost per call declines as you scale. Growth starts widening your margins instead of compressing them. The reference point is our Obside case study, where a frontier call per item was replaced by a 38-cell surrogate, with 95 percent saved and accuracy held against the teacher.
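As a rough illustration of what a parity gate could look like, the sketch below certifies a slice only when a conservative lower bound on its agreement with the teacher clears a threshold. This is not TRACER's API or its calibration method. The Wilson score bound is swapped in as one standard choice, and the function names, the trace format, and the 0.95 threshold are all assumptions.

```python
import math
from collections import defaultdict


def wilson_lower_bound(agreements: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an agreement rate."""
    if total == 0:
        return 0.0
    p = agreements / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    spread = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - spread) / denom


def certify_slices(traces, min_agreement: float = 0.95) -> set:
    """traces: iterable of (slice_id, teacher_label, surrogate_label) tuples."""
    counts = defaultdict(lambda: [0, 0])  # slice_id -> [agreements, total]
    for slice_id, teacher, surrogate in traces:
        counts[slice_id][0] += int(teacher == surrogate)
        counts[slice_id][1] += 1
    return {
        slice_id
        for slice_id, (agree, total) in counts.items()
        if wilson_lower_bound(agree, total) >= min_agreement
    }


# Hypothetical held-out traces for three slices.
traces = (
    [("refund", "approve", "approve")] * 199 + [("refund", "approve", "deny")]
    + [("rare", "a", "a")] * 20                      # perfect, but too few samples
    + [("complaint", "x", "x")] * 170 + [("complaint", "x", "y")] * 30
)
certified = certify_slices(traces)
print(sorted(certified))
```

Gating on the lower bound rather than the raw agreement rate means a slice with 20 perfect matches is still held back until more traces arrive, while a slice with 199 matches out of 200 passes.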
The shift in one line: the more traffic you have, the more of it TRACER can move off the frontier model, so your cost per call goes down as you grow rather than up.
This is why we built an open-source library as well as a hosted product. Capturing your traces and inspecting your own partition costs nothing, and it is the first step to turning your inference bill from a tax on growth into something that scales in your favour. The practical sequence is in the cost playbook for AI startups.
Frequently asked questions
Why are AI app gross margins worse than SaaS?
Classic software has near-zero marginal cost per request, so gross margins sit around 70 to 80 percent and improve with scale. An AI product pays a frontier model on every call, so inference is a real cost of goods sold that grows with usage. Margins land lower and do not get the same lift from growth.
Why does a low gross margin cap valuation?
Investors pay a multiple on revenue that depends on how much of it reaches the bottom line and how durable it is. Revenue tied to a per-request cost that scales with usage converts to less profit and looks more fragile, so it earns a lower multiple than high-margin recurring software.
Do smaller models or caching fix the margin problem?
They lower the per-call cost by a roughly fixed factor, but the cost still scales linearly with volume. They slow the bleed. They do not change the shape of the curve, where every new call adds cost.
How does TRACER make cost per call fall as volume rises?
More traffic means more repetition. As volume grows, a larger share of calls becomes predictable enough to certify for a near-zero-cost classifier, with a parity gate holding quality. The blended cost per call drops as you scale, which is the dynamic that improves margins with growth.
TRACER is open source. Run pip install tracer-llm, point it at your traces, and see how much of your traffic is already predictable. The hosted version meters live savings and certifies activation at app.tracerml.ai.