How to cut LLM costs on chatbot and assistant intent detection
Conversational assistants call a language model on every turn to detect the user's intent before they can act. A handful of common intents make up most of the traffic. Here is how to detect the common ones with a near-free classifier and keep the LLM for the rare and ambiguous turns.
Short answer
Chatbot intent detection runs on every turn and is dominated by a few common intents. Route the high-frequency intents you can certify to a small classifier, defer the rare and ambiguous turns to the LLM, and cut the per-turn cost of your assistant without changing its intent set.
Why per-turn intent detection with an LLM gets expensive
An assistant calls a language model on each user turn to map it to an intent before routing. A balance check is one call. A store-hours question is another. The model is doing real work, but a small set of common intents makes up most of the volume. Multiplied across every turn of every conversation, the intent step alone is a large and constant cost.
Which intents are safe to detect with a cheap model
Group your past turns by the intent your LLM assigned, then check how consistent each group is on held-out turns. High-frequency, single-meaning intents form tight, consistent regions that a small model can reproduce. Rare intents and genuinely ambiguous or multi-intent turns stay on the LLM.
- Safe to route: high-frequency, single-meaning intents like balance checks, hours, status, simple how-to.
- Keep on the LLM: rare intents, multi-intent turns, anything that reads unlike your normal traffic.
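
The grouping check above can be sketched in a few lines. This is a minimal illustration, not the TRACER API: the function name `agreement_by_intent` and the sample labels are hypothetical, and it assumes you already have held-out predictions from some small classifier alongside the LLM's labels.

```python
from collections import defaultdict

def agreement_by_intent(llm_labels, cheap_labels):
    """Return {intent: (share_of_traffic, agreement)} on held-out turns.

    llm_labels:   intents the LLM (the teacher) assigned
    cheap_labels: intents a small classifier predicted for the same turns
    """
    if len(llm_labels) != len(cheap_labels):
        raise ValueError("label lists must have the same length")
    totals = defaultdict(int)
    matches = defaultdict(int)
    # Group by the teacher's label, as described in the text above.
    for teacher, cheap in zip(llm_labels, cheap_labels):
        totals[teacher] += 1
        matches[teacher] += int(teacher == cheap)
    n = len(llm_labels)
    return {
        intent: (totals[intent] / n, matches[intent] / totals[intent])
        for intent in totals
    }

# Made-up held-out labels, for illustration only.
llm = ["balance", "balance", "balance", "hours", "hours", "dispute"]
cheap = ["balance", "balance", "balance", "hours", "balance", "hours"]
stats = agreement_by_intent(llm, cheap)
for intent, (share, agreement) in stats.items():
    print(f"{intent}: share={share:.2f} agreement={agreement:.2f}")
```

Intents with a high share and high agreement are the candidates for the cheap path. A raw agreement rate on a handful of turns is not yet a guarantee, so it should not be used as the routing criterion on its own.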
How much can you actually save
The savings equal the share of traffic you can certify times the price gap between your teacher model (the LLM you already call) and a small classifier. The classifier costs close to nothing next to a frontier call, so the certified share is the number that matters. The reference point on this site is the Obside case study, a clean classification stream where a 38-cell surrogate replaced the frontier call for a 95 percent saving.
| Turn | Before | After |
|---|---|---|
| Certified turn | Frontier LLM call | Small classifier, near-zero |
| Rare or ambiguous | Frontier LLM call | Frontier LLM call (deferred) |
Assistants with a very large, fine-grained intent set save less, and we are honest about that. Many near-identical intents fragment into small regions that are harder to certify. The right move there is a hybrid that certifies the common intents and defers the long tail, rather than forcing coverage.
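The savings arithmetic in this section can be written out as a small calculation. This is a sketch under stated assumptions: the function name and the per-turn prices are hypothetical, and it assumes a deferred turn pays only the LLM price.

```python
def estimated_savings(certified_share, llm_cost_per_turn, classifier_cost_per_turn=0.0):
    """Fraction of the intent-detection bill removed by routing.

    certified_share: fraction of turns routed to the small classifier (0..1)
    Costs are per turn, in any currency, as long as both use the same unit.
    """
    if not 0.0 <= certified_share <= 1.0:
        raise ValueError("certified_share must be between 0 and 1")
    if llm_cost_per_turn <= 0:
        raise ValueError("llm_cost_per_turn must be positive")
    before = llm_cost_per_turn
    # Certified turns pay the classifier price, deferred turns pay the LLM price.
    after = (certified_share * classifier_cost_per_turn
             + (1.0 - certified_share) * llm_cost_per_turn)
    return (before - after) / before

# Hypothetical prices: the LLM call costs 0.002 per turn, the classifier ~0.
saving = estimated_savings(certified_share=0.8, llm_cost_per_turn=0.002)
print(f"estimated saving: {saving:.0%}")
```

When the classifier cost is negligible, the result collapses to the certified share itself, which is why that share is the number to watch.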
How do you prove quality holds
Each region carries a calibrated lower bound on how often the cheap path will match the teacher, computed on held-out turns. A region only routes to the small model when that bound clears the target you set. Everything else defers. You get an audit trail per region: the dominant label, real examples, and the error bound, so a conversational AI lead can see why a region is safe before any real traffic moves. For why this matters to your unit economics, see the AI margin problem.
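To make the idea of a lower bound concrete, here is a minimal sketch using the Wilson score interval. That choice is an assumption: the text does not say which bound TRACER computes, and the function names, the 0.95 target, and the counts below are hypothetical.

```python
import math

def wilson_lower_bound(matches, total, z=1.96):
    """Lower end of the Wilson score interval for an agreement rate.

    matches: held-out turns where the cheap path matched the teacher
    total:   held-out turns in the region
    z:       normal quantile; 1.96 gives a one-sided 97.5 percent level
    """
    if total == 0:
        return 0.0  # no evidence, so no guarantee
    p = matches / total
    denom = 1.0 + z * z / total
    centre = p + z * z / (2.0 * total)
    margin = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total))
    return (centre - margin) / denom

def certify(matches, total, target=0.95, z=1.96):
    """Route a region to the cheap path only if the bound clears the target."""
    return wilson_lower_bound(matches, total, z) >= target

# Perfect agreement on 10 turns is weaker evidence than on 100 turns.
print(certify(10, 10), certify(100, 100))
```

The point of bounding rather than averaging is visible in the two calls: a small region with a perfect observed score can still fail to certify, because too few held-out turns back it up.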
How to cut the cost, step by step
What you need: a few thousand recent turns, each paired with the label your LLM already produced. No hand-labelling. Your own traffic is the training signal.
- Export your intent traces. Pull recent user turns paired with the intent your LLM detected for each.
- Build the partition. Run pip install tracer-llm and fit on your traces. TRACER groups turns by the intent the LLM assigned, then learns where a new turn lands.
- Read the certified intents. See which intents clear your target agreement on held-out turns. Each region shows its dominant intent, real example turns, and its error bound.
- Activate the common intents. Route the certified intents to the small classifier and keep rare or ambiguous turns on the LLM. The out-of-distribution gate sends unfamiliar turns back to the teacher model.
- Meter and re-certify. Track live coverage, savings, and agreement. As your assistant's usage evolves, re-fit so the guarantee keeps holding.
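
The activation and metering steps above can be sketched as a small router. Everything here is hypothetical and stands in for your own components: `CheapPrediction`, `route_turn`, and the two stub functions are illustrative names, not part of the tracer-llm package, and the keyword stub only imitates a classifier with an out-of-distribution flag.

```python
from dataclasses import dataclass

@dataclass
class CheapPrediction:
    intent: str
    in_distribution: bool  # verdict of the out-of-distribution gate

def route_turn(turn, cheap_predict, llm_detect, certified_intents):
    """Return (intent, path), where path is "classifier" or "llm"."""
    pred = cheap_predict(turn)
    # Use the cheap path only for familiar turns in a certified intent.
    if pred.in_distribution and pred.intent in certified_intents:
        return pred.intent, "classifier"
    # Everything else defers to the teacher model.
    return llm_detect(turn), "llm"

def stub_classifier(turn):
    """Keyword stand-in for a small classifier plus OOD gate."""
    text = turn.lower()
    if "balance" in text:
        return CheapPrediction("check_balance", True)
    if "hours" in text or "open" in text:
        return CheapPrediction("store_hours", True)
    return CheapPrediction("unknown", False)

def stub_llm(turn):
    """Stand-in for the real LLM call."""
    return "llm_intent"

certified = {"check_balance"}  # intents whose bound cleared the target
turns = ["What is my balance?", "When do you open?", "My card was charged twice"]
results = [route_turn(t, stub_classifier, stub_llm, certified) for t in turns]
paths = [path for _, path in results]
coverage = paths.count("classifier") / len(paths)  # live coverage meter
print(results, f"coverage={coverage:.2f}")
```

Keeping the routing decision in one function makes it easy to log the path taken for every turn, which is the raw data for tracking coverage and for deciding when to re-fit.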
The open-source library runs the whole flow locally. The hosted version adds a live meter, a savings estimate, and one-click activation once a region is certified, at app.tracerml.ai.
Frequently asked questions
Can a small model detect intent as well as an LLM?
For the common, single-meaning intents, yes, and in single-digit milliseconds. The rare or ambiguous turns defer to the LLM. You only route a turn to the cheap model when a calibrated accuracy bound clears your target.
How much does per-turn intent detection cost at scale?
Every turn of every conversation triggers a call. At high conversation volume, the intent step is a large and constant line item. Moving the common intents to a classifier removes that cost from the bulk of the turns, and the cheap path is fast enough to keep the bot responsive.
Does this work with my intent set?
Yes. TRACER learns whatever intents your LLM already produces. It uses your past decisions as the training signal, so the intent set is unchanged.
What happens to a rare or ambiguous turn?
It defers to the LLM. An out-of-distribution gate routes anything that does not resemble certified traffic back to the teacher model, so an unusual turn is never resolved by the cheap path on a guess.
TRACER is open source. Run pip install tracer-llm, point it at your traces, and see which turns certify. The hosted version adds a live meter and one-click activation at app.tracerml.ai.