How to cut LLM costs on question routing and FAQ deflection
Support and retrieval systems call a language model to route each incoming question: which canned answer, which retriever or index, or whether to escalate. Many questions map to a small set of known answers. Here is how to route the predictable ones with a near-free classifier and keep the LLM for the novel cases.
Short answer
Question routing and FAQ deflection run an LLM on every question to decide where it goes. Most questions map to a known answer or retriever. Route the predictable ones through a small classifier you have certified, defer the novel ones to the LLM, and cut the cost of the routing step itself.
Why routing every question with an LLM gets expensive
A support or retrieval system calls a language model on each question to decide what to do: answer from a known FAQ, send it to a particular retriever or index, or escalate to a human. Many of those questions are familiar and map cleanly to one destination. The model is doing real work, and most of that work is the same routing decision over and over. At high question volume, the routing step alone becomes a standing cost, and it sits in front of every other cost in the pipeline.
Which questions are safe to route with a cheap model
Group your past questions by the route your LLM chose, then check how consistent each group is on held-out questions. The questions that map cleanly to one FAQ answer or one retriever form tight regions a small model reproduces. The novel, multi-part, or genuinely ambiguous questions stay on the LLM.
- Safe to route: questions that map to a known FAQ answer or a single retriever or index.
- Keep on the LLM: novel or multi-part questions, and anything ambiguous about where it should go.
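The consistency check described above can be sketched in plain Python. This is a minimal illustration, not TRACER's API: it assumes you already have, for each held-out question, the route a cheap model predicted and the route your LLM chose. The function name and the sample routes are hypothetical.

```python
from collections import defaultdict

def agreement_by_route(held_out):
    """Summarise how often the cheap model matches the LLM, per predicted route.

    held_out: iterable of (cheap_route, llm_route) pairs for held-out questions.
    Returns {cheap_route: (matches, total, agreement_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # route -> [matches, total]
    for cheap_route, llm_route in held_out:
        counts[cheap_route][1] += 1
        if cheap_route == llm_route:
            counts[cheap_route][0] += 1
    return {route: (m, n, m / n) for route, (m, n) in counts.items()}

# Hypothetical held-out pairs, for illustration only.
pairs = [
    ("faq_reset_password", "faq_reset_password"),
    ("faq_reset_password", "faq_reset_password"),
    ("faq_reset_password", "faq_reset_password"),
    ("billing_index", "billing_index"),
    ("billing_index", "escalate"),
]
report = agreement_by_route(pairs)
for route, (matches, total, rate) in sorted(report.items()):
    print(f"{route}: {matches}/{total} agree ({rate:.0%})")
```

Routes with high agreement over many held-out questions are candidates for the cheap path. Routes with low agreement, or too few examples to judge, stay on the LLM.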
How much can you actually save
The savings equal the share of traffic you can certify, times the price gap between your teacher model and a small classifier. The classifier cost is close to zero next to a frontier call, so the certified share is the number that matters. The reference point on this site is the getclaw case study, the same idea applied to agents, where routing tool selection through a TRACER classifier cut end-to-end cost by around 50 percent.
| Question | Before | After |
|---|---|---|
| Certified question | Frontier LLM call | Small classifier, near-zero |
| Rare or ambiguous | Frontier LLM call | Frontier LLM call (deferred) |
Routing sits in front of retrieval, so a cheap routing decision also lets you skip work downstream. When the classifier can confidently route a question to a known answer, you can deflect it without a retrieval call or a generation call at all.
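As a rough worked example of the savings arithmetic for the routing step, the sketch below multiplies the certified share by the price gap. All numbers are hypothetical placeholders, not measured results, and the function name is an assumption for illustration.

```python
def routing_savings(questions, certified_share, teacher_cost, classifier_cost=0.0):
    """Estimated saving on the routing step alone, in the same unit as the costs.

    questions: number of questions in the period.
    certified_share: fraction of traffic routed by the small classifier (0 to 1).
    teacher_cost: cost of one routing call to the teacher LLM.
    classifier_cost: cost of one classifier call, assumed near zero by default.
    """
    if not 0.0 <= certified_share <= 1.0:
        raise ValueError("certified_share must be between 0 and 1")
    return questions * certified_share * (teacher_cost - classifier_cost)

# Hypothetical inputs: 100,000 questions, 60% certified, 0.002 per LLM routing call.
saving = routing_savings(100_000, 0.60, 0.002)
baseline = 100_000 * 0.002
print(f"saving: {saving:.2f} of {baseline:.2f} ({saving / baseline:.0%})")
```

This covers only the routing call. Any retrieval or generation calls skipped through deflection would add to the figure and are not modelled here.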
How do you prove quality holds
Each region carries a calibrated lower bound on how often the cheap path will match the teacher, computed on held-out questions. A region only routes to the small model when that bound clears the target you set. Everything else defers. You get an audit trail per region: the dominant label, real examples, and the error bound, so a support or platform lead can see why a region is safe before any real traffic moves. For why this matters to your unit economics, see the AI margin problem.
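To make the idea of a lower bound on agreement concrete, here is a minimal sketch using the Wilson score interval. This is an assumption chosen for illustration: the text above does not say which bound TRACER computes, and the function names are hypothetical.

```python
import math

def wilson_lower_bound(matches, total, z=1.96):
    """Wilson score lower bound on the true agreement rate.

    matches: held-out questions where the cheap path matched the teacher.
    total: held-out questions in the region.
    z: normal quantile; 1.96 is the two-sided 95% value, so the lower
       bound on its own holds at roughly the 97.5% level.
    """
    if total == 0:
        return 0.0  # no evidence, so nothing can be certified
    p = matches / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom

def is_certified(matches, total, target, z=1.96):
    """A region routes to the small model only if the bound clears the target."""
    return wilson_lower_bound(matches, total, z) >= target

# Same observed agreement (98%), different amounts of held-out evidence.
print(is_certified(490, 500, target=0.95))
print(is_certified(49, 50, target=0.95))
```

The two calls show why the bound matters more than the raw rate: with the same observed agreement, the region with fewer held-out questions has a looser bound and is less likely to clear a given target.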
How to cut the cost, step by step
What you need: a few thousand recent questions, each paired with the label your LLM already produced. No hand-labelling. Your own traffic is the training signal.
- Export your routing traces. Pull recent questions paired with the route your LLM chose: which FAQ answer, which retriever, or escalate.
- Build the partition. Run pip install tracer-llm and fit on your traces. TRACER groups questions by the route the LLM chose, then learns where a new question goes.
- Read the certified routes. See which routes clear your target agreement on held-out questions. Each region shows its dominant route, real example questions, and its error bound.
- Activate the predictable routes. Route the certified questions through the small classifier and defer novel or ambiguous questions to the LLM. The out-of-distribution gate sends unfamiliar questions back to the teacher model.
- Meter and re-certify. Track live coverage, savings, and agreement. As your FAQ and content change, re-fit so the guarantee keeps holding.
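The activation step above can be sketched as a small dispatch function. This is an illustrative outline under stated assumptions, not TRACER's API: the classifier, the out-of-distribution check, and the LLM call are passed in as plain callables, and the toy stand-ins below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    route: str
    source: str  # "classifier" or "llm"

def route_question(question, classify, is_in_distribution, certified_routes, ask_llm):
    """Use the cheap path only for familiar questions on certified routes."""
    if is_in_distribution(question):
        predicted = classify(question)
        if predicted in certified_routes:
            return Decision(predicted, "classifier")
    # Unfamiliar question or uncertified route: defer to the teacher model.
    return Decision(ask_llm(question), "llm")

# Toy stand-ins so the sketch runs; a real system would plug in its own.
def toy_classify(q):
    return "faq_reset_password" if "password" in q.lower() else "billing_index"

def toy_in_distribution(q):
    return len(q.split()) <= 12  # placeholder gate, not a real OOD test

def toy_llm(q):
    return "escalate"

certified = {"faq_reset_password"}
a = route_question("How do I reset my password?", toy_classify,
                   toy_in_distribution, certified, toy_llm)
b = route_question("Why was I charged twice?", toy_classify,
                   toy_in_distribution, certified, toy_llm)
print(a, b)
```

Keeping deferral as the default branch means any question that fails either check falls through to the LLM, so the cheap path is used only when both conditions hold.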
The open-source library runs the whole flow locally. The hosted version adds a live meter, a savings estimate, and one-click activation once a region is certified, at app.tracerml.ai.
Frequently asked questions
Can a small model route questions as well as an LLM?
For the predictable questions that map cleanly to one destination, yes. The novel or ambiguous questions defer to the LLM. You only route a question through the cheap model when a calibrated accuracy bound clears your target.
How does routing cheaply also cut downstream cost?
Routing sits in front of retrieval and generation. When the classifier confidently routes a question to a known answer, you can deflect it without a retrieval call or a generation call, so the saving compounds beyond the routing step.
Does this work with my retrievers and FAQ set?
Yes. TRACER learns whatever routes your LLM already chooses. It uses your past decisions as the training signal, so your routing targets are unchanged.
What happens to a novel question?
It defers to the LLM. An out-of-distribution gate routes anything that does not resemble certified traffic back to the teacher model, so a novel question is never routed by the cheap path on a guess.
TRACER is open source. Run pip install tracer-llm, point it at your traces, and see which questions certify. The hosted version adds a live meter and one-click activation at app.tracerml.ai.