How to cut LLM costs on question routing and FAQ deflection

Support and retrieval systems often call a language model to route each incoming question: to a canned answer, to a particular retriever or index, or to a human. Many questions map to a small set of known answers. Here is how to route the predictable ones with a near-free classifier and keep the LLM for the novel cases.

Short answer

Question routing and FAQ deflection usually put an LLM call in front of every question to decide where it goes, yet most questions map to a known answer or retriever. Route the predictable ones through a small classifier you have certified, defer the novel ones to the LLM, and you cut the cost of the routing step itself.

Why routing every question with an LLM gets expensive

A support or retrieval system calls a language model on each question to decide what to do: answer from a known FAQ, send it to a particular retriever or index, or escalate to a human. Many of those questions are familiar and map cleanly to one destination. The model is doing real work, but most of that work is the same routing decision made over and over. At high question volume, the routing step alone is a standing cost, and it sits in front of every other cost in the pipeline.

Which questions are safe to route with a cheap model

Group your past questions by the route your LLM chose, then check how consistent each group is on held-out questions. Questions that map cleanly to one FAQ answer or one retriever form tight regions that a small model can reproduce. Novel, multi-part, or genuinely ambiguous questions stay on the LLM.
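As a rough illustration of that consistency check, the sketch below groups held-out questions by the route the LLM chose and reports how often a small model picked the same route. It is not TRACER's implementation: the function name, the pair format, and the route names are assumptions made up for this example.

```python
from collections import defaultdict

def agreement_by_route(held_out):
    """Per-route agreement between teacher (LLM) and student (small model).

    held_out: iterable of (teacher_route, student_route) pairs, one per
    held-out question. This pair format is a hypothetical trace shape.
    Returns {route: (matches, total, agreement_rate)}.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for teacher_route, student_route in held_out:
        totals[teacher_route] += 1
        if student_route == teacher_route:
            matches[teacher_route] += 1
    return {
        route: (matches[route], totals[route], matches[route] / totals[route])
        for route in totals
    }

# Illustrative data only; the route names are invented.
pairs = [
    ("faq_refund", "faq_refund"),
    ("faq_refund", "faq_refund"),
    ("escalate", "faq_refund"),
    ("escalate", "escalate"),
]
for route, (hit, total, rate) in agreement_by_route(pairs).items():
    print(f"{route}: {hit}/{total} agree ({rate:.0%})")
```

Routes with high agreement on enough held-out questions are the candidates for the cheap path. A raw rate on a handful of questions is not yet evidence of safety.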

How much can you actually save

The savings equal the share of traffic you can certify, times the price gap between your teacher model and a small classifier. The classifier cost is close to zero next to a frontier call, so the certified share is the number that matters. The reference point on this site is the getclaw case study, the same idea applied to agents, where routing tool selection through a TRACER classifier cut end-to-end cost by around 50 percent.

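| Question           | Before            | After                        |
|--------------------|-------------------|------------------------------|
| Certified question | Frontier LLM call | Small classifier, near-zero  |
| Rare or ambiguous  | Frontier LLM call | Frontier LLM call (deferred) |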
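To make the arithmetic concrete, here is a minimal sketch of the savings estimate: certified share times the per-question price gap, scaled by volume. The function name and every number in the usage lines are hypothetical placeholders, not measured prices or results.

```python
def routing_cost(n_questions, certified_share, teacher_cost, classifier_cost=0.0):
    """Estimate routing-step cost before and after certification.

    certified_share: fraction of traffic routed by the small classifier (0..1).
    teacher_cost / classifier_cost: cost per routing decision, same currency.
    Simplification: deferred questions are charged the teacher call only.
    Returns (cost_before, cost_after, saving).
    """
    if not 0.0 <= certified_share <= 1.0:
        raise ValueError("certified_share must be between 0 and 1")
    certified = n_questions * certified_share
    deferred = n_questions - certified
    before = n_questions * teacher_cost
    after = certified * classifier_cost + deferred * teacher_cost
    return before, after, before - after

# Hypothetical inputs: 100,000 questions, 60% certified,
# 0.002 per LLM routing call, 0.00001 per classifier call.
before, after, saved = routing_cost(100_000, 0.6, 0.002, 0.00001)
print(f"before={before:.2f} after={after:.2f} saved={saved:.2f}")
```

Because the classifier cost is tiny next to the teacher cost, the saving is driven almost entirely by `certified_share`.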

Routing sits in front of retrieval, so a cheap routing decision also lets you skip work downstream. When the classifier can confidently route a question to a known answer, you can deflect it without a retrieval call or a generation call at all.

How do you prove quality holds

Each region carries a calibrated lower bound on how often the cheap path will match the teacher, computed on held-out questions. A region only routes to the small model when that bound clears the target you set. Everything else defers. You get an audit trail per region: the dominant label, real examples, and the error bound, so a support or platform lead can see why a region is safe before any real traffic moves. For why this matters to your unit economics, see the AI margin problem.
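The post does not say which statistical bound TRACER computes, so the sketch below uses a one-sided Wilson score lower bound as a stand-in to show the shape of the check. The function names, the region dictionary format, and the 0.95 target are assumptions for illustration.

```python
import math

def wilson_lower_bound(matches, total, z=1.645):
    """One-sided Wilson score lower bound on the true agreement rate.

    matches: held-out questions where the cheap path matched the teacher.
    total:   held-out questions in the region.
    z=1.645 corresponds to roughly 95% one-sided confidence.
    """
    if total == 0:
        return 0.0
    p = matches / total
    denom = 1.0 + z * z / total
    centre = p + z * z / (2.0 * total)
    margin = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total))
    return max(0.0, (centre - margin) / denom)

def certified_regions(regions, target=0.95):
    """Keep regions whose lower bound clears the target; the rest defer.

    regions: {name: (matches, total)}, a hypothetical summary format.
    """
    return {
        name: wilson_lower_bound(m, n)
        for name, (m, n) in regions.items()
        if wilson_lower_bound(m, n) >= target
    }

# Illustrative counts only.
regions = {"faq_refund": (100, 100), "faq_shipping": (10, 10), "billing": (80, 100)}
print(certified_regions(regions))
```

With these illustrative counts only the region with 100 of 100 matches clears the target. The region with 10 of 10 has the same raw agreement but too few held-out questions to support a 0.95 bound, which is the point of certifying on a bound instead of a raw rate.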

How to cut the cost, step by step

What you need: a few thousand recent questions, each paired with the label your LLM already produced. No hand-labelling. Your own traffic is the training signal.

  1. Export your routing traces

    Pull recent questions paired with the route your LLM chose: which FAQ answer, which retriever, or escalate.

  2. Build the partition

    Run pip install tracer-llm and fit on your traces. TRACER groups questions by the route the LLM chose, then learns where a new question goes.

  3. Read the certified routes

    See which routes clear your target agreement on held-out questions. Each region shows its dominant route, real example questions, and its error bound.

  4. Activate the predictable routes

    Route the certified questions through the small classifier and defer novel or ambiguous questions to the LLM. The out-of-distribution gate sends unfamiliar questions back to the teacher model.

  5. Meter and re-certify

    Track live coverage, savings, and agreement. As your FAQ and content change, re-fit so the guarantee keeps holding.
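Steps 4 and 5 can be pictured as a small dispatch function with a meter attached. This is a sketch under stated assumptions, not the tracer-llm API: `classify`, `call_llm`, `RoutingMeter`, and the route names are hypothetical, and `classify` is assumed to return `None` when its out-of-distribution gate rejects a question.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

@dataclass
class RoutingMeter:
    """Counts how much traffic the cheap path handled."""
    total: int = 0
    cheap: int = 0

    @property
    def coverage(self) -> float:
        return self.cheap / self.total if self.total else 0.0

def route_question(
    question: str,
    classify: Callable[[str], Optional[str]],  # hypothetical small classifier
    call_llm: Callable[[str], str],            # hypothetical teacher LLM call
    certified_routes: Set[str],
    meter: RoutingMeter,
) -> str:
    """Use the cheap route only when it is certified; otherwise defer."""
    meter.total += 1
    predicted = classify(question)  # None means the gate flagged the question
    if predicted is not None and predicted in certified_routes:
        meter.cheap += 1
        return predicted
    return call_llm(question)

# Stub callables standing in for real models.
def toy_classifier(q: str) -> Optional[str]:
    return "faq_refund" if "refund" in q.lower() else None

def toy_llm(q: str) -> str:
    return "escalate"

meter = RoutingMeter()
for q in ["How do I get a refund?", "My invoice and my API key both broke"]:
    print(q, "->", route_question(q, toy_classifier, toy_llm, {"faq_refund"}, meter))
print("coverage:", meter.coverage)
```

Logging the deferred questions together with the teacher's answers gives you fresh traces for the next re-fit.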

The open-source library runs the whole flow locally. The hosted version adds a live meter, a savings estimate, and one-click activation once a region is certified, at app.tracerml.ai.

Frequently asked questions

Can a small model route questions as well as an LLM?

For the predictable questions that map cleanly to one destination, yes. The novel or ambiguous questions defer to the LLM. You only route a question through the cheap model when a calibrated accuracy bound clears your target.

How does routing cheaply also cut downstream cost?

Routing sits in front of retrieval and generation. When the classifier confidently routes a question to a known answer, you can deflect it without a retrieval call or a generation call, so the saving compounds beyond the routing step.

Does this work with my retrievers and FAQ set?

Yes. TRACER learns whatever routes your LLM already chooses. It uses your past decisions as the training signal, so your routing targets are unchanged.

What happens to a novel question?

It defers to the LLM. An out-of-distribution gate routes anything that does not resemble certified traffic back to the teacher model, so a novel question is never routed by the cheap path on a guess.

TRACER is open source. Run pip install tracer-llm, point it at your traces, and see which questions certify. The hosted version adds a live meter and one-click activation at app.tracerml.ai.
