How to cut LLM costs on customer support ticket triage

Support inboxes are the most repetitive LLM workload most teams run. The same intents come back thousands of times a day, and you are paying a frontier model to re-decide them every time. Here is how to move the repeats to a near-free classifier and keep the rest on the LLM.

Short answer

Most support tickets fall into a small set of repeated intents. Route the share you can certify to a small classifier for near-zero cost, defer the rare and ambiguous tickets to the LLM, and you keep accuracy while cutting the bill on the bulk of the traffic.

Why support triage is expensive

A support classifier built on an LLM reads each ticket and labels it: billing question, password reset, refund request, cancellation, fraud report, shipping delay. The work is genuinely easy for most tickets, and that is the problem. You are spending frontier-model tokens on "where is my order" several thousand times a day. The cost scales with volume, and volume is exactly what a growing support operation produces.

The fix is to stop calling a language model at all for the tickets whose answer is already decided by your own history. A smaller prompt or a cheaper general model still pays per token on every one of them.

Which tickets are safe to move off the LLM

Group your past tickets by the decision your LLM made, then look at how consistent each group is. A "password reset" region where the LLM, acting as the teacher, assigned the same intent to almost every held-out example is safe to serve from a small model. A region that mixes refunds and fraud reports is not, so it stays on the LLM. The split is per region, so you certify the clean intents and leave the messy ones alone.
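A minimal sketch of that per-region consistency check, assuming each held-out ticket has already been assigned to a region. The region ids, intent names, ticket counts, and the 0.98 cutoff are illustrative and not taken from TRACER.

```python
from collections import Counter, defaultdict

def region_consistency(held_out):
    """Map region id -> (dominant intent, share of tickets with that intent, ticket count)."""
    by_region = defaultdict(list)
    for region_id, teacher_label in held_out:
        by_region[region_id].append(teacher_label)

    summary = {}
    for region_id, labels in by_region.items():
        dominant, count = Counter(labels).most_common(1)[0]
        summary[region_id] = (dominant, count / len(labels), len(labels))
    return summary

# Hypothetical held-out tickets: (region id, intent the teacher LLM assigned).
held_out = (
    [("r1", "password_reset")] * 99 + [("r1", "billing")] * 1
    + [("r2", "refund")] * 55 + [("r2", "fraud_report")] * 45
)

summary = region_consistency(held_out)
for region_id, (intent, share, n) in sorted(summary.items()):
    verdict = "candidate for the small model" if share >= 0.98 else "stays on the LLM"
    print(f"{region_id}: {intent} {share:.2f} over {n} tickets -> {verdict}")
```

The raw share is only a first look. A small region can show a high share by chance, which is why the routing decision should rest on a lower bound rather than on the observed rate.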

How much can you actually save

The savings equal the share of traffic you can certify, times the price gap between your teacher model and a small classifier. The classifier cost is close to zero next to a frontier call, so the certified share is the number that matters. On a clean, repetitive stream that share can run very high. In our Obside case study, a 38-cell surrogate replaced the per-item frontier call on 95 percent of calls while holding accuracy against the teacher.

| Traffic | Before | After |
| --- | --- | --- |
| Certified intents | Frontier LLM call | Small classifier, near-zero |
| Rare or ambiguous | Frontier LLM call | Frontier LLM call (deferred) |
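The savings arithmetic can be sketched in a few lines. Every number here (ticket volume, certified share, per-ticket prices) is a made-up placeholder, so substitute your own figures.

```python
def monthly_cost(tickets, certified_share, teacher_cost, classifier_cost=0.0):
    """Return (cost before, cost after) for one month of triage.

    certified_share is the fraction of tickets served by the small classifier;
    the remaining tickets still go to the teacher LLM.
    """
    if not 0.0 <= certified_share <= 1.0:
        raise ValueError("certified_share must be between 0 and 1")
    before = tickets * teacher_cost
    after = tickets * (
        certified_share * classifier_cost + (1.0 - certified_share) * teacher_cost
    )
    return before, after

# Hypothetical inputs: 300,000 tickets a month, 80 percent certified,
# $0.004 per frontier call, $0.00001 per classifier call.
before, after = monthly_cost(300_000, 0.80, 0.004, 0.00001)
savings = before - after  # equals tickets * share * (teacher_cost - classifier_cost)
print(f"before ${before:,.2f}, after ${after:,.2f}, saved ${savings:,.2f}")
```

Because the classifier price is tiny, the saved fraction of the bill tracks the certified share almost exactly.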

Fine-grained taxonomies save less, and we are honest about that. A support model with dozens of near-identical intents fragments into small regions that are harder to certify. The right move there is a hybrid that certifies what it can and defers the rest, rather than forcing coverage.

How do you prove accuracy holds

Each region carries a calibrated lower bound on how often the cheap path will match the teacher, computed on held-out tickets. A region only routes to the small model when that bound clears the target you set, for example 98 percent agreement. Everything else defers. You get an audit trail per region: the dominant intent, real example tickets, and the error bound, so a support lead can see why a region is safe before any real traffic moves. For the bigger picture on why this matters to your unit economics, see the AI margin problem.
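To make the idea of a per-region lower bound concrete, here is a sketch that uses the Wilson score interval. This is an assumption for illustration: the article does not say which bound TRACER computes, and the counts below are invented.

```python
import math

def wilson_lower_bound(agreements, total, z=1.96):
    """Lower end of the Wilson score interval for an agreement rate.

    z=1.96 corresponds to a one-sided confidence level of about 97.5 percent.
    """
    if total <= 0:
        return 0.0
    p = agreements / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)

def should_route(agreements, total, target=0.98):
    """Route a region to the small model only if its lower bound clears the target."""
    return wilson_lower_bound(agreements, total) >= target

# Hypothetical regions: (held-out tickets where the cheap path matched the teacher, total).
large_region = (990, 1000)  # 99 percent observed agreement, plenty of evidence
small_region = (49, 50)     # 98 percent observed agreement, little evidence

print(should_route(*large_region), should_route(*small_region))
```

The small region defers even though its observed agreement equals the target, because 50 held-out tickets are not enough evidence to bound the true rate at 98 percent.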

How to cut the cost, step by step

What you need: a few thousand recent support tickets, each paired with the intent your LLM already assigned to it. No hand-labelling. Your own traffic is the training signal.
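As a rough picture of what a usable export looks like, the sketch below reads traces from JSON Lines. The field names `text` and `llm_intent` are hypothetical; use whatever your logging already records, and check the TRACER documentation for the format it expects.

```python
import json
from collections import Counter

def load_traces(lines):
    """Parse JSON Lines into (ticket text, LLM-assigned intent) pairs.

    Blank lines and records missing either field are skipped.
    """
    traces = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text, intent = record.get("text"), record.get("llm_intent")
        if text and intent:
            traces.append((text, intent))
    return traces

# Hypothetical export; in practice, pass an open file handle instead of a list.
sample = [
    '{"text": "I forgot my password", "llm_intent": "password_reset"}',
    '{"text": "Where is my order?", "llm_intent": "shipping_delay"}',
    '{"text": "Cannot log in, reset link expired", "llm_intent": "password_reset"}',
    '{"text": "", "llm_intent": "billing"}',
]

traces = load_traces(sample)
intent_counts = Counter(intent for _, intent in traces)
print(len(traces), intent_counts.most_common())
```

A quick count per intent shows whether the common intents have enough examples to be worth certifying.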

  1. Collect your support traces

    Export recent tickets with the label your LLM produced for each. A few thousand is enough to cover the common intents.

  2. Build the partition

    Run pip install tracer-llm and fit on your traces. TRACER groups tickets by the decision the LLM made, then learns where a new ticket lands.

  3. Read the certified intents

    Open the report and see which intents clear your target agreement, for example 98 percent, on held-out tickets. Each one shows its dominant intent, real example tickets, and its error bound.

  4. Activate the safe intents

    Route the certified intents to the small classifier and keep rare, ambiguous, or sensitive tickets on the LLM. The out-of-distribution gate sends anything unfamiliar back to the teacher model.

  5. Meter and re-certify

    Watch live coverage, savings, and agreement. As your ticket mix shifts over time, re-fit so the accuracy guarantee keeps holding.
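Steps 4 and 5 can be sketched as a small router with a counter attached. This is not the TRACER API: `assign_region`, `call_llm`, the region ids, and the keyword matching are all stand-ins invented to show the control flow.

```python
from collections import Counter

def make_router(assign_region, certified, call_llm):
    """Build a route(ticket) function plus a counter of which path served each ticket.

    assign_region(ticket) returns a region id, or None when the ticket looks
    out of distribution. certified maps certified region ids to their intent.
    """
    stats = Counter()

    def route(ticket):
        region_id = assign_region(ticket)
        if region_id in certified:       # certified region: cheap path
            stats["classifier"] += 1
            return certified[region_id]
        stats["llm"] += 1                # uncertified or unfamiliar: defer to the teacher
        return call_llm(ticket)

    return route, stats

# Toy stand-ins for a fitted partition and a real LLM call.
def toy_assign_region(ticket):
    text = ticket.lower()
    if "password" in text:
        return "r_password"
    if "refund" in text:
        return "r_refund"
    return None  # nothing familiar: treat as out of distribution

def stub_llm(ticket):
    return "llm_decided"

certified = {"r_password": "password_reset"}  # r_refund did not clear the target
route, stats = make_router(toy_assign_region, certified, stub_llm)

tickets = [
    "I forgot my password",
    "I want a refund, this charge looks fraudulent",
    "My drone caught fire",
]
results = [route(t) for t in tickets]
coverage = stats["classifier"] / sum(stats.values())
print(results, f"coverage={coverage:.2f}")
```

Tracking coverage this way, together with spot-checked agreement against the teacher, gives you the signal for when a re-fit is due.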

The open-source library runs the whole flow locally. The hosted version adds a live meter, a savings estimate, and one-click activation once a region is certified, at app.tracerml.ai.

Frequently asked questions

How much can you save on LLM support triage?

It depends on how repetitive your tickets are. On predictable streams, teams certify a large share of traffic to a small classifier and pay frontier prices only on the rare or ambiguous tickets. In one production case on a clean stream, 95 percent of calls moved to a small surrogate at parity with the teacher model.

Will routing tickets to a cheap model hurt accuracy?

Only the tickets that clear a calibrated accuracy bound get routed to the cheap model. Every other ticket is deferred to the LLM. The bound is computed per region of traffic on held-out data, so the cheap path carries a guarantee rather than a guess.

Do I need labeled data to start?

No. You use your existing LLM as the teacher. TRACER learns from the prompt and the model's own answer, so your past traffic is the training signal. A few thousand traces are enough to start certifying the easy intents.

What about unusual or sensitive tickets?

They defer to the LLM by design. An out-of-distribution gate sends anything that does not look like certified traffic back to the teacher model, so a strange ticket is never force-fit to the cheap path.

TRACER is open source. Run pip install tracer-llm, point it at your support traces, and see which intents certify. The hosted version adds a live meter and one-click activation at app.tracerml.ai.
