How to cut LLM costs on email reply prioritization
Sales and support teams run an LLM over every inbound reply to decide what to handle now: hot reply now, warm followup, cold nurture, skip. Most of those replies are obvious. Here is how to triage the obvious ones with a near-free classifier and keep the LLM for the judgment calls.
Short answer
Reply prioritization is high-volume and mostly repetitive. Route the clear-cut replies to a small classifier you have certified against your own LLM, defer the ambiguous ones, and you cut the cost of triaging the bulk of your inbox without changing the buckets your team works with.
Why prioritizing every reply with an LLM gets expensive
Inbox tools call a language model on each reply to rank how urgently a human should act. A hard unsubscribe is one call. A clear book-a-meeting is another. A flat not-interested is another. The model is doing real work, and most of that work is the same few rankings over and over. At inbox scale, the triage step alone is a standing bill.
Which replies are safe to triage with a cheap model
Group your past replies by the priority your LLM assigned, then check how consistent each group is on held-out replies. The unambiguous buckets, a hard opt-out or a clear meeting request, form tight regions a small model reproduces. The murky ones, a lukewarm reply that could be a warm followup or a cold nurture, stay on the LLM.
- Safe to route: explicit opt-outs, clear hot replies, obvious skips and bounces.
- Keep on the LLM: multi-topic replies, ambiguous tone, anything that reads unlike your normal inbox.
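The consistency check described above can be sketched in a few lines. This is a minimal illustration, not TRACER's implementation: the function name `bucket_agreement` and the pair format are assumptions, and `cheap_label` stands for the output of whatever small classifier you are evaluating.

```python
from collections import defaultdict

def bucket_agreement(heldout_pairs):
    """Per-bucket agreement between the LLM label and a cheap model's label.

    heldout_pairs: iterable of (teacher_label, cheap_label) tuples taken from
    held-out replies (hypothetical format, adapt to your own trace export).
    Returns {teacher_label: (matches, total, agreement_rate)}.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for teacher_label, cheap_label in heldout_pairs:
        totals[teacher_label] += 1
        if cheap_label == teacher_label:
            matches[teacher_label] += 1
    return {
        label: (matches[label], totals[label], matches[label] / totals[label])
        for label in totals
    }
```

Buckets with high agreement on plenty of held-out replies are the candidates for routing. Buckets with low agreement or very few examples stay on the LLM.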
How much can you actually save
The savings equal the share of traffic you can certify, times the price gap between your teacher model and a small classifier. The classifier cost is close to zero next to a frontier call, so the certified share is the number that matters. The reference point on this site is the Obside case study, a clean classification stream where a 38-cell surrogate replaced the frontier call and saved 95 percent.
| Reply | Before | After |
|---|---|---|
| Certified reply | Frontier LLM call | Small classifier, near-zero |
| Rare or ambiguous | Frontier LLM call | Frontier LLM call (deferred) |
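The savings arithmetic above can be written out as a small calculation. This is a sketch under stated assumptions: the function name, the flat per-call prices, and the example numbers are all hypothetical, so substitute your own volumes and rates.

```python
def triage_cost(replies, certified_share, teacher_cost, classifier_cost=0.0):
    """Compare triage cost before and after routing the certified share.

    replies:          number of inbound replies in the period
    certified_share:  fraction of replies routed to the small classifier (0..1)
    teacher_cost:     cost of one frontier LLM call
    classifier_cost:  cost of one small-classifier call (assumed near zero)
    Returns (cost_before, cost_after, fraction_saved).
    """
    before = replies * teacher_cost
    after = replies * (
        certified_share * classifier_cost
        + (1 - certified_share) * teacher_cost
    )
    saved = (before - after) / before if before else 0.0
    return before, after, saved

# Hypothetical inputs: 10,000 replies, 60% certified, $0.01 per LLM call.
before, after, saved = triage_cost(10_000, 0.60, 0.01)
print(f"before={before:.2f} after={after:.2f} saved={saved:.0%}")
```

With a classifier cost of zero, the fraction saved equals the certified share, which is why that share is the number to watch.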
How do you prove quality holds
Each region carries a calibrated lower bound on how often the cheap path will match the teacher, computed on held-out replies. A region only routes to the small model when that bound clears the target you set. Everything else defers. You get an audit trail per region: the dominant label, real examples, and the error bound, so an inbox owner can see why a region is safe before any real traffic moves. For why this matters to your unit economics, see the AI margin problem.
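One common way to put a lower bound on an agreement rate is the Wilson score interval. The sketch below uses it purely as an illustration: the source does not say which bound TRACER computes, so both the choice of Wilson and the function names are assumptions.

```python
import math

def wilson_lower_bound(matches, total, z=1.96):
    """Lower end of the Wilson score interval for an agreement rate.

    matches / total is the observed agreement on held-out replies.
    z=1.96 corresponds to a two-sided 95% interval (assumed default).
    """
    if total == 0:
        return 0.0
    p = matches / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom

def is_certified(matches, total, target, z=1.96):
    """A region routes to the cheap path only if the bound clears the target."""
    return wilson_lower_bound(matches, total, z) >= target
```

The bound tightens as the held-out sample grows. A region with 98 matches out of 100 has a noticeably lower bound than one with 980 out of 1,000, even though the observed agreement is identical, so small regions defer until they have enough evidence.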
How to cut the cost, step by step
What you need: a few thousand recent replies, each paired with the label your LLM already produced. No hand-labelling. Your own traffic is the training signal.
- Export your reply-triage traces. Pull recent replies paired with the priority your LLM produced: hot reply now, warm followup, cold nurture, skip.
- Build the partition. Run pip install tracer-llm and fit on your traces. TRACER groups replies by the priority the LLM assigned, then learns where a new reply lands.
- Read the certified buckets. See which priority buckets clear your target agreement on held-out replies. Each region shows its dominant priority, real example replies, and its error bound.
- Activate the clear-cut buckets. Route the certified buckets to the small classifier and keep nuanced replies on the LLM. The out-of-distribution gate sends unfamiliar replies back to the teacher model.
- Meter and re-certify. Track live coverage, savings, and agreement. As your inbound mix changes, re-fit so the guarantee keeps holding.
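The activation and metering steps above amount to a routing decision plus a counter. The following is a minimal sketch and not the tracer-llm API: `triage`, `Meter`, and the callables `classify`, `is_familiar`, and `call_llm` are hypothetical names you would wire to your own classifier, out-of-distribution check, and LLM client.

```python
from dataclasses import dataclass

@dataclass
class Meter:
    """Counts how much traffic the cheap path handles (hypothetical helper)."""
    total: int = 0
    routed: int = 0

    def coverage(self):
        # Share of replies triaged by the small classifier so far.
        return self.routed / self.total if self.total else 0.0

def triage(reply, classify, is_familiar, certified_labels, call_llm, meter):
    """Return (priority, path) for one reply.

    The cheap path is used only when the reply looks like certified traffic
    AND the classifier's predicted bucket is certified. Everything else
    defers to the teacher LLM.
    """
    meter.total += 1
    if is_familiar(reply):
        label = classify(reply)
        if label in certified_labels:
            meter.routed += 1
            return label, "classifier"
    return call_llm(reply), "llm"
```

Keeping the deferral as the default branch means any failure of the gate or the classifier to qualify a reply falls back to the LLM rather than to a guess.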
The open-source library runs the whole flow locally. The hosted version adds a live meter, a savings estimate, and one-click activation once a region is certified, at app.tracerml.ai.
Frequently asked questions
Can a small model triage replies as well as an LLM?
For the clear-cut replies, yes. A hard opt-out or an obvious meeting request is unambiguous, and a small classifier handles it for near-zero cost. The replies that genuinely need judgment defer to the LLM. You only route a reply to the cheap model when a calibrated accuracy bound clears your target.
How much does reply triage with an LLM cost at scale?
Every inbound reply triggers a call to rank its urgency. At thousands of replies a day, the triage step alone becomes a meaningful line item. Moving the clear-cut share to a classifier removes that cost from the bulk of the volume.
Does this work with my own priority buckets?
Yes. TRACER learns whatever buckets your LLM already produces, for example hot reply now, warm followup, cold nurture, and skip. It uses your past decisions as the training signal, so the buckets match what your team already uses.
What happens to a reply the model has never seen before?
It defers to the LLM. An out-of-distribution gate routes anything that does not resemble certified traffic back to the teacher model, so a novel reply is never triaged by the cheap path on a guess.
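To make the idea of an out-of-distribution gate concrete, here is one simple way such a check could work: compare a reply's embedding to the centroids of certified regions and defer when it is too far from all of them. This is an illustrative assumption, not a description of how TRACER's gate is built, and the threshold value and function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def is_familiar(embedding, centroids, max_distance=0.3):
    """True if the reply sits close to at least one certified region centroid.

    max_distance is a hypothetical threshold; tune it on held-out replies.
    An empty centroid list means nothing is certified, so nothing is familiar.
    """
    return any(cosine_distance(embedding, c) <= max_distance for c in centroids)
```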
TRACER is open source. Run pip install tracer-llm, point it at your traces, and see which replies certify. The hosted version adds a live meter and one-click activation at app.tracerml.ai.