How to cut LLM costs on email reply prioritization

Sales and support teams run an LLM over every inbound reply to decide what to handle now: hot reply now, warm followup, cold nurture, skip. Most of those replies are obvious. Here is how to triage the obvious ones with a near-free classifier and keep the LLM for the judgment calls.

Short answer

Reply prioritization is high-volume and mostly repetitive. Route the clear-cut replies to a small classifier you have certified against your own LLM, defer the ambiguous ones, and you cut the cost of triaging the bulk of your inbox without changing the buckets your team works with.

Why prioritizing every reply with an LLM gets expensive

Inbox tools call a language model on each reply to rank how urgently a human should act. A hard unsubscribe is one call. A clear book-a-meeting is another. A flat not-interested is another. The model is doing real work, and most of that work is the same few rankings over and over. At inbox scale, the triage step alone is a standing bill.

Which replies are safe to triage with a cheap model

Group your past replies by the priority your LLM assigned, then check how consistent each group is on held-out replies. The unambiguous buckets, such as a hard opt-out or a clear meeting request, form tight regions that a small model can reproduce. The murky ones, such as a lukewarm reply that could be either a warm followup or a cold nurture, stay on the LLM.
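The consistency check above can be sketched in a few lines of Python. This is a minimal illustration, not TRACER's implementation: the function name `agreement_by_bucket`, the pair format, and the sample data are all assumptions made up for the example.

```python
from collections import defaultdict


def agreement_by_bucket(held_out_pairs):
    """Count how often a small model matches the teacher, per teacher bucket.

    held_out_pairs: iterable of (teacher_label, small_model_label) tuples.
    Returns {teacher_label: (matches, total)}.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [matches, total]
    for teacher_label, small_label in held_out_pairs:
        counts[teacher_label][1] += 1
        if teacher_label == small_label:
            counts[teacher_label][0] += 1
    return {label: (m, n) for label, (m, n) in counts.items()}


# Made-up held-out pairs, for illustration only.
held_out = [
    ("skip", "skip"), ("skip", "skip"), ("skip", "skip"),
    ("hot reply now", "hot reply now"), ("hot reply now", "hot reply now"),
    ("warm followup", "cold nurture"), ("warm followup", "warm followup"),
    ("cold nurture", "warm followup"),
]

for label, (matches, total) in agreement_by_bucket(held_out).items():
    print(f"{label}: {matches}/{total} agree with the teacher")
```

In this toy data the opt-out and meeting buckets agree every time, while the lukewarm buckets do not, which is the pattern the paragraph describes. Real held-out sets need far more replies per bucket before the counts mean anything.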

How much can you actually save

The savings equal the share of traffic you can certify, times the price gap between your teacher model and a small classifier. The classifier cost is close to zero next to a frontier call, so the certified share is the number that matters. The reference point on this site is the Obside case study, a clean classification stream where a 38-cell surrogate replaced the frontier call for a 95 percent saving.

Reply              Before             After
Certified reply    Frontier LLM call  Small classifier, near-zero
Rare or ambiguous  Frontier LLM call  Frontier LLM call (deferred)
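As a rough worked example of that savings rule, the sketch below multiplies the certified share by the price gap and expresses the result as a fraction of the original bill. The function name and every number in it are hypothetical placeholders, not measured prices or results.

```python
def savings_fraction(certified_share, teacher_cost, classifier_cost):
    """Fraction of the triage bill removed by routing the certified share.

    certified_share: share of replies routed to the small classifier (0 to 1).
    teacher_cost, classifier_cost: cost per reply, in the same currency unit.
    """
    if not 0.0 <= certified_share <= 1.0:
        raise ValueError("certified_share must be between 0 and 1")
    if teacher_cost <= 0:
        raise ValueError("teacher_cost must be positive")
    saved_per_reply = certified_share * (teacher_cost - classifier_cost)
    return saved_per_reply / teacher_cost


# Hypothetical inputs: 60% certified, $0.002 per LLM call,
# $0.00001 per classifier call.
example = savings_fraction(0.60, 0.002, 0.00001)
print(f"{example:.1%}")  # prints 59.7%
```

Because the classifier cost is so small next to the teacher cost, the result sits just under the certified share itself, which is why that share is the number to watch.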

How do you prove quality holds

Each region carries a calibrated lower bound on how often the cheap path will match the teacher, computed on held-out replies. A region only routes to the small model when that bound clears the target you set. Everything else defers. You get an audit trail per region: the dominant label, real examples, and the error bound, so an inbox owner can see why a region is safe before any real traffic moves. For why this matters to your unit economics, see the AI margin problem.
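To make the idea of a lower bound concrete, here is a minimal sketch using the one-sided Wilson score bound, a common choice for a binomial agreement rate. The post does not say which bound TRACER computes, so treat the method, the function names, and the default confidence level as assumptions.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(matches, total, confidence=0.95):
    """One-sided Wilson score lower bound on the true agreement rate."""
    if total == 0:
        return 0.0  # no evidence, so nothing can be certified
    z = NormalDist().inv_cdf(confidence)  # one-sided normal quantile
    p_hat = matches / total
    denom = 1 + z * z / total
    center = p_hat + z * z / (2 * total)
    margin = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z * z / (4 * total * total)
    )
    return (center - margin) / denom


def region_is_certified(matches, total, target, confidence=0.95):
    """A region routes to the small model only if the bound clears the target."""
    return wilson_lower_bound(matches, total, confidence) >= target


# Hypothetical held-out counts for two regions, with a 0.97 target.
print(region_is_certified(990, 1000, target=0.97))  # large, consistent region
print(region_is_certified(20, 20, target=0.97))     # perfect but tiny region
```

The second call is the instructive one: 20 matches out of 20 looks perfect, but the sample is too small for the bound to clear a 0.97 target, so the region defers. That is the difference between a raw agreement rate and a bound.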

How to cut the cost, step by step

What you need: a few thousand recent replies, each paired with the label your LLM already produced. No hand-labelling. Your own traffic is the training signal.

  1. Export your reply-triage traces

    Pull recent replies paired with the priority your LLM produced: hot reply now, warm followup, cold nurture, skip.

  2. Build the partition

    Run pip install tracer-llm and fit on your traces. TRACER groups replies by the priority the LLM assigned, then learns where a new reply lands.

  3. Read the certified buckets

    See which priority buckets clear your target agreement on held-out replies. Each region shows its dominant priority, real example replies, and its error bound.

  4. Activate the clear-cut buckets

    Route the certified buckets to the small classifier and keep nuanced replies on the LLM. The out-of-distribution gate sends unfamiliar replies back to the teacher model.

  5. Meter and re-certify

    Track live coverage, savings, and agreement. As your inbound mix changes, re-fit so the guarantee keeps holding.
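Steps 4 and 5 can be pictured as a small routing function plus a coverage count. The sketch below does not use the tracer-llm API; `Region`, `route`, `toy_assign_region`, the keyword rules, and the bounds are hypothetical stand-ins chosen only to show the control flow.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str          # dominant priority in this region
    lower_bound: float  # certified lower bound on teacher agreement


def route(reply, assign_region, regions, target, call_teacher):
    """Return (priority, path), where path is 'small' or 'deferred'."""
    region_id = assign_region(reply)  # None means out-of-distribution
    if region_id is None or region_id not in regions:
        return call_teacher(reply), "deferred"
    region = regions[region_id]
    if region.lower_bound >= target:
        return region.label, "small"
    return call_teacher(reply), "deferred"


def toy_assign_region(reply):
    """Hypothetical stand-in for a learned partition: keyword lookup."""
    text = reply.lower()
    if "unsubscribe" in text:
        return "opt_out"
    if "book a meeting" in text:
        return "meeting"
    if "maybe" in text:
        return "lukewarm"
    return None


def fake_teacher(reply):
    """Placeholder for the real LLM call."""
    return "warm followup"


# Illustrative regions and bounds, not real results.
regions = {
    "opt_out": Region("skip", 0.99),
    "meeting": Region("hot reply now", 0.98),
    "lukewarm": Region("warm followup", 0.80),
}

replies = [
    "Please unsubscribe me.",
    "Can we book a meeting next week?",
    "Maybe later this year.",
    "Forwarding this to our legal team.",
]
results = [
    route(r, toy_assign_region, regions, 0.95, fake_teacher) for r in replies
]
coverage = sum(path == "small" for _, path in results) / len(results)
print(results)
print(f"live coverage: {coverage:.0%}")
```

Two of the four toy replies land in certified regions and take the cheap path. The lukewarm reply defers because its region misses the target, and the unfamiliar one defers because no region claims it. Tracking that coverage figure over time is one simple signal for when to re-fit.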

The open-source library runs the whole flow locally. The hosted version adds a live meter, a savings estimate, and one-click activation once a region is certified, at app.tracerml.ai.

Frequently asked questions

Can a small model triage replies as well as an LLM?

For the clear-cut replies, yes. A hard opt-out or an obvious meeting request is unambiguous, and a small classifier handles it for near-zero cost. The replies that genuinely need judgment defer to the LLM. You only route a reply to the cheap model when a calibrated accuracy bound clears your target.

How much does reply triage with an LLM cost at scale?

Every inbound reply triggers a call to rank its urgency. At thousands of replies a day, the triage step alone becomes a meaningful line item. Moving the clear-cut share to a classifier removes that cost from the bulk of the volume.

Does this work with my own priority buckets?

Yes. TRACER learns whatever buckets your LLM already produces, for example hot reply now, warm followup, cold nurture, and skip. It uses your past decisions as the training signal, so the buckets match what your team already uses.

What happens to a reply the model has never seen before?

It defers to the LLM. An out-of-distribution gate routes anything that does not resemble certified traffic back to the teacher model, so a novel reply is never triaged by the cheap path on a guess.

TRACER is open source. Run pip install tracer-llm, point it at your traces, and see which replies certify. The hosted version adds a live meter and one-click activation at app.tracerml.ai.
