Cut LLM cost on repeated workflows

A lightweight ML model for your repeated LLM calls.

Trained on the LLM traces you already have. Answers the repeated calls for near-zero cost. Your prompts and providers stay the same.

[Figure: Tracer dashboard, "Your traffic, carved into routable zones". Density heatmap of 38 partition cells: blue zones are ML-handled traffic, orange cells defer to a cheap LLM menu, and the red zone defers to the frontier teacher. Legend: lightweight ML · cheap LLM · frontier defer.]
95% of frontier calls removed · 99.9% routed accuracy
The problem

Your product works. The inference margin does not.

Repeated decisions don't need a frontier model. Tracer finds those calls and handles them with a lightweight classifier, without changing your providers or your prompts.

Use cases

One frontier-LLM call, repeated thousands of times a day, with structured output.

classify · score · moderate · route · match · screen · triage · validate

If you can write "we use a frontier LLM to [verb above] something that happens thousands of times per day," Tracer fits. What matters is the shape of the workflow. Live at Obside, getclaw, and three more in production.

How it works

Take the repeated decisions off the frontier model.

Tracer learns from your traces and decides, per cluster, whether a lightweight ML model is enough. Predictable calls leave the LLM stack entirely. Ambiguous ones stay on your provider, untouched. Every offload is parity-gated against your teacher.

Of 100% incoming traffic:
Predictable · 62% → surrogate (lightweight ML) · ~$0 / call
Mid-difficulty · 24% → smaller LLM (optional tier)
Ambiguous / OOD · 14% → teacher LLM · full cost
Distribution above is illustrative · actual coverage is measured per workload.

Cost & latency

The two metrics that matter, both crushed.

Per-call cost and per-call latency are how a routing layer is judged in production. TRACER cuts both by orders of magnitude on the predictable slice, without giving up accuracy, because the parity gate is doing the work.

Cost per call on the predictable slice
teacher LLM · $0.005 · ~$5 / 1,000 calls
ML surrogate · $0.000001 · ~$1 / 1,000,000 calls
On the routed 90%: ~5,000× cheaper per call

Latency per call (P50, no network jitter)
teacher LLM · ~820 ms · API round-trip + generation
ML surrogate · <10 ms · CPU classifier, co-located
On the routed 90%: ~80× faster, often sub-10 ms

Numbers above are demo-workload measurements (Banking77, gpt-5.5 teacher, BGE-M3 + linear surrogate). Your mileage scales with how predictable your traffic actually is.
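
The ratios quoted above follow from the per-call figures. A minimal Python sketch of that arithmetic: the `blended_cost_per_call` helper and its `coverage` parameter are illustrative names, not part of the Tracer SDK, and the sketch assumes every call that is not routed is paid at the full teacher price.

```python
# Per-call figures copied from the demo-workload numbers above.
TEACHER_COST = 0.005         # USD per call
SURROGATE_COST = 0.000001    # USD per call
TEACHER_P50_MS = 820.0
SURROGATE_P50_MS = 10.0      # upper bound: the page says "<10 ms"


def blended_cost_per_call(coverage: float) -> float:
    """Average cost per call when `coverage` of traffic goes to the surrogate.

    Assumption: everything that is not routed is paid at the teacher price.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return coverage * SURROGATE_COST + (1.0 - coverage) * TEACHER_COST


cost_ratio = TEACHER_COST / SURROGATE_COST          # roughly 5,000x
latency_ratio = TEACHER_P50_MS / SURROGATE_P50_MS   # roughly 82x or more

print(f"cost ratio on the routed slice: {cost_ratio:,.0f}x")
print(f"latency ratio on the routed slice: at least {latency_ratio:.0f}x")
print(f"blended cost at 90% coverage: ${blended_cost_per_call(0.90):.6f} per call")
```

The 5,000× figure applies to the routed slice only. At 90% coverage the blended cost is about one tenth of the teacher-only cost, because the deferred 10% still pays full price.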

How it works

Your LLM traces become free training data.

Every classification call your LLM already makes is a labeled input-output pair. TRACER fits a surrogate on those, calibrates a confidence gate, and only routes when the gate is happy.

01

Log traces

Every LLM classification call produces a labeled input-output pair, already in your logs. No manual labeling.

02

Fit a surrogate

tracer.fit() trains ML candidates, picks the best, and calibrates a confidence gate against the teacher.

03

Route and save

Easy inputs handled by the surrogate (free). Hard inputs deferred to the teacher (paid). Coverage compounds.
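
To make step 01 concrete, here is a rough Python sketch of turning logged calls into labeled pairs. The JSONL field names `input` and `label` are an assumption made for illustration (the trace schema is not specified on this page), and `load_pairs` is a hypothetical helper, not an SDK function.

```python
import io
import json
from collections import Counter


def load_pairs(lines):
    """Parse JSONL trace lines into (input, label) pairs, skipping bad rows.

    Assumed schema: one JSON object per line with string fields
    "input" and "label". Adjust the keys to match your own logs.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skip it rather than abort the load
        if not isinstance(record, dict):
            continue
        text, label = record.get("input"), record.get("label")
        if isinstance(text, str) and isinstance(label, str):
            pairs.append((text, label))
    return pairs


# In-memory stand-in for a traces.jsonl file.
sample = io.StringIO(
    '{"input": "How do I change my PIN?", "label": "change_pin"}\n'
    '{"input": "What is my balance?", "label": "check_balance"}\n'
    'not json\n'
    '{"input": "I forgot the PIN of my card", "label": "change_pin"}\n'
)
pairs = load_pairs(sample)
label_counts = Counter(label for _, label in pairs)
print(len(pairs), dict(label_counts))
```

Counting labels before fitting is a cheap way to spot classes with too few traces to learn from.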

Create your Tracer

Pick task → upload traces → see savings → ship.

setup wizard · under 10 min

Hosted Tracer gives you an HTTPS endpoint and an audit dashboard. Call it directly, or wire it into your agent through a native plugin for the harness you already use. Your provider keys stay with you.

01 · Pick task

Define labels

Classification, scoring, moderation, routing, matching. Tell Tracer what the LLM is currently deciding.

02 · Upload traces

Use your logs

JSONL upload, warehouse connector, or sample dataset. Every existing LLM call becomes a labeled training pair.

03 · Savings estimate

See the offload

Tracer measures, per cluster, how much traffic can leave your LLM provider at your quality bar.

04 · Deploy

Live endpoint

Production HTTPS endpoint, API key, audit dashboard. cURL it from anywhere or wire the agent plugin.

your endpoint, after setup
curl -X POST https://api.tracerml.ai/{project}/classify \
  -H "Authorization: Bearer trc_..." \
  -H "Content-Type: application/json" \
  -d '{"input":"How do I change my PIN?"}'

# example response
{"label": "change_pin", "decision": "handled", "accept_score": 0.96, "model": "tracer_surrogate"}
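
The same call can be sketched from Python with only the standard library. This is an illustration under assumptions: the `deferred` decision value is borrowed from the Node.js quickstart snippet on this page, whether the hosted endpoint resolves deferred calls itself is not stated here, and `build_request`, `send`, and `resolve_label` are hypothetical helper names.

```python
import json
import urllib.request


def build_request(url: str, api_key: str, text: str) -> urllib.request.Request:
    """Build the same POST as the curl example, without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps({"input": text}).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON body (network call, not run here)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def resolve_label(result: dict, text: str, call_llm) -> str:
    """Keep the Tracer label when the call was handled, else ask your own LLM."""
    if result.get("decision") == "handled" and "label" in result:
        return result["label"]
    return call_llm(text)


# Offline check against the response shown above; no request is sent.
handled = {"label": "change_pin", "decision": "handled",
           "accept_score": 0.96, "model": "tracer_surrogate"}
deferred = {"decision": "deferred"}    # assumed shape of a deferred response
fake_llm = lambda text: "llm_label"    # stand-in for your provider call

print(resolve_label(handled, "How do I change my PIN?", fake_llm))
print(resolve_label(deferred, "Something unusual", fake_llm))
```

Keeping the fallback in one small function means the rest of your pipeline sees a label either way and does not need to know which tier produced it.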

Integrate

One endpoint, or a native plugin for your agent.

As an endpoint · live

HTTP, OpenAI-compatible

Drop a single URL into your stack. Each request returns a label, an accept-score, and a decision. Use it anywhere you currently POST to an LLM for a structured answer: support inbox, moderation pipeline, RAG pre-filter, extraction worker.

endpoint: POST /classify · returns label + accept_score + decision

In your agent loop · live

Native plugin or in-process handle

Wire Tracer into the step where your agent currently calls an LLM to pick the next tool. Native plugins for popular agent frameworks; everything else gets an OpenAI-compatible endpoint or an in-process Python handle. No agent rewrite.

plugins: Hermes · LangGraph · CrewAI · your own

In production

We replaced tool-calling with ML in Hermes. Cost dropped 50%.

We shipped a native plugin for Hermes (an open-source agent framework) that routes tool selection through a Tracer classifier. End-to-end agent cost dropped about 50% on the measured traces, with no quality degradation. In this harness, tool selection was the dominant cost line; that share varies across agent stacks.

live · hermes + tracer
Case · Hermes (agentic)

Same agent. ~50% cheaper.

Tool selection inside an agent is a classification problem. Once the agent has run for a while, the same tool-call decisions repeat. TRACER learns them, gates by parity, and routes the predictable ones to a lightweight classifier. For free.

E2E cost −50%
Tool-call latency ↓ ↓ ↓
Running in production inference at getclaw.sh
Read the full case study →

Trust

Every offload is explained, and verifiable.

Tracer doesn't ask you to trust a black-box score. Each routing decision is grounded in a cluster you can read, a measured per-model track record you can verify, and a parity gate that refuses to route when the surrogate isn't ready.

Cluster card · one of 60 clusters in this workload

Read what the surrogate handles.

Every cluster card shows what kind of queries it covers, how each model performs on them, and which model TRACER picked. A domain expert can read it and predict the routing decision without running the system.

Handled by ML surrogate
PIN & card operations
"How do I change my PIN?"

Per-model accuracy on this cluster
How often each model picks the right label on these queries. The cheapest model that meets the parity bar wins.
surrogate · 96%
mini · 97%
teacher · 98%

Similar queries in this cluster
  • "How do I reset my PIN?"
  • "I forgot the PIN of my card"
  • "Can I pick a new PIN online?"
  • "My PIN is wrong, what do I do?"
  • "Update PIN for my Apple Pay card"
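
The "cheapest model that meets the parity bar wins" rule on this card can be sketched in a few lines of Python. The absolute bar of 0.95 is an assumption for illustration (the page does not say whether the bar is absolute or relative to the teacher), and `pick_model` is a hypothetical helper, not an SDK call.

```python
def pick_model(accuracy_by_model, cost_order, parity_bar):
    """Return the cheapest model whose cluster accuracy meets the bar.

    cost_order lists model names from cheapest to most expensive.
    If nothing clears the bar, fall back to the last (most capable) model.
    """
    for name in cost_order:
        if accuracy_by_model[name] >= parity_bar:
            return name
    return cost_order[-1]


# Accuracies from the cluster card above.
cluster_accuracy = {"surrogate": 0.96, "mini": 0.97, "teacher": 0.98}
cost_order = ["surrogate", "mini", "teacher"]

print(pick_model(cluster_accuracy, cost_order, parity_bar=0.95))
print(pick_model(cluster_accuracy, cost_order, parity_bar=0.965))
```

With the assumed 0.95 bar the surrogate is selected for this cluster; raising the bar to 0.965 would move the cluster to the mini tier.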

Partition map · 60 clusters · grouped by topic

See where each query lands.

Every cluster projected to 2D, colored by routing tier and labeled by topic. Live queries pin onto the map. Out-of-scope queries land in the red region and escalate to the teacher automatically.

Legend:
surrogate · 38 clusters · 62%
mini · 14 clusters · 24%
teacher · 8 clusters · 14%
live query
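
One way to picture how a live query lands on the map is a nearest-centroid lookup with a distance cutoff for out-of-scope inputs. This is a stand-in technique chosen for illustration, since the page does not describe how Tracer actually partitions traffic. The centroids, topics other than the PIN cluster, and the `max_distance` cutoff are all made up.

```python
import math

# Toy 2D centroids; tiers and most topics are illustrative, not real workload data.
CLUSTERS = [
    {"topic": "PIN & card operations", "centroid": (0.0, 0.0), "tier": "surrogate"},
    {"topic": "Disputed transfers", "centroid": (4.0, 0.0), "tier": "mini"},
    {"topic": "Complex complaints", "centroid": (0.0, 4.0), "tier": "teacher"},
]


def route_query(point, clusters=CLUSTERS, max_distance=1.5):
    """Assign a 2D point to its nearest cluster, or escalate if it is too far."""
    nearest = min(clusters, key=lambda c: math.dist(point, c["centroid"]))
    distance = math.dist(point, nearest["centroid"])
    if distance > max_distance:
        # Out of scope: no cluster is close enough, so defer to the teacher.
        return {"cluster": None, "tier": "teacher", "distance": distance}
    return {"cluster": nearest["topic"], "tier": nearest["tier"], "distance": distance}


print(route_query((0.2, -0.1)))   # near the PIN cluster
print(route_query((3.6, 0.3)))    # near the mini-tier cluster
print(route_query((9.0, 9.0)))    # far from everything: escalates
```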

Pricing

Pay only when we measurably save you money.

No seat licenses. No markup on inference. Your provider invoices stay between you and your provider. We only get paid against savings we can prove on your traffic.

Zero markup on LLMs

Tracer sits on the inference path so we can measure savings. Provider tokens pass through at cost. We make zero margin on inference.

Bring your own keys

Plug in the provider keys you already use. Switch providers anytime. Tracer's offload doesn't lock you into one model menu.

20% of verified savings

We measure how much frontier-LLM spend Tracer removes on your workload. You pay 20% of that amount. What you still spend with your provider is billed by the provider, unchanged.

Open source, always free

The Tracer SDK (pip install tracer-llm) is MIT-licensed. Self-host the whole routing core if you'd rather not pay anything.
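
As a worked example of the 20% model, the sketch below uses made-up spend figures for an unspecified billing period. `tracer_fee` is a hypothetical helper, and how the baseline is measured in practice is not described here.

```python
def tracer_fee(baseline_spend: float, spend_with_tracer: float,
               fee_rate: float = 0.20) -> float:
    """Fee as a share of verified savings; no savings means no fee."""
    verified_savings = max(baseline_spend - spend_with_tracer, 0.0)
    return verified_savings * fee_rate


# Hypothetical numbers, in USD, for one billing period.
baseline = 10_000.0      # frontier-LLM spend without Tracer
with_tracer = 2_000.0    # provider spend that remains after offload

fee = tracer_fee(baseline, with_tracer)
net_saving = baseline - with_tracer - fee

print(f"fee: ${fee:,.2f}")
print(f"net saving after fee: ${net_saving:,.2f}")
```

In this made-up case the verified savings are $8,000, the fee is 20% of that, and the remaining 80% stays with you.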

vs. everything else

How is this different from caching, smaller LLMs, or model routers?

Most LLM cost tools keep the request inside the LLM cost structure. TRACER routes predictable slices out of it entirely, gated by measured parity so you never silently degrade quality.

| Approach | What it does | Where it falls short |
| --- | --- | --- |
| Caching | Reuses identical responses | Only works when requests repeat exactly. |
| Prompt optimization | Cuts tokens per call | Request still goes through the LLM. |
| Smaller LLMs | Cheaper per call | Still orders of magnitude more than CPU-class ML at high volume. |
| Fine-tuning | Specializes one model | Heavier to maintain. Still inside the LLM cost structure. |
| Model routers | Picks which LLM | Never asks "do we need an LLM at all?" |
| TRACER | Routes predictable slices to lightweight ML | Customer-trained. Parity-gated. Interpretable. |

Safety

Parity gate · deploy only when safe

The surrogate goes live only when its agreement with the teacher exceeds your threshold on held-out data. If the task is too hard (hallucination detection, compositional NLI), TRACER correctly refuses to route. No silent degradation.

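import tracer

# X: precomputed embeddings for the traces in traces.jsonl
# (assumed here to be an array with one row per trace)

# the gate checks before promoting
result = tracer.fit(
    "traces.jsonl",
    embeddings=X,
    config=tracer.FitConfig(
        target_teacher_agreement=0.95
    ),
)
# result.manifest.coverage_cal -> 0.92
# result.manifest.method        -> "l2d"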
Continual learning

The teacher-trace flywheel

Every deferred LLM call produces a new labeled trace at no extra cost. tracer.update() refits the surrogate on the expanded dataset. Coverage compounds: in the demo workload, by day 4 the surrogate handles 100% of in-distribution traffic.

Coverage by day · demo workload
Day 1 · 43%
Day 2 · 98%
Day 4 · 100%
Each deferred LLM call becomes a new training trace. The cold-start gap closes itself.
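
The loop itself can be illustrated with a deliberately crude toy in Python. A lookup table of previously seen inputs stands in for the surrogate, so it only memorizes, whereas a real surrogate generalizes to unseen inputs. `simulate_flywheel`, `toy_teacher`, and the batches are all invented for this sketch.

```python
def simulate_flywheel(daily_batches, teacher):
    """Toy flywheel: deferred calls become traces that raise later coverage."""
    known = {}              # toy "surrogate": memorized input -> teacher label
    coverage_by_day = []
    for batch in daily_batches:
        if not batch:
            continue
        handled = 0
        new_traces = {}
        for text in batch:
            key = text.strip().lower()
            if key in known:
                handled += 1                     # answered without the teacher
            else:
                new_traces[key] = teacher(text)  # deferred call, new labeled trace
        coverage_by_day.append(handled / len(batch))
        known.update(new_traces)                 # stand-in for the refit step
    return coverage_by_day


def toy_teacher(text):
    """Stand-in for the teacher LLM."""
    return "change_pin" if "pin" in text.lower() else "other"


batches = [
    ["reset pin", "check balance"],
    ["reset pin", "block card"],
    ["Reset PIN", "block card", "check balance"],
]
print(simulate_flywheel(batches, toy_teacher))  # [0.0, 0.5, 1.0]
```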

Quickstart · OSS

Five minutes to your first routing policy.

Install, run the demo, fit on your traces, serve. No labeling pipeline, no fine-tuning job. Open source, MIT-licensed.

01 · Install
pip install tracer-llm
02 · Demo
tracer demo
03 · Fit
tracer fit traces.jsonl --target 0.95
04 · Serve
tracer serve .tracer --port 8000
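
Python
import tracer

# X (trace embeddings) and embedder (the object that embeds new queries)
# are supplied by you; neither is defined in this snippet.
result = tracer.fit("traces.jsonl", embeddings=X)
router = tracer.load_router(".tracer", embedder=embedder)
out = router.predict("What is my balance?")
# {"label": "check_balance", "decision": "handled", "accept_score": 0.96}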
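
JavaScript / Node.js
// embedding (the query vector) and text (the raw query) come from your code
let { label, decision } = await fetch('http://localhost:8000/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ embedding })
}).then(r => r.json())

// label is declared with let because it is reassigned on deferral
if (decision === 'deferred') label = await callYourLLM(text)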

FAQ

Common questions.

What is TRACER?

TRACER is an open-source routing layer that trains a lightweight machine-learning surrogate on your LLM's own production classification traces. It routes the predictable 90 percent of traffic to the surrogate (near-zero cost) and defers only the hard 10 percent back to the LLM. Available as a Python SDK (pip install tracer-llm) or as a one-click hosted endpoint.

How do I reduce LLM costs?

To reduce LLM costs in production, route only the requests that genuinely need an LLM. Most production LLM workloads are repetitive classification tasks (intent detection, content moderation, support triage, tool selection). TRACER trains a small ML surrogate on your existing LLM traces and routes the predictable 90% of traffic to that surrogate at near-zero cost, deferring only the hard 10% back to the LLM. On the demo workload, the routed slice was about 5,000× cheaper per call and about 80× faster, with a parity gate that routes only while measured agreement with the teacher stays above your threshold. No fine-tuning, no manual labeling required.

What is LLM routing?

LLM routing sends each request to the cheapest model that can answer it correctly. Most model routers pick which LLM to call (frontier vs smaller LLM). Tracer routes predictable requests out of the LLM stack entirely, into a lightweight ML surrogate trained on your own production traces. Routing is gated by measured agreement with your teacher LLM, so quality stays above your threshold. Available as tracer-llm on PyPI or as a hosted multi-tier routing endpoint.

How much does TRACER reduce LLM cost?

On the Banking77 benchmark with 10,000 daily classification calls, TRACER offloaded 92.2 percent of traffic to a lightweight ML surrogate at 0.961 teacher agreement, cutting per-day cost from $44.50 to $3.47, about $14,976 saved per year. Actual savings depend on your workload's predictability; the more repetitive the traffic, the larger the saving.
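
The figures in this answer can be checked with a few lines of arithmetic. The sketch assumes the surrogate's own cost is negligible and that deferred calls are paid at the original per-call rate; the variable names are illustrative.

```python
DAILY_CALLS = 10_000
DAILY_COST_TEACHER_ONLY = 44.50   # USD per day, from the answer above
OFFLOAD_RATE = 0.922              # share of traffic sent to the surrogate

# Assumption: surrogate cost is negligible, deferred calls pay the full rate.
daily_cost_with_tracer = DAILY_COST_TEACHER_ONLY * (1.0 - OFFLOAD_RATE)
daily_saving = DAILY_COST_TEACHER_ONLY - daily_cost_with_tracer
annual_saving = daily_saving * 365
implied_cost_per_call = DAILY_COST_TEACHER_ONLY / DAILY_CALLS

print(f"daily cost with offload: ${daily_cost_with_tracer:.2f}")
print(f"annual saving: about ${annual_saving:,.0f}")
print(f"implied teacher cost per call: ${implied_cost_per_call:.5f}")
```

The result matches the quoted $3.47 per day and roughly $14,976 per year, and the implied per-call price of about $0.0045 is in line with the ~$0.005 teacher cost quoted in the cost section.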

How is TRACER different from a model router or smaller LLM?

Most LLM cost tools keep the request inside the LLM cost structure: caching only works on exact repeats, prompt optimization shaves tokens, smaller LLMs are still orders of magnitude more expensive than CPU-class ML, and model routers only pick which LLM to call. TRACER routes predictable slices out of the LLM stack entirely, gated by measured agreement (parity) with your teacher LLM so quality never silently degrades.

How does TRACER guarantee quality on the routed traffic?

TRACER deploys a parity gate: the surrogate goes live only when its agreement with the teacher LLM exceeds your threshold (for example 0.95) on held-out calibration data. If a workload is too hard, TRACER refuses to route it and everything stays on the LLM. Every routing decision exposes the matched cluster, the per-model accuracy on that cluster, and the confidence bound, fully auditable.

What kinds of workloads does TRACER work for?

TRACER targets repetitive LLM classification workloads: intent classification, content moderation, compliance scanning, support triage, document extraction, eval pipelines, and per-step tool selection in agentic workflows. Anywhere the same kinds of decisions happen many times a day, TRACER finds the predictable slices.

How long does it take to deploy TRACER?

On the hosted version, the setup wizard is six steps: pick your task, point to your traces, choose embeddings, pick your model menu, set a quality target, and get a live HTTPS endpoint. The build runs in the background and takes minutes (not days) depending on dataset size. With the open-source SDK, the equivalent is pip install tracer-llm followed by tracer fit traces.jsonl --target 0.95 and tracer serve.

Is TRACER open source?

Yes. The TRACER routing core is MIT-licensed and available on GitHub at github.com/adrida/tracer and on PyPI as tracer-llm. The hosted version layers managed infrastructure (managed embeddings, hosted endpoint, monitoring, audit dashboard) on top of the same OSS core.

Do I need to label my training data?

No. Every classification call your LLM already makes is a labeled (input, output) pair already in your logs. Tracer fits the surrogate directly on these traces with no manual labeling. As traces accumulate the surrogate refits and coverage compounds: 43% on day 1, 98% on day 2, 100% by day 4 in the demo workload.

How do AI SDR, sales-AI, and GTM tools reduce LLM costs with Tracer?

AI SDR and GTM platforms hit the same frontier-LLM call millions of times: lead scoring, intent classification on inbound replies, account triage, outbound-personalization categorization. Tracer trains a lightweight classifier on the calls your stack is already paying for and answers the repeated ones for near-zero cost. Typical impact on the routed slice: 70 to 95 percent lower per-call cost, sub-10ms latency, and a parity gate that refuses to route when the workload is too hard. Your prompts and providers stay the same.

How do AI HR and recruitment platforms cut LLM inference cost?

Resume screening, candidate-job matching, and skill extraction are exactly the workflow shape Tracer is built for: one structured LLM decision, repeated across every applicant and every requisition. Tracer learns from your existing teacher-LLM traces, routes the predictable slice to a lightweight ML surrogate, and defers ambiguous cases back to the LLM. Live deployments in production at Obside (intent-news matching, 95% saved vs GPT-5) and getclaw (agent tool selection, ~50% end-to-end cut) show the pattern transfers across verticals.

How do agentic systems and AI agents reduce per-step LLM cost?

Inside an agent loop, tool selection, planner-executor routing, and safety classification are repeated decisions an LLM does not need to make every time. Tracer ships a native plugin for Hermes (and an OpenAI-compatible endpoint for any other harness) that routes these decisions through a lightweight classifier trained on your agent's own traces. In production at getclaw.sh, end-to-end agent cost dropped about 50 percent on measured traces with no quality degradation.

What is the best way to reduce LLM cost on high-volume moderation, compliance, and screening?

Content moderation, abuse detection, KYC scoring, and compliance scanning all share the same shape: a single LLM verdict, repeated thousands of times per day. Tracer trains a small classifier on your existing moderation traces, routes the obvious cases locally for free, and reserves the frontier LLM for ambiguous or out-of-distribution inputs. The parity gate guarantees the surrogate only ships when measured agreement with your teacher clears your threshold (e.g. 0.95), so quality stays above the bar.

Who is behind Tracer AI?

Tracer AI is built by Adam Rida and the DeepRecall team. Adam holds a PhD in machine learning and is a repeat founder. The Tracer routing method is documented in a research paper featured on Hugging Face (huggingface.co/papers/2604.14531, arXiv 2604.14531). Tracer is already running in production at Obside (French fintech, automated trading) and getclaw (agent infrastructure), with three additional deployments announcing soon.

How is Tracer different from fine-tuning, OpenAI distillation, or running a smaller LLM?

Fine-tuning and OpenAI's distillation keep the request inside the LLM cost structure: you still pay per-token rates on a model that needs a GPU. A smaller LLM is still orders of magnitude more expensive than a CPU-class classifier at scale. Tracer routes the predictable slice out of the LLM stack entirely, into a lightweight ML surrogate trained on your own traces. The parity gate measures surrogate-teacher agreement and refuses to route when the workload does not meet your quality threshold, so quality stays above the bar.

What does Tracer AI cost? Is there a markup on LLM calls?

Zero markup on LLM inference. Tracer sits on the inference path so we can measure savings, but provider tokens pass through at cost. You bring your own keys (OpenAI, Anthropic, Bedrock, self-hosted) and switch providers anytime. Tracer charges 20 percent of the frontier-LLM spend we measurably remove from your bill. No seat licenses. The open-source SDK (pip install tracer-llm) is MIT-licensed and free forever.

Find out what your inference bill could be.

Send a sample of traces. We'll come back with a per-cluster offload estimate and a verified-savings projection. No commitment, no provider switching, no markup.

See how much you're wasting →