Tool selection is classification in disguise. So we replaced it.
getclaw runs an agent built on Hermes (an open-source agent framework). For their workload, tool-selection calls were the dominant cost line. We shipped a native Hermes plugin that routes those calls through a TRACER classifier instead of the LLM. End-to-end agent cost dropped ~50% with no degradation. Cost-mix varies by harness; tool-selection is a very common offender.
measured on real traces
measured traces
no agent rewrite
In Hermes, decisions dominated the bill.
When we instrumented getclaw's loop call-by-call, tool-selection calls were the largest cost line, not the user-facing generation. This isn't universal across agent harnesses, the mix varies with how each framework structures reasoning, retrieval, and generation. But it was the dominant pattern here, and it's a very common one.
Multiply by the number of steps per task, the number of tasks per day, and the price per token. Tool-selection is where agent inference budgets quietly bleed out, and where the same kinds of decisions repeat thousands of times.
A native Hermes plugin, registered through Hermes's own interface
Not a wrapper. We built TRACER as a native plugin in Hermes's plugin interface, the same surface Hermes uses for its own internals. Where the framework previously called an LLM to pick the next tool, it now calls the TRACER plugin first, and falls back to the LLM only below the parity threshold:
# before — LLM picks the tool tool_name = llm.complete( prompt=tool_selection_prompt(state), tools=available_tools, ).tool # after — Hermes plugin routes through TRACER first, defers hard cases decision = hermes.plugins.tracer.route(state) if decision.confidence >= 0.95: tool_name = decision.tool # local classifier · ~free else: tool_name = llm.complete(...).tool # defer to teacher LLM
Same classifier core, same parity gate. The agent loop is unchanged. The LLM is still in the stack, it just stops handling decisions a small ML model can handle.
Agents are the perfect TRACER workload
Three properties make agentic tool selection unusually well-suited:
Same agent. Same quality. Half the cost.
On the measured traces, end-to-end agent cost dropped ~50% with no observable degradation. The savings compound: every deferred LLM call becomes a new (state, action) trace that retrains the classifier and lets it handle the next case locally.
"Tool selection turned out to be where most of our budget was going. Tracer let us treat it like the classification problem it always was."
Different framework? Different integration surface.
For Hermes we shipped a native plugin. For other agent stacks (LangGraph, CrewAI, custom loops) TRACER ships as an OpenAI-compatible endpoint or an in-process Python handle. Whatever your agent uses to pick the next tool, the routing shape is the same. Start by measuring which calls actually dominate your bill.