Choose · Agentic · 2026 getclaw.sh

Tool selection is classification in disguise. So we replaced it.

getclaw runs an agent built on Hermes (an open-source agent framework). For their workload, tool-selection calls were the dominant cost line. We shipped a native Hermes plugin that routes those calls through a TRACER classifier instead of the LLM. End-to-end agent cost dropped ~50% with no observable degradation on the measured traces. The cost mix varies by harness, but tool selection is a very common offender.

E2E agent cost: −50% (end-to-end, in production; measured on real traces)
Quality delta: 0 (no degradation on the measured traces)
Integration: 1 plugin (native Hermes plugin; no agent rewrite)
Hermes + TRACER · tool selection routed through a local classifier · agent loop unchanged
The hidden cost — at getclaw

In Hermes, decisions dominated the bill.

When we instrumented getclaw's loop call by call, tool-selection calls were the largest cost line, ahead of the user-facing generation. This isn't universal across agent harnesses: the mix varies with how each framework structures reasoning, retrieval, and generation. But it was the dominant pattern here, and it's a very common one.

Step 1 "Which tool should I use?" LLM call. Output: one of N tools. Pure classification over a small action space.
Step 2 "Do I have what I need?" LLM call. Output: yes/no/needs-clarification. Three-class triage decision.
Step 3 "Should I escalate?" LLM call. Output: continue/handoff/abort. Routing over the next-action space.
Step N "Is this done?" LLM call. Output: done/loop. Binary classifier dressed up as reasoning.

Multiply by the number of steps per task, the number of tasks per day, and the price per token. Tool-selection is where agent inference budgets quietly bleed out, and where the same kinds of decisions repeat thousands of times.
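To make that multiplication concrete, here is a minimal sketch of the arithmetic. Every number in it is a hypothetical placeholder, not a getclaw measurement, and the `DecisionCostModel` name is ours for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionCostModel:
    """Back-of-the-envelope daily spend on decision-style LLM calls."""

    decision_calls_per_task: int    # tool-selection style LLM calls in one task
    tasks_per_day: int
    tokens_per_call: int            # prompt + completion tokens, averaged
    usd_per_million_tokens: float   # blended price (assumed, varies by model)

    def daily_cost_usd(self) -> float:
        total_tokens = (
            self.decision_calls_per_task
            * self.tasks_per_day
            * self.tokens_per_call
        )
        return total_tokens * self.usd_per_million_tokens / 1_000_000


# Hypothetical workload: 8 decisions per task, 10,000 tasks per day,
# 1,500 tokens per decision call, $2 per million tokens.
model = DecisionCostModel(
    decision_calls_per_task=8,
    tasks_per_day=10_000,
    tokens_per_call=1_500,
    usd_per_million_tokens=2.0,
)
print(f"${model.daily_cost_usd():.2f} per day on decision calls")
```

Swap in your own step counts, volumes, and prices; the point is only that small per-call costs scale linearly with each of the three factors.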

The integration

A native Hermes plugin, registered through Hermes's own interface

Not a wrapper. We built TRACER as a native plugin in Hermes's plugin interface, the same surface Hermes uses for its own internals. Where the framework previously called an LLM to pick the next tool, it now calls the TRACER plugin first, and falls back to the LLM only below the parity threshold:

# before — LLM picks the tool
tool_name = llm.complete(
    prompt=tool_selection_prompt(state),
    tools=available_tools,
).tool

# after — Hermes plugin routes through TRACER first, defers hard cases
decision = hermes.plugins.tracer.route(state)
if decision.confidence >= 0.95:
    tool_name = decision.tool      # local classifier · ~free
else:
    tool_name = llm.complete(...).tool   # defer to teacher LLM

Same classifier core, same parity gate. The agent loop is unchanged. The LLM is still in the stack; it just stops handling decisions a small ML model can handle.
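One plausible way a confidence cut-off like this could be chosen is to replay held-out, teacher-labelled traces and keep the lowest threshold at which the classifier's accepted decisions still agree with the LLM at a target rate. The sketch below is an assumption about how such calibration might work, not a description of TRACER's internals; `Scored` and `pick_threshold` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Scored(NamedTuple):
    confidence: float   # classifier confidence in its predicted tool
    predicted: str      # tool chosen by the local classifier
    teacher: str        # tool the LLM chose for the same state


def pick_threshold(
    held_out: Sequence[Scored], target_agreement: float
) -> Optional[float]:
    """Return the lowest cut-off t such that predictions with
    confidence >= t agree with the teacher at >= target_agreement.
    Returns None if no cut-off reaches the target."""
    ranked = sorted(held_out, key=lambda s: s.confidence, reverse=True)
    best: Optional[float] = None
    agree = 0
    for i, s in enumerate(ranked, start=1):
        agree += s.predicted == s.teacher
        # Only evaluate where the confidence value changes, so that the
        # prefix matches exactly the set accepted by "confidence >= t".
        at_boundary = i == len(ranked) or ranked[i].confidence != s.confidence
        if at_boundary and agree / i >= target_agreement:
            best = s.confidence
    return best


# Hypothetical held-out traces.
held_out = [
    Scored(0.99, "search", "search"),
    Scored(0.97, "search", "search"),
    Scored(0.96, "calc", "calc"),
    Scored(0.90, "calc", "search"),   # classifier disagrees with the teacher
    Scored(0.80, "email", "email"),
]
strict = pick_threshold(held_out, target_agreement=1.0)
relaxed = pick_threshold(held_out, target_agreement=0.8)
print(strict, relaxed)
```

A stricter agreement target pushes the cut-off up and sends more calls back to the LLM; a looser one keeps more decisions local at the cost of occasional disagreement.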

Why it compounds

Agents are the perfect TRACER workload

Four properties make agentic tool selection unusually well-suited:

Repetition: The same decisions, again and again. An agent doing the same kind of task hits the same tool-selection states thousands of times.
Small action space: N is tiny. Most agents have fewer than 20 tools. A classifier over 20 classes is a solved problem.
Structured output: Tool name + arguments. Discrete, bounded, never a paragraph. Exactly what a small ML model is built for.
Already labelled: The LLM's own past decisions. Every tool call your agent has ever made is a teacher-labelled (state, action) pair. No manual labelling.
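As an illustration of the "already labelled" property, the sketch below turns logged agent calls into (state, action) pairs. The JSON-lines log format and its field names (`kind`, `decided_by`, `state`, `tool`) are assumptions made for this example, not Hermes's or TRACER's actual trace schema.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def to_training_pairs(log_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Extract teacher-labelled (state, tool) pairs from JSON-lines logs."""
    pairs = []
    for line in log_lines:
        record = json.loads(line)
        # Keep only tool-selection decisions the LLM itself made; those are
        # the teacher labels. Decisions made by a classifier are skipped so
        # the student is not trained on its own outputs (a design assumption).
        if record.get("kind") == "tool_selection" and record.get("decided_by") == "llm":
            pairs.append((record["state"], record["tool"]))
    return pairs


# Hypothetical log excerpt.
sample_log = [
    '{"kind": "tool_selection", "decided_by": "llm", "state": "user asks for order status", "tool": "lookup_order"}',
    '{"kind": "generation", "decided_by": "llm", "state": "draft reply to user"}',
    '{"kind": "tool_selection", "decided_by": "llm", "state": "user asks about refund policy", "tool": "search_docs"}',
    '{"kind": "tool_selection", "decided_by": "classifier", "state": "user asks for order status", "tool": "lookup_order"}',
    '{"kind": "tool_selection", "decided_by": "llm", "state": "user asks where the parcel is", "tool": "lookup_order"}',
]

pairs = to_training_pairs(sample_log)
label_counts = Counter(tool for _, tool in pairs)
print(len(pairs), dict(label_counts))
```

The label counts double as a quick check on the size of the action space and on how skewed the tool distribution is before any model is trained.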
The result

Same agent. Same quality. Half the cost.

On the measured traces, end-to-end agent cost dropped ~50% with no observable degradation. The savings compound: every deferred LLM call becomes a new (state, action) trace that retrains the classifier and lets it handle the next case locally.
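The compounding effect can be sketched with a toy router that records every deferred decision and reuses it next time. An exact-match vote table stands in for a real classifier here, and `SelfTrainingRouter`, `min_examples`, and `fake_teacher` are hypothetical names; this is an illustration of the feedback loop under those assumptions, not TRACER's training procedure.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict


class SelfTrainingRouter:
    """Route locally when past teacher decisions are consistent enough;
    otherwise defer to the teacher and record its answer."""

    def __init__(
        self,
        teacher: Callable[[str], str],
        threshold: float = 0.95,
        min_examples: int = 3,
    ) -> None:
        self.teacher = teacher
        self.threshold = threshold
        self.min_examples = min_examples
        self.votes: Dict[str, Counter] = defaultdict(Counter)
        self.deferred = 0
        self.local = 0

    def route(self, state: str) -> str:
        seen = self.votes.get(state)  # .get avoids creating empty entries
        if seen:
            tool, count = seen.most_common(1)[0]
            total = sum(seen.values())
            if total >= self.min_examples and count / total >= self.threshold:
                self.local += 1
                return tool
        # Not confident yet: ask the teacher and keep the new trace.
        tool = self.teacher(state)
        self.votes[state][tool] += 1
        self.deferred += 1
        return tool


def fake_teacher(state: str) -> str:
    # Stand-in for the LLM call.
    return "lookup_order" if "order" in state else "search_docs"


router = SelfTrainingRouter(fake_teacher)
for _ in range(10):
    router.route("where is my order")
print(router.deferred, router.local)  # 3 deferred, then 7 handled locally
```

A production classifier would generalize across similar states instead of requiring exact matches, but the shape of the loop is the same: deferrals produce labels, and labels reduce future deferrals.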

"Tool selection turned out to be where most of our budget was going. Tracer let us treat it like the classification problem it always was."

Different framework? Different integration surface.

For Hermes we shipped a native plugin. For other agent stacks (LangGraph, CrewAI, custom loops) TRACER ships as an OpenAI-compatible endpoint or an in-process Python handle. Whatever your agent uses to pick the next tool, the routing shape is the same. Start by measuring which calls actually dominate your bill.
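A minimal sketch of that first measurement step, assuming each LLM call is already logged with a call kind and token counts. The record fields, the sample numbers, and the single blended token price are illustrative assumptions; real pricing usually differs for prompt and completion tokens.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cost_share_by_call_kind(
    calls: Iterable[dict], usd_per_million_tokens: float
) -> List[Tuple[str, float, float]]:
    """Return (kind, cost_usd, share_of_total) rows, most expensive first."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        tokens = call["prompt_tokens"] + call["completion_tokens"]
        totals[call["kind"]] += tokens * usd_per_million_tokens / 1_000_000
    grand_total = sum(totals.values())
    rows = [
        (kind, cost, cost / grand_total if grand_total else 0.0)
        for kind, cost in totals.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical call log for a single task.
calls = [
    {"kind": "tool_selection", "prompt_tokens": 1800, "completion_tokens": 20},
    {"kind": "tool_selection", "prompt_tokens": 1900, "completion_tokens": 20},
    {"kind": "done_check", "prompt_tokens": 1200, "completion_tokens": 5},
    {"kind": "generation", "prompt_tokens": 900, "completion_tokens": 400},
]

breakdown = cost_share_by_call_kind(calls, usd_per_million_tokens=2.0)
for kind, cost, share in breakdown:
    print(f"{kind:15s} ${cost:.6f}  {share:.0%}")
```

If decision-style calls sit at the top of this table for your own traces, they are the natural first candidates for routing.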