Anthropic's fix for flaky AI agents: boring tools
A single deterministic retrieval tool lifted biology agents that scored as low as 16.9% to above 90%, peaking at 99.7%. The operator lesson: give agents real tools, not a bigger model.
The thing nobody tells you when you build an AI agent: the model is rarely the problem. The problem is what happens when the model has to go get a fact, and the place it gets facts from is inconsistent. Anthropic just published a benchmark that makes this concrete — and the fix is the least glamorous thing in software.
What actually happened
In research on agents in biology, Anthropic built VirBench: 120 realistic viral-sequence queries across 40 pathogens, with manually verified ground-truth answers. They ran a spread of frontier models against it. Without a proper retrieval tool, mean accuracy ranged from 16.9% to 91.3%, and worse, results were wildly unstable from run to run. Claude Sonnet 4 returned 106, then 15, then 5 sequences on three runs of the same query. Same model, same question, three different answers.
Then they gave the agents one thing: gget virus, a deterministic retrieval tool built with researchers at NCBI. Accuracy jumped above 90% for every agent tested, peaking at 99.7%. Run-to-run variability was largely eliminated, and the gap between "good" and "bad" models nearly closed. Anthropic's own framing: "reliable infrastructure, not model capability alone, determines scientific reliability." The layer underneath the smarts — identifiers, schemas, retrieval logic — has to be boringly, deterministically correct.
Why it matters for your business
Swap "viral sequences" for "your inventory counts," "customer order history," or "which invoices are unpaid," and this is your automation project. When an agent gives a different answer to the same question twice, the instinct is to reach for a smarter model. Usually that's the wrong lever. The flakiness is coming from the tool layer: a loose API, an ambiguous query, a data source that returns different rows depending on the phase of the moon. A frontier model on top of a mushy data layer is a confident, expensive guesser.
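One way to find out whether the flakiness lives in the tool layer is to take the model out of the loop and call the data source directly with the same input several times. The sketch below is a minimal illustration, not anything from Anthropic's benchmark: `repeatability_report`, `flaky_lookup`, and `stable_lookup` are hypothetical names, and the two lookups are stand-ins for whatever API or query your agent actually calls.

```python
import hashlib
import itertools
import json
from collections import Counter
from typing import Any, Callable


def repeatability_report(tool: Callable[[str], Any], query: str, runs: int = 5) -> dict:
    """Call the same tool with the same query `runs` times and count distinct answers."""
    fingerprints: Counter = Counter()
    for _ in range(runs):
        answer = tool(query)
        # Serialize to a canonical string so equal answers hash equally.
        # List order is preserved, so a reordered result counts as a different answer.
        canonical = json.dumps(answer, sort_keys=True, default=str)
        fingerprints[hashlib.sha256(canonical.encode("utf-8")).hexdigest()] += 1
    return {
        "runs": runs,
        "distinct_answers": len(fingerprints),
        "stable": len(fingerprints) == 1,
    }


# Hypothetical stand-ins for a real data source (assumptions for illustration only).
_sizes = itertools.cycle([3, 2, 1])


def flaky_lookup(query: str) -> list:
    """Returns a different number of rows on each call, like a loose API might."""
    return list(range(next(_sizes)))


def stable_lookup(query: str) -> list:
    """Returns the same rows every time."""
    return [1, 2, 3]


print(repeatability_report(flaky_lookup, "unpaid invoices for C42", runs=3))
print(repeatability_report(stable_lookup, "unpaid invoices for C42", runs=3))
```

If the source itself fails this check, no model upgrade will make the agent's answers repeatable.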
The move that actually works is unsexy. Wrap the messy source in a deterministic tool, one that takes a clear input and returns the same correct output every time, and hand the agent that. Then the model does what it's good at (deciding what to ask, interpreting the answer) and the tool does what it's good at (being right, repeatably). This is exactly where most "the AI keeps getting it wrong" problems we see actually live, and it's why we spend more time on the boring data plumbing than on prompt-tuning. A cheaper model with a reliable tool can beat a frontier model with a flaky one, and it costs less to run.
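To make "deterministic tool" concrete, here is a minimal sketch using the unpaid-invoices example. Everything in it is an assumption for illustration: the `Invoice` fields, the `unpaid_invoices` function, and the in-memory rows stand in for a real database or API. The point is the shape: normalize the input, reject ambiguous input, drop duplicates, and return results in a fixed order so the same question always gets the same answer.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Invoice:
    # Hypothetical schema; a real source would have its own fields.
    invoice_id: str
    customer_id: str
    amount_cents: int
    paid: bool


def unpaid_invoices(rows: Iterable[Invoice], customer_id: str) -> list[Invoice]:
    """Return one customer's unpaid invoices, identically for identical inputs."""
    # Normalize the identifier so " c42 " and "C42" mean the same customer.
    key = customer_id.strip().upper()
    if not key:
        raise ValueError("customer_id must be a non-empty string")
    # A set drops exact duplicate rows (frozen dataclasses are hashable).
    matches = {
        row for row in rows
        if row.customer_id.strip().upper() == key and not row.paid
    }
    # A fixed sort order removes any dependence on how the source ordered its rows.
    return sorted(matches, key=lambda r: (r.invoice_id, r.customer_id, r.amount_cents))


rows = [
    Invoice("INV-3", "c42", 500, False),
    Invoice("INV-1", "C42", 1200, False),
    Invoice("INV-2", "C42", 900, True),
    Invoice("INV-1", "C42", 1200, False),  # exact duplicate from a messy source
    Invoice("INV-9", "C07", 100, False),
]

for invoice in unpaid_invoices(rows, " c42 "):
    print(invoice.invoice_id, invoice.amount_cents)
```

The agent is then given only this function as its tool. It still decides when to call it and how to explain the result, but it never sees the raw, inconsistently ordered source.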
Key takeaways
- Anthropic's VirBench: without a deterministic tool, biology agents scored 16.9%–91.3% and gave different answers to identical queries
- Adding one deterministic retrieval tool (gget virus, built with NCBI) pushed every agent above 90%, peaking at 99.7%, and killed the run-to-run variance
- Agent flakiness usually comes from the tool/data layer, not the model — a bigger model won't fix a mushy source
- Wrap messy sources in deterministic tools; a cheaper model with a reliable tool can beat a frontier model with a flaky one
Got an AI feature that keeps getting it wrong? We fix the tool and data layer underneath the model — the part that makes automation actually reliable. See what we've shipped or tell us where it's flaky.
Sources: Anthropic Research.
Tommy Rush — Founder, Rush Commerce
Operator turned builder. 15+ years running operations, now shipping the systems businesses run on.