Anthropic's fix for flaky AI agents: boring tools

The thing nobody tells you when you build an AI agent: the model is rarely the problem. The problem is what happens when the model has to go get a fact, and the place it gets facts from is inconsistent. Anthropic just published a benchmark that makes this concrete — and the fix is the least glamorous thing in software.

What actually happened

In research on agents in biology, Anthropic built VirBench: 120 realistic viral-sequence queries across 40 pathogens, with manually verified ground-truth answers. They ran a spread of frontier models against it. Without a proper retrieval tool, mean accuracy ranged from 16.9% to 91.3%, and worse, results were wildly unstable. Claude Sonnet 4 returned 106, then 15, then 5 sequences on three runs of the same query. Same model, same question, three answers.

Then they gave the agents one thing: gget virus, a deterministic retrieval tool built with researchers at NCBI. Accuracy jumped above 90% for every agent tested, peaking at 99.7%. Run-to-run variability was largely eliminated, and the gap between "good" and "bad" models nearly closed. Anthropic's own framing: "reliable infrastructure, not model capability alone, determines scientific reliability." The layer underneath the smarts — identifiers, schemas, retrieval logic — has to be boringly, deterministically correct.

Why it matters for your business

Swap "viral sequences" for "your inventory counts," "customer order history," or "which invoices are unpaid," and this is your automation project. When an agent gives a different answer to the same question twice, the instinct is to reach for a smarter model. Usually that's the wrong lever. The flakiness is coming from the tool layer: a loose API, an ambiguous query, a data source that returns different rows depending on the phase of the moon. A frontier model on top of a mushy data layer is a confident, expensive guesser.
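Before swapping models, it can help to check whether the tool layer itself is repeatable. The sketch below is one way to do that, not a prescribed method: it calls a tool several times with the same input, fingerprints each result, and counts how many distinct answers come back. The names `fingerprint`, `distinct_answers`, `flaky_tool`, and `stable_tool` are hypothetical, and the two tools are toy stand-ins rather than real data sources.

```python
import hashlib
import itertools
import json
from collections import Counter
from typing import Callable


def fingerprint(result: object) -> str:
    """Hash a JSON-serializable result so two runs can be compared cheaply."""
    # sort_keys makes dict key order irrelevant; list order still counts,
    # so rows returned in a different order register as a different answer.
    canonical = json.dumps(result, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def distinct_answers(tool: Callable[[], object], runs: int = 5) -> int:
    """Call the tool `runs` times and return how many distinct results it gave."""
    counts = Counter(fingerprint(tool()) for _ in range(runs))
    return len(counts)


# Hypothetical flaky source: returns a different number of rows on each call.
_row_counts = itertools.cycle([106, 15, 5])


def flaky_tool() -> list[int]:
    return list(range(next(_row_counts)))


# Hypothetical stable source: always returns the same rows.
def stable_tool() -> list[int]:
    return [1, 2, 3]
```

If `distinct_answers(...)` comes back greater than 1 for a tool called with identical input, the instability is in the data layer, and a smarter model on top will not remove it.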

The move that actually works is unsexy. Wrap the messy source in a deterministic tool, one that takes a clear input and returns the same correct output every time, and hand the agent that. Then the model does what it's good at (deciding what to ask, interpreting the answer) and the tool does what it's good at (being right, repeatably). This is exactly where most "the AI keeps getting it wrong" problems we see actually live, and it's why we spend more time on the boring data plumbing than on prompt-tuning. A cheaper model with a reliable tool will usually beat a frontier model with a flaky one, and it costs less to run.
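As a rough illustration of what "wrap the messy source" can look like, here is a minimal sketch using the unpaid-invoices example. Everything in it is assumed for illustration: `messy_source` is a hypothetical stand-in for a loose API that returns rows in random order, with duplicates and inconsistent status strings, and `get_unpaid_invoices` is a hypothetical tool name, not part of any real library.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    customer_id: str
    amount_cents: int
    status: str


def messy_source(customer_id: str) -> list[dict]:
    """Hypothetical loose API: unordered rows, duplicates, inconsistent casing."""
    rows = [
        {"id": "INV-3", "customer": customer_id, "amount": 4500, "status": "UNPAID"},
        {"id": "INV-1", "customer": customer_id, "amount": 1200, "status": "unpaid "},
        {"id": "INV-2", "customer": customer_id, "amount": 9900, "status": "paid"},
        {"id": "INV-1", "customer": customer_id, "amount": 1200, "status": "unpaid "},
    ]
    random.shuffle(rows)  # simulate a source with no stable ordering
    return rows


def get_unpaid_invoices(customer_id: str) -> list[Invoice]:
    """Deterministic tool handed to the agent: same input, same output."""
    # Normalize the input so " c-42 " and "C-42" mean the same query.
    customer_id = customer_id.strip().upper()
    if not customer_id:
        raise ValueError("customer_id must be non-empty")

    unique: dict[str, Invoice] = {}
    for row in messy_source(customer_id):
        status = row["status"].strip().lower()  # normalize status strings
        if status != "unpaid":
            continue
        # Keying by invoice id drops duplicate rows.
        unique[row["id"]] = Invoice(row["id"], row["customer"], row["amount"], status)

    # A fixed sort order removes any dependence on the source's row order.
    return [unique[key] for key in sorted(unique)]
```

The design choice is that input normalization, deduplication, and ordering all live inside the tool, so the agent never sees the source's quirks and only has to decide which customer to ask about.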

Key takeaways

Anthropic's VirBench: without a deterministic tool, biology agents scored 16.9%–91.3% and gave different answers to identical queries
Adding one deterministic retrieval tool (gget virus, built with NCBI) pushed every agent above 90%, peaking at 99.7%, and killed the run-to-run variance
Agent flakiness usually comes from the tool/data layer, not the model — a bigger model won't fix a mushy source
Wrap messy sources in deterministic tools; a cheaper model with a reliable tool beats a frontier model with a flaky one

Got an AI feature that keeps getting it wrong? We fix the tool and data layer underneath the model — the part that makes automation actually reliable. See what we've shipped or tell us where it's flaky.

Sources: Anthropic Research.
