The AI jailbreak rubric you can borrow to triage risk
Anthropic, Amazon, Microsoft and Google proposed a shared standard for scoring AI jailbreak severity. The four criteria are a ready-made way to rank your own AI risk.
When "the AI said something it shouldn't" becomes a headline, every incident sounds equally bad. Not every one is, and the frontier labs just published a way to tell the difference. Anthropic, together with Amazon, Microsoft, Google and other partners, proposed a shared framework for scoring jailbreak severity. You don't run a frontier lab, but the rubric is a clean way to triage the AI risk in your own stack.
What actually happened
On July 2, Anthropic detailed a proposed industry-wide standard for measuring how serious a given jailbreak (a prompt that bypasses a model's safety guardrails) actually is. The goal is to stop treating every bypass as a five-alarm fire and give developers and governments a consistent basis for ranking risk. Four criteria are on the table:
- Capability gain — how much extra the jailbreak unlocks versus what's already available in other models or a search engine.
- Breadth — whether it enables one narrow task or works across many attack types.
- Ease of weaponization — the effort, prompting, and expertise needed to turn it into a real attack.
- Discoverability — how easily an attacker can find or reproduce the technique.
A bypass that reveals something you could Google, once, with expert effort, is not the same as one that's trivial to reproduce and unlocks broad new capability. The framework makes that distinction explicit.
Why it matters for your business
If you've wired an AI agent into your business — answering customers, drafting emails, touching your data — you have a jailbreak surface whether you've named it or not. The useful move isn't to panic at the first weird output. It's to score it. Run any incident through the same four questions: Did it unlock real new capability, or just say something off? Does it generalize or is it a one-off? How hard was it to trigger? Could a customer stumble into it by accident?
That triage tells you where to spend. A high-capability, easy-to-reproduce, broadly applicable bypass gets patched today. A cosmetic one gets logged and batched. Many small teams either ignore AI security entirely or freeze at every anomaly. The rubric is the middle path, and now it has four named criteria and industry backing.
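To make the patch-or-batch call concrete, here is a minimal Python sketch of scoring an incident on the four axes. Everything numeric in it is an assumption for illustration: the published framework names the criteria, but the 0 to 3 scale, the `patch_threshold` of 8, the rule that zero capability gain always counts as cosmetic, and the names `Incident` and `triage` are hypothetical, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """One AI incident scored 0-3 on each axis (the scale is an assumption)."""
    name: str
    capability_gain: int   # 0 = nothing beyond a web search, 3 = major new capability
    breadth: int           # 0 = one narrow task, 3 = works across many attack types
    ease: int              # 0 = needs expert effort, 3 = trivial to weaponize
    discoverability: int   # 0 = hard to find or reproduce, 3 = easy to stumble into

    def scores(self):
        return (self.capability_gain, self.breadth, self.ease, self.discoverability)


def triage(incident, patch_threshold=8):
    """Return 'patch now' or 'log and batch'. The threshold is illustrative only."""
    for score in incident.scores():
        if not 0 <= score <= 3:
            raise ValueError(f"scores must be between 0 and 3, got {score}")
    # Assumption: a bypass that unlocks nothing new is cosmetic, however easy it is.
    if incident.capability_gain == 0:
        return "log and batch"
    if sum(incident.scores()) >= patch_threshold:
        return "patch now"
    return "log and batch"


# Two hypothetical incidents, scored by hand.
off_brand_reply = Incident("agent wrote an off-brand reply", 0, 0, 1, 3)
data_dump = Incident("prompt makes agent reveal customer records", 3, 2, 3, 2)

for incident in (off_brand_reply, data_dump):
    print(f"{incident.name}: {triage(incident)}")
```

The first incident is easy to trigger but unlocks nothing, so it gets logged and batched. The second scores 10 out of 12 and gets patched today. A plain sum is the simplest possible rule. If one axis matters more in your stack, weight it, and tune the threshold against incidents you have already handled.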
The labs are standardizing how they reason about this. Borrow the reasoning. It's free, and it turns "is this bad?" into a question you can actually answer.
Key takeaways
- Anthropic, Amazon, Microsoft and Google proposed a shared standard for scoring AI jailbreak severity (July 2)
- Four criteria: capability gain, breadth, ease of weaponization, and discoverability
- The point is triage — not every bypass is equally serious, and treating them the same wastes effort
- The operator lesson: score your own AI incidents on the same four axes so you patch what's dangerous and batch what's cosmetic
Sources: Anthropic.
Tommy Rush — Founder, Rush Commerce
Operator turned builder. 15+ years running operations, now shipping the systems businesses run on.