Fable 5 Safety - ASL-3, Classifiers & Fallbacks

Claude Fable 5 exists because of its safety system, not despite it. Anthropic describes Fable 5 as "a Mythos-class model made safe for general use" - the same underlying intelligence as Claude Mythos 5, wrapped in deployment safeguards strict enough to release publicly. The model shipped under ASL-3 (AI Safety Level 3) protections and is classified CB-1, the first public Claude to carry that designation. Understanding the safeguards matters for anyone building on the API, because they are visible in responses, billing, and retention policy.

Three real-time classifiers

Every Fable 5 request and response passes through three classifier safeguards that run in real time, alongside generation:

Cyber - blocks assistance with offensive cyber operations: exploit development, intrusion tooling, and attack planning beyond defensive or educational scope.
Bio/chem - blocks uplift toward biological and chemical weapons, the core ASL-3 concern.
Reasoning-extraction - an anti-distillation safeguard that blocks systematic attempts to extract the model's raw reasoning at scale, protecting the chain-of-thought from being harvested to train competitor models.

The third is novel. Cyber and bio classifiers have precedents in earlier Claude releases; a classifier whose job is to stop other labs from distilling the model is new, and it is part of why Fable 5's thinking text defaults to omitted (see our adaptive thinking guide).

How refusals surface in the API

A classifier-triggered refusal is not an HTTP error. The request returns HTTP 200 with stop_reason: "refusal" and a structured stop_details object whose category field is one of "cyber", "bio", "reasoning_extraction", or null (for model-initiated refusals outside the three classifier domains). Refusals caught at the prompt stage are not billed.

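import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(model="claude-fable-5", ...)  # other arguments elided

if response.stop_reason == "refusal":
    # category is "cyber" | "bio" | "reasoning_extraction" | None
    category = response.stop_details.category
    print(response.stop_details.explanation)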

Handle this branch explicitly. Retrying the same prompt will not help; surfacing the explanation to the user usually does.
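One way to make that branch explicit is a small helper that turns a refusal into a user-facing message and never retries. This is a minimal sketch, not Anthropic's recommended pattern: the message wording and the refusal_message helper are hypothetical, and the demo uses a SimpleNamespace stand-in shaped like the stop_reason / stop_details fields described above rather than a real API response.

```python
from types import SimpleNamespace

# Hypothetical user-facing messages, keyed by stop_details.category.
REFUSAL_MESSAGES = {
    "cyber": "This request was blocked by the cyber safeguard.",
    "bio": "This request was blocked by the bio/chem safeguard.",
    "reasoning_extraction": "This request was blocked by the reasoning-extraction safeguard.",
    None: "The model declined this request.",
}


def refusal_message(response):
    """Return a user-facing message for a refusal, or None if it is not a refusal."""
    if response.stop_reason != "refusal":
        return None
    details = response.stop_details
    # Unknown categories fall back to the generic message.
    base = REFUSAL_MESSAGES.get(details.category, REFUSAL_MESSAGES[None])
    explanation = getattr(details, "explanation", None)
    return f"{base} {explanation}" if explanation else base


# Stand-in object for illustration only; a real response comes from the API.
demo = SimpleNamespace(
    stop_reason="refusal",
    stop_details=SimpleNamespace(category="cyber", explanation="Try a defensive framing."),
)
print(refusal_message(demo))
```

Returning None for non-refusals keeps the normal path untouched, so the helper can be dropped in front of existing response handling.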

The fallbacks parameter

For production workloads that cannot tolerate mid-session refusals, Anthropic ships an opt-in fallbacks parameter (beta, available on the Claude API and Claude Platform on AWS). When a Fable 5 classifier blocks a request, the API automatically retries it on Opus 4.8 - a model outside the Mythos-class safeguard regime - and returns that response instead. Key facts:

Aspect	Detail
Availability	Beta - Claude API and Claude Platform on AWS; opt-in per request
Fallback model	Claude Opus 4.8
Trigger rate	Under 5% of sessions, per Anthropic's launch figures
Cost handling	A "fallback credit" refunds the prompt-cache switching cost incurred by the model change

The fallback credit matters more than it might seem: switching models normally invalidates your prompt cache, and re-writing a 500K-token cached prefix on Opus 4.8 would otherwise turn a single refusal into a multi-dollar event.
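The arithmetic behind that "multi-dollar" figure is easy to sketch. The article does not state Opus 4.8's cache-write price, so the rate below is a placeholder assumption, not a published number; substitute the real per-million-token price before relying on the result.

```python
def cache_rewrite_cost(prefix_tokens: int, write_price_per_mtok: float) -> float:
    """Dollar cost of re-writing a cached prefix after a model switch."""
    return prefix_tokens / 1_000_000 * write_price_per_mtok


# Hypothetical cache-write price in dollars per million tokens (assumption).
ASSUMED_WRITE_PRICE_PER_MTOK = 10.0

cost = cache_rewrite_cost(500_000, ASSUMED_WRITE_PRICE_PER_MTOK)
print(f"${cost:.2f}")  # $5.00 under the assumed price
```

At any price in that range, a handful of fallbacks per day on a large cached prefix adds up, which is the cost the fallback credit is described as refunding.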

Red-teaming and retention

Anthropic reports more than 1,000 hours of external red-teaming against the deployed safeguard stack, with no universal jailbreaks found - no single technique that reliably defeats the classifiers across domains. The findings are documented in the joint Fable 5 / Mythos 5 system card, the first time Anthropic has published one system card covering two models.

One operational consequence developers should plan for: Fable 5 is designated a "Covered Model," which carries mandatory 30-day data retention. Zero-data-retention agreements do not apply to Fable 5 traffic - organizations with strict ZDR requirements will need to keep those workloads on Opus 4.8 or earlier models for now.

The false-positive debate

Anthropic is unusually candid that the classifiers overreach. The system card states they are "stricter than would be ideal... This will be frustrating to some users." The community confirmed it within hours of launch: security researchers reported legitimate defensive work - malware analysis write-ups, CTF solutions, patch diffing - tripping the cyber classifier, and the threads on Hacker News collected dozens of similar reports. The criticism is fair, and so is the context: a sub-5% session trigger rate with an automatic fallback path is a meaningfully better experience than the hard refusals of earlier safety systems, and Anthropic has said classifier precision will improve with deployment data. Where you land likely depends on whether your workload sits near the cyber boundary.

Building near the boundary? Opt into fallbacks, log stop_details.category on every refusal, and route persistent cyber-domain work to Opus 4.8 directly rather than fighting the classifier. Prompt-stage refusals cost nothing, so instrumenting the failure mode is cheap.
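The logging-and-routing advice could be sketched roughly as follows. Everything here beyond the stop_reason and stop_details.category fields is an assumption for illustration: the RefusalTracker class, the threshold of three refusals, and the "claude-opus-4-8" model ID are hypothetical, and the demo uses stand-in objects instead of real API responses.

```python
from collections import Counter
from types import SimpleNamespace

FABLE_MODEL = "claude-fable-5"   # model ID as used in the example above
OPUS_MODEL = "claude-opus-4-8"   # hypothetical ID for Opus 4.8; verify before use
CYBER_REFUSAL_THRESHOLD = 3      # hypothetical cutoff for "persistent" cyber refusals


class RefusalTracker:
    """Counts refusals per (workload, category) and picks a model per workload."""

    def __init__(self):
        self.counts = Counter()

    def record(self, workload, response):
        """Log the refusal category for this workload; ignore non-refusals."""
        if response.stop_reason == "refusal":
            self.counts[(workload, response.stop_details.category)] += 1

    def model_for(self, workload):
        """Route to Opus 4.8 once a workload keeps tripping the cyber classifier."""
        if self.counts[(workload, "cyber")] >= CYBER_REFUSAL_THRESHOLD:
            return OPUS_MODEL
        return FABLE_MODEL


# Stand-in refusal for illustration only.
cyber_refusal = SimpleNamespace(
    stop_reason="refusal", stop_details=SimpleNamespace(category="cyber")
)
tracker = RefusalTracker()
for _ in range(3):
    tracker.record("malware-triage", cyber_refusal)
print(tracker.model_for("malware-triage"))
print(tracker.model_for("docs-summaries"))
```

Keying the counts by workload rather than globally keeps one noisy pipeline from pushing unrelated traffic off Fable 5.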

Mythos 5: the same model, fewer locks

Claude Mythos 5 is the same model as Fable 5 with the cyber safeguards lifted, available exclusively to vetted Project Glasswing partners - government agencies and approved security firms doing authorized offensive-security work. It is not on the public API and there is no waitlist. The full story of the model Anthropic initially considered too dangerous to release is in our explainer, "What is Claude Mythos 5?"; the official details are in Anthropic's announcement.
