Getting Started with the Claude Fable 5 API
Claude Fable 5 went GA on June 9, 2026, and it is the most capable model Anthropic has ever made generally available - a Mythos-class model made safe for general use. The model ID is claude-fable-5. It ships with a 1M-token context window (no long-context premium), 128K max output tokens, and a January 2026 knowledge cutoff. This guide gets you from zero to a working request, then covers what trips up most people migrating existing code: Fable 5 removes several parameters that worked fine on Opus 4.8.
Install the SDK
Fable 5 is served through the same Messages API as every other Claude model. Use the official SDKs:
# Python
pip install anthropic
# TypeScript / JavaScript
npm install @anthropic-ai/sdk
Set ANTHROPIC_API_KEY in your environment and you're ready.
Your first request
Two things to notice before the code. First, adaptive thinking is always on in Fable 5 - there is nothing to set to turn it on, and omitting the thinking parameter still runs with thinking. Second, sampling parameters are gone. Do not pass thinking settings that disable or budget reasoning, and do not pass temperature, top_p, or top_k. The one knob you tune is effort, inside output_config.
Python:
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-fable-5",
    max_tokens=16000,
    output_config={"effort": "high"},  # the default; shown for clarity
    messages=[
        {"role": "user", "content": "Explain idempotency keys in payment APIs."}
    ],
)
for block in response.content:
    if block.type == "text":
        print(block.text)
TypeScript:
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const response = await client.messages.create({
  model: "claude-fable-5",
  max_tokens: 16000,
  output_config: { effort: "high" },
  messages: [
    { role: "user", content: "Explain idempotency keys in payment APIs." },
  ],
});
for (const block of response.content) {
  if (block.type === "text") console.log(block.text);
}
Note that max_tokens now caps thinking plus the visible response. Since thinking is always running, give it more headroom than you would have on a no-thinking Opus 4.8 call. Thinking text is omitted from responses by default (thinking.display defaults to "omitted"); if you surface reasoning to users, request display: "summarized".
What will 400 your old code
If you point existing Opus 4.8 code at claude-fable-5, these are the requests that fail:
| Parameter | Behavior on Fable 5 |
|---|---|
| thinking: {"type": "disabled"} | 400 error - thinking cannot be turned off |
| thinking: {"type": "enabled", "budget_tokens": N} | 400 error - budget_tokens is removed; use effort |
| temperature, top_p, top_k (non-default) | 400 error - sampling parameters are removed |
| Assistant prefill (last message has role: "assistant") | 400 error - use structured outputs instead |
| Omitting thinking entirely | No error - but the request runs with thinking (on Opus 4.8, omitting meant off) |
The last row is the silent one: an Opus 4.8 codebase that never set thinking will not error on Fable 5, but its latency and token profile will change because every call now reasons. Budget for it.
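To make the table above concrete, here is a minimal sketch of a migration helper that cleans an old request's keyword arguments before they are sent to Fable 5. The function name migrate_request_kwargs and its warning strings are hypothetical, not part of the anthropic SDK. It takes the simplest safe route and drops the whole thinking object, so any display setting you rely on has to be re-added by hand. Prefill cannot be fixed mechanically, so it is only flagged.

```python
from copy import deepcopy

# Parameters the table above lists as rejected on Fable 5.
REMOVED_PARAMS = ("temperature", "top_p", "top_k", "thinking")


def migrate_request_kwargs(old_kwargs, effort="high"):
    """Return (new_kwargs, warnings) for a Fable 5 request.

    Hypothetical helper for illustration; the input dict is not modified.
    """
    kwargs = deepcopy(old_kwargs)
    warnings = []

    # Drop every parameter that would now produce a 400.
    for name in REMOVED_PARAMS:
        if name in kwargs:
            kwargs.pop(name)
            warnings.append(f"dropped '{name}'")

    kwargs["model"] = "claude-fable-5"
    # Keep an explicit effort if the caller already set one.
    kwargs.setdefault("output_config", {}).setdefault("effort", effort)

    # Assistant prefill needs a human decision, so only report it.
    messages = kwargs.get("messages", [])
    if messages and messages[-1].get("role") == "assistant":
        warnings.append("last message is assistant prefill; use structured outputs")

    return kwargs, warnings


old_request = {
    "model": "claude-opus-4-8",
    "max_tokens": 1024,
    "temperature": 0.2,
    "thinking": {"type": "disabled"},
    "messages": [
        {"role": "user", "content": "Return JSON."},
        {"role": "assistant", "content": "{"},
    ],
}
new_request, migration_warnings = migrate_request_kwargs(old_request)
```

Logging the returned warnings during a migration is one way to catch the silent case too: a request that produces no warnings but previously ran without thinking will still change its latency and token profile.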
Stream anything big
Fable 5 supports up to 128K output tokens, and at higher effort the model can spend a lot of them thinking. For any request above roughly 16K max_tokens, use streaming so you don't hit HTTP timeouts:
with client.messages.stream(
    model="claude-fable-5",
    max_tokens=64000,
    output_config={"effort": "xhigh"},
    messages=[{"role": "user", "content": "Refactor this module..."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final = stream.get_final_message()
Effort cheat-sheet
Effort controls how deeply Fable 5 thinks, how it consolidates tool calls, and how verbose it is. Anthropic's guidance: start at high (the default) - lower effort on Fable 5 often exceeds xhigh on prior models.
| Level | Use for |
|---|---|
| low | Latency-sensitive routes, simple classification and extraction |
| medium | Cost-sensitive workloads that still need real reasoning |
| high | Default. Most production work - start here |
| xhigh | Capability-sensitive coding and agentic loops |
| max | Correctness-over-cost: research, hard migrations, one-shot answers that must be right |
See our effort parameter guide for measured cost per level.
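If several routes in one service call Fable 5, it can help to keep the effort choice in a single lookup instead of scattering string literals. The sketch below is one way to encode the cheat-sheet; the workload labels and function names are made up for illustration, and unknown workloads fall back to high, the default named above.

```python
# Hypothetical workload labels mapped to the effort levels in the cheat-sheet.
EFFORT_BY_WORKLOAD = {
    "classification": "low",
    "extraction": "low",
    "cost_sensitive": "medium",
    "general": "high",
    "agentic_coding": "xhigh",
    "research": "max",
}


def output_config_for(workload):
    """Build the output_config dict for a workload label.

    Unknown labels get "high", the documented default.
    """
    return {"effort": EFFORT_BY_WORKLOAD.get(workload, "high")}


classification_config = output_config_for("classification")
unknown_config = output_config_for("something_new")
```

The returned dict is meant to be passed as output_config in a messages.create call like the ones shown earlier.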
Handling refusals and fallbacks
Fable 5 ships with real-time safety classifiers (categories: cyber, bio, reasoning_extraction). A classifier-stopped response is not an HTTP error - it returns 200 with stop_reason: "refusal" and a stop_details.category:
response = client.messages.create(
    model="claude-fable-5",
    max_tokens=16000,
    fallbacks={"models": ["claude-opus-4-8"]},  # opt-in beta
    messages=[{"role": "user", "content": prompt}],
)
if response.stop_reason == "refusal":
    print(response.stop_details.category)  # "cyber" | "bio" | "reasoning_extraction"
    print(response.stop_details.explanation)
The opt-in fallbacks parameter (beta) automatically retries a refused request on Opus 4.8; Anthropic says it triggers in under 5% of sessions, and the fallback credit refunds the cache-switch cost. The official SDKs for TypeScript, Python, Go, Java, and C# also ship refusal-fallback middleware if you'd rather handle it client-side.
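Because a refusal arrives as a normal 200 response, it is easy to miss in code that only catches exceptions. As a rough sketch, the check can live in one small function that every call site reuses. The name summarize_stop is hypothetical, and the SimpleNamespace objects below are stand-ins that mimic only the fields named above (stop_reason, stop_details.category, stop_details.explanation) so the example runs without an API call.

```python
from types import SimpleNamespace


def summarize_stop(response):
    """Describe how a response ended, without raising on missing fields."""
    if response.stop_reason != "refusal":
        return {"refused": False, "category": None, "explanation": None}
    details = getattr(response, "stop_details", None)
    return {
        "refused": True,
        "category": getattr(details, "category", None),
        "explanation": getattr(details, "explanation", None),
    }


# Stand-in responses for illustration only; real ones come from the SDK.
normal_response = SimpleNamespace(stop_reason="end_turn")
refused_response = SimpleNamespace(
    stop_reason="refusal",
    stop_details=SimpleNamespace(category="cyber", explanation="example text"),
)

normal_summary = summarize_stop(normal_response)
refused_summary = summarize_stop(refused_response)
```

Returning a plain dict keeps the result easy to log or to branch on, for example to route a refused request to your own fallback path.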
Prompt caching
Caching works the same as on other Claude models, with one Fable-specific number: the minimum cacheable prefix is 512 tokens (1,024 on Amazon Bedrock). Shorter prefixes silently won't cache. Cache writes cost $12.50/MTok for the 5-minute TTL or $20/MTok for the 1-hour TTL; cache hits cost $1/MTok - a 90% discount on input.
system=[{
    "type": "text",
    "text": LARGE_SYSTEM_PROMPT,  # must be 512+ tokens to cache
    "cache_control": {"type": "ephemeral"},
}]
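Since a too-short prefix fails silently, a guard that only attaches cache_control when the prefix is long enough makes the behavior explicit. This is a sketch under assumptions: the helper name system_blocks and the platform labels are hypothetical, and the token count is supplied by the caller from whatever token counting you already use.

```python
# Minimum cacheable prefix sizes quoted above, keyed by a hypothetical platform label.
MIN_CACHEABLE_TOKENS = {"default": 512, "bedrock": 1024}


def system_blocks(text, token_count, platform="default"):
    """Build the system list, marking it cacheable only above the minimum."""
    block = {"type": "text", "text": text}
    if token_count >= MIN_CACHEABLE_TOKENS[platform]:
        block["cache_control"] = {"type": "ephemeral"}
    return [block]


short_prompt_blocks = system_blocks("You are terse.", token_count=5)
long_prompt_blocks = system_blocks("...long instructions...", token_count=2000)
```

The returned list is intended to be passed as the system argument of a request. Logging the cases where cache_control is left off is a cheap way to notice prompts that were expected to cache but will not.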
Pricing snapshot
Fable 5 costs $10 per million input tokens and $50 per million output tokens - exactly 2x Opus 4.8 ($5/$25). The Batch API takes 50% off ($5/$25) for anything that can wait up to 24 hours. There is no long-context premium for the 1M window, and Fable 5 is a Covered Model: 30-day retention, no zero-data-retention option. Full breakdown in Fable 5 pricing explained.
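For budgeting, the list prices above reduce to simple arithmetic. The estimator below is an illustrative sketch, not an official calculator: the function name and argument names are made up, and it assumes the 50% batch discount applies only to regular input and output tokens, since this guide does not say how batch and cache pricing combine.

```python
# Dollars per million tokens, taken from the prices quoted in this guide.
PRICE_PER_MTOK = {
    "input": 10.00,
    "output": 50.00,
    "cache_write_5m": 12.50,
    "cache_write_1h": 20.00,
    "cache_read": 1.00,
}


def estimate_cost(input_tokens=0, output_tokens=0, cache_read_tokens=0,
                  cache_write_tokens=0, cache_ttl="5m", batch=False):
    """Estimate request cost in dollars from token counts."""
    base = (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
    if batch:
        # Assumption: the batch discount covers input and output only.
        base *= 0.5
    cache = (cache_read_tokens * PRICE_PER_MTOK["cache_read"]
             + cache_write_tokens * PRICE_PER_MTOK[f"cache_write_{cache_ttl}"]) / 1_000_000
    return base + cache


# 1M input tokens plus 200K output tokens at list price: $10 + $10.
list_price = estimate_cost(input_tokens=1_000_000, output_tokens=200_000)
# The same job through the Batch API.
batch_price = estimate_cost(input_tokens=1_000_000, output_tokens=200_000, batch=True)
```

Here list_price works out to 20.0 and batch_price to 10.0.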
That's everything you need for day one: swap in the model ID, delete the dead parameters, start at effort: "high", stream big outputs, and handle stop_reason: "refusal". The rest is tuning.