Fable 5: 1M Context Window, Memory & Caching

Context windows stopped being a spec-sheet race when they hit a million tokens - the interesting questions became price, retention quality, and what happens when even a million isn't enough. Claude Fable 5 answers all three: a 1M-token context window at standard API pricing with no long-context premium, 128K max output tokens, and a layered system - caching, compaction, context editing, and a file-based memory tool - for work that outgrows any single window.

The headline numbers

Fable 5 accepts up to one million tokens of input - roughly 750,000 words, or a mid-sized codebase with room to spare - and produces up to 128,000 tokens of output per request (streaming required at that scale). Unlike some earlier long-context offerings, there is no surcharge tier: the full window bills at the standard $10/$50 per million input/output tokens. Fable 5 also uses the same tokenizer as Opus 4.8, so token counts, cost models, and count_tokens baselines carry over unchanged from existing Opus 4.8 deployments - one less thing to re-calibrate during migration.

Prompt caching gets cheaper to enter

Long contexts only make economic sense with caching, and Fable 5 lowers the bar. The minimum cacheable prefix drops to 512 tokens (1,024 on Amazon Bedrock) - small enough that even a modest system prompt qualifies, where prior models silently skipped anything under 1-4K tokens.

Cache operation	Price per MTok	vs base input ($10)
Cache write, 5-minute TTL	$12.50	1.25x
Cache write, 1-hour TTL	$20.00	2x
Cache hit (read)	$1.00	0.1x

The break-even math is friendly: at the 5-minute TTL, a prefix read twice already costs less than sending it uncached twice. For an agent looping over a 200K-token codebase context, cache hits at $1/MTok versus $10/MTok uncached are the difference between a viable product and a science experiment. Full worked examples are in pricing explained.

When a million tokens isn't enough

Two mechanisms extend Fable 5 past the physical window. Compaction (supported, beta header) summarizes earlier conversation server-side as the context approaches its limit, returning a compaction block you pass back on subsequent turns - the session keeps going without losing the thread. Context editing (beta) takes the complementary approach: instead of summarizing, it prunes - clearing stale tool results and old reasoning from the transcript based on configurable thresholds, keeping the window lean during long agentic runs.

The third layer is the file-based memory tool, and it is where Fable 5 most clearly outruns its predecessor. The model reads and writes files in a memory directory you host - notes to itself that survive across sessions, not just across turns. Anthropic's launch evaluation used the roguelike game Slay the Spire, where success depends on remembering what worked across many runs: Fable 5 with the memory tool performed 3x better than Opus 4.8 on the same harness. The model is markedly better both at deciding what is worth writing down and at actually consulting its notes later - the two failure modes that made earlier memory implementations underwhelming.

Practical pattern: use all three together. Context editing keeps the live window lean, compaction catches you at the boundary, and the memory tool carries durable state across sessions. Long-horizon agents (see our deep dive) typically wire up the memory tool first - it is the cheapest of the three and pays off on any task longer than one sitting.

Vision: the other half of context

A long context window matters more when the model can fill it with more than text. Fable 5's vision is strong enough to serve as a primary interface, not an accessory. Two launch-window demonstrations stand out. First, Fable 5 completed Pokemon FireRed using a vision-only harness - no RAM reads, no structured game state, just screenshots in and button presses out, sustained across the full game. Second, testers had it rebuild a working web app from screenshots alone, reproducing layout, styling, and interaction behavior from static images of the original. Both exercises stress the same combination: pixel-accurate perception held coherently across a very long session.

For document workloads the implication is straightforward - scanned PDFs, dashboard screenshots, and slide decks can sit in the same million-token window as the code and prose that reference them, and the model treats all of it as one working set.

What it means for builders

The combination changes default architectures. Retrieval pipelines built to squeeze codebases into 200K windows can often be replaced by "put the repository in context and cache it." Session state machines built to work around context loss can hand that job to compaction. And cross-session personalization that previously required a vector database can frequently run on a directory of markdown files the model maintains itself. None of these are mandatory - but on Fable 5 they are the simple option, which is new. The official model documentation has the canonical limits; our Fable 5 vs Opus 4.8 comparison covers when the upgrade pays for itself.

1M context and memory: holding an entire project in working memory

The headline numbers

Prompt caching gets cheaper to enter

When a million tokens isn't enough

Vision: the other half of context

What it means for builders

Related reading