Long-horizon autonomy: agents that work for days, not minutes
Every frontier model release since 2024 has promised "agentic" capability. Claude Fable 5 is the first where the claim survives contact with multi-day work. In Anthropic's launch announcement, the headline capability isn't a benchmark score - it's duration. Fable 5 sustains coherent, goal-directed work across sessions that span hours to days and millions of tokens, without the gradual drift into busywork that capped earlier models at a few hours of useful autonomy.
What changed under the hood
Long-horizon performance isn't one feature; it's the compound effect of several. Fable 5 plans across stages before it acts, decomposing a large goal into ordered phases with explicit checkpoints. It delegates aggressively to sub-agents when work fans out across independent items - dozens of files to migrate, hundreds of test suites to run - then reconciles their results against the plan. And critically, it checks its own work: re-running tests, re-reading diffs, and revising before declaring a stage complete, rather than declaring victory after the first clean pass.
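The plan, delegate, verify pattern described above can be illustrated from the harness side. This is a minimal sketch and not a description of how Fable 5 works internally: `Stage`, `run_item`, `verify`, `run_stage`, and `run_plan` are hypothetical names, and `run_item` is a stand-in for a real sub-agent call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One phase of the plan: independent work items plus their results."""
    name: str
    items: list[str]
    results: dict[str, str] = field(default_factory=dict)


def run_item(item: str) -> str:
    # Hypothetical stand-in for a sub-agent call; here it only tags the item.
    return f"migrated:{item}"


def verify(stage: Stage) -> list[str]:
    # Checkpoint: return the items whose result is missing or malformed.
    return [i for i in stage.items
            if not stage.results.get(i, "").startswith("migrated:")]


def run_stage(stage: Stage, max_retries: int = 2) -> bool:
    pending = list(stage.items)
    for _ in range(max_retries + 1):
        # Fan out independent items in parallel, then reconcile against the plan.
        with ThreadPoolExecutor(max_workers=4) as pool:
            for item, result in zip(pending, pool.map(run_item, pending)):
                stage.results[item] = result
        pending = verify(stage)  # only failed items are retried
        if not pending:
            return True  # stage passes its checkpoint
    return False


def run_plan(stages: list[Stage]) -> list[str]:
    completed = []
    for stage in stages:  # stages run in order; a failed checkpoint stops the run
        if not run_stage(stage):
            break
        completed.append(stage.name)
    return completed
```

The design choice worth noting is that a stage is only marked complete after `verify` returns no failures, which mirrors the "check before declaring done" behavior rather than trusting the first pass.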
The fourth ingredient is focus. Earlier Claude models lost the thread as context accumulated; instructions from hour one were diluted by hour six. Fable 5 maintains the original task specification as the organizing reference across millions of tokens of intermediate work, aided by server-side compaction and the file-based memory tool covered in our 1M context deep dive. The practical result: a run that fails at step 400 usually fails because the task was underspecified, not because the model forgot what it was doing.
Real-world results from the launch window
The launch-partner stories are the most concrete evidence yet published for any model's long-horizon claims.
| Run | Scope | Outcome |
|---|---|---|
| Stripe codebase migration | 50-million-line Ruby codebase | Completed in 1 day - work Stripe estimated at roughly 2 months for an engineering team |
| Ethan Mollick's game builds | Complete, playable video games from single prompts | Autonomous runs up to 12 hours without human correction |
| Autonomous genomics project | Comparative analysis across 138 species | A week-long unattended run producing a model that beat a published Science result while being 100x smaller |
Stripe's migration is the case worth dwelling on. A 50-million-line migration is not a single clever edit - it is thousands of coordinated changes that must stay mutually consistent, verified continuously against a live test suite. Work the team estimated at about two months for human engineers was compressed into a single day, largely because Fable 5 parallelized the work across sub-agents while keeping one coherent plan, and because it caught its own regressions before they compounded.
Wharton professor Ethan Mollick's experiments point at the other end of the spectrum: zero-scaffold autonomy. He reported complete, working video games generated from single prompts, with the model running unattended for as long as 12 hours - designing, implementing, playtesting via its own tooling, and iterating. The genomics run goes further still: a week of autonomous research across 138 species, ending in a trained model that outperformed a peer-reviewed Science publication at one-hundredth the parameter count. None of these are claims we can independently verify, but they are unusually specific, attributed, and consistent with the model's benchmark profile.
How to get long-horizon behavior in practice
Fable 5's autonomy rewards a different prompting style than chat-era models. The pattern that works:
- Give the full task specification up front. One well-specified initial turn - goal, constraints, definition of done - outperforms the same information dribbled across follow-ups. Ambiguity delivered progressively costs tokens and coherence.
- Run at high effort. Long-horizon coherence comes partly from reasoning more at each step. Start at `effort: "high"` (the default) and tune from there; see our adaptive thinking guide for the full parameter reference.
- Use Outcomes in Managed Agents. For server-managed sessions, define what "done" looks like as a gradeable rubric via an Outcome. The harness then runs an iterate-grade-revise loop until the artifact meets the rubric - the closest thing today to handing work to a colleague with acceptance criteria.
- Make success checkable. "A CSV with a numeric price column per SKU" beats "a good report." Fable 5 verifies its own work, but only against criteria concrete enough to verify.
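To make the "checkable" criterion concrete, here is a minimal sketch of a check for the CSV example in the last bullet. The column names `sku` and `price`, the function name `check_price_csv`, and the reading of "per SKU" as "one row per SKU" are all assumptions made for illustration.

```python
import csv
import io


def check_price_csv(text: str) -> list[str]:
    """Return a list of problems; an empty list means the artifact passes."""
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    # Assumed column names; adjust to whatever the task specification says.
    missing = [c for c in ("sku", "price") if c not in fields]
    if missing:
        return [f"missing column: {c}" for c in missing]

    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sku = (row["sku"] or "").strip()
        if not sku:
            problems.append(f"line {line_no}: empty sku")
        elif sku in seen:
            problems.append(f"line {line_no}: duplicate sku {sku}")
        seen.add(sku)
        try:
            float(row["price"])  # numeric means "parses as a float" here
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric price {row['price']!r}")
    return problems
```

A criterion written this way can be handed to the model as part of the definition of done, and the same function can be run by a human or a harness once the run finishes.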
A task budget (`output_config.task_budget`) gives the model a running token countdown it self-moderates against - worth wiring in before your first overnight run. See pricing explained for the full cost math.
Where the limits still are
Long-horizon autonomy does not mean unsupervised deployment. Anthropic shipped Fable 5 under ASL-3 protections with real-time classifiers that can refuse mid-run (details in our safety overview), and an under-resourced agent - missing a credential, a data mount, or a tool - will still discover the gap mid-run and stall. The discipline that pays off is the same one human teams use: reconcile the task against the available resources before kicking off, then let the model run.
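The reconciliation step above can be partly automated with a preflight check that runs before launch. This is a minimal sketch using only the Python standard library; the function name `preflight` and the example resource names in the usage note are hypothetical, and a real run would likely need checks specific to its own credentials and mounts.

```python
import os
import shutil
from pathlib import Path


def preflight(env_vars=(), paths=(), tools=()) -> list[str]:
    """Return a list of missing resources; an empty list means the run can start."""
    missing = []
    for name in env_vars:
        # Treat unset and empty-string variables the same way.
        if not os.environ.get(name):
            missing.append(f"env var not set: {name}")
    for p in paths:
        if not Path(p).exists():
            missing.append(f"path not found: {p}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            missing.append(f"tool not on PATH: {tool}")
    return missing
```

A launch script might call something like `preflight(env_vars=["DB_TOKEN"], paths=["/mnt/data"], tools=["git"])` and refuse to start the agent if the returned list is non-empty, so the gap surfaces in seconds instead of hours into the run.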
The honest summary: Fable 5 doesn't make agents magical. It makes them boring - in the way that reliable infrastructure is boring. You specify, you launch, and the work is usually done when you come back.