Long-horizon autonomy: agents that work for days, not minutes

Every frontier model release since 2024 has promised "agentic" capability. Claude Fable 5 is the first where the claim survives contact with multi-day work. In Anthropic's launch announcement, the headline capability isn't a benchmark score - it's duration. Fable 5 sustains coherent, goal-directed work across sessions that span hours to days and millions of tokens, without the gradual drift into busywork that capped earlier models at a few hours of useful autonomy.

What changed under the hood

Long-horizon performance isn't one feature; it's the compound effect of several. Fable 5 plans across stages before it acts, decomposing a large goal into ordered phases with explicit checkpoints. It delegates aggressively to sub-agents when work fans out across independent items - dozens of files to migrate, hundreds of test suites to run - then reconciles their results against the plan. And critically, it checks its own work: it re-runs tests, re-reads diffs, and revises before marking a stage complete, rather than declaring victory after the first clean pass.
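To make the plan, delegate, and verify pattern concrete, here is a minimal Python sketch of the control flow. It is our illustration of the general pattern, not a description of how Fable 5 is implemented: `Stage`, `run_stage`, `run_plan`, and the `worker` and `verify` callables are hypothetical names, and a thread pool stands in for sub-agents.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Stage:
    """One phase of the plan: a name plus independent work items (hypothetical type)."""
    name: str
    items: list[str]
    results: dict[str, str] = field(default_factory=dict)


def run_stage(stage: Stage,
              worker: Callable[[str], str],
              verify: Callable[[str, str], bool],
              max_revisions: int = 2) -> bool:
    """Fan items out to workers, then re-check every result before accepting the stage."""
    # Delegate: independent items run in parallel (threads stand in for sub-agents).
    with ThreadPoolExecutor(max_workers=4) as pool:
        outputs = list(pool.map(worker, stage.items))
    stage.results = dict(zip(stage.items, outputs))

    # Verify and revise: redo only the items that fail the check.
    for _ in range(max_revisions):
        failed = [item for item, out in stage.results.items() if not verify(item, out)]
        if not failed:
            return True
        for item in failed:
            stage.results[item] = worker(item)
    return all(verify(item, out) for item, out in stage.results.items())


def run_plan(stages: list[Stage],
             worker: Callable[[str], str],
             verify: Callable[[str, str], bool]) -> list[str]:
    """Run stages in order; stop at the first stage that cannot be verified."""
    completed = []
    for stage in stages:
        if not run_stage(stage, worker, verify):
            break  # checkpoint failed: do not build later stages on unverified work
        completed.append(stage.name)
    return completed


# Toy usage: "migrating" a file just tags its path.
plan = [Stage("models", ["a.rb", "b.rb"]), Stage("controllers", ["c.rb"])]
done = run_plan(plan,
                worker=lambda path: f"migrated:{path}",
                verify=lambda path, out: out == f"migrated:{path}")
print(done)  # ['models', 'controllers']
```

The design point is the checkpoint: a stage is accepted only after every item passes verification, so an error in one phase cannot silently propagate into the next.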

The fourth ingredient is focus. Earlier Claude models lost the thread as context accumulated; instructions from hour one were diluted by hour six. Fable 5 maintains the original task specification as the organizing reference across millions of tokens of intermediate work, aided by server-side compaction and the file-based memory tool covered in our 1M context deep dive. The practical result: a run that fails at step 400 usually fails because the task was underspecified, not because the model forgot what it was doing.

Real-world results from the launch window

The launch-partner stories are the most concrete evidence yet published for any model's long-horizon claims.

| Run | Scope | Outcome |
|---|---|---|
| Stripe codebase migration | 50-million-line Ruby codebase | Completed in 1 day - work Stripe estimated at roughly 2 months for an engineering team |
| Ethan Mollick's game builds | Complete, playable video games from single prompts | Autonomous runs up to 12 hours without human correction |
| Autonomous genomics project | Comparative analysis across 138 species | A week-long unattended run producing a model that beat a published Science result while being 100x smaller |

Stripe's migration is the case worth dwelling on. A 50-million-line migration is not a single clever edit - it is thousands of coordinated changes that must stay mutually consistent, verified continuously against a live test suite. The team's estimate of about two months for human engineers compressed to a single day largely because Fable 5 parallelized the work across sub-agents while keeping one coherent plan, and because it caught its own regressions before they compounded.

Wharton professor Ethan Mollick's experiments point at the other end of the spectrum: zero-scaffold autonomy. He reported complete, working video games generated from single prompts, with the model running unattended for as long as 12 hours - designing, implementing, playtesting via its own tooling, and iterating. The genomics run goes further still: a week of autonomous research across 138 species, ending in a trained model that outperformed a peer-reviewed Science publication at one-hundredth the parameter count. None of these are claims we can independently verify, but they are unusually specific, attributed, and consistent with the model's benchmark profile.

How to get long-horizon behavior in practice

Fable 5's autonomy rewards a different prompting style than chat-era models. The pattern that works:

  • Give the full task specification up front. One well-specified initial turn - goal, constraints, definition of done - outperforms the same information dribbled across follow-ups. Ambiguity delivered progressively costs tokens and coherence.
  • Run at high effort. Long-horizon coherence comes partly from reasoning more at each step. Start at effort: "high" (the default) and tune from there; see our adaptive thinking guide for the full parameter reference.
  • Use Outcomes in Managed Agents. For server-managed sessions, define what "done" looks like as a gradeable rubric via an Outcome. The harness then runs an iterate-grade-revise loop until the artifact meets the rubric - the closest thing today to handing work to a colleague with acceptance criteria.
  • Make success checkable. "A CSV with a numeric price column per SKU" beats "a good report." Fable 5 verifies its own work, but only against criteria concrete enough to verify.
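As an illustration of a checkable criterion, "a CSV with a numeric price column per SKU" can be written as a small validator. This is a sketch under our own assumptions: the column names `sku` and `price` and the function name `check_price_csv` are hypothetical, and it uses only the Python standard library.

```python
import csv
import io
import math


def check_price_csv(text: str) -> list[str]:
    """Return a list of problems; an empty list means the artifact passes."""
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    # Column names are assumptions for this example.
    problems = [f"missing column: {col}" for col in ("sku", "price") if col not in fields]
    if problems:
        return problems

    seen = set()
    rows = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        rows += 1
        sku = (row["sku"] or "").strip()
        if not sku:
            problems.append(f"line {line_no}: empty sku")
        elif sku in seen:
            problems.append(f"line {line_no}: duplicate sku {sku}")
        seen.add(sku)

        try:
            price = float(row["price"])
        except (TypeError, ValueError):
            price = None
        if price is None or not math.isfinite(price):
            problems.append(f"line {line_no}: non-numeric price {row['price']!r}")

    if rows == 0:
        problems.append("no data rows")
    return problems


print(check_price_csv("sku,price\nA1,9.99\nB2,12\n"))  # []
print(check_price_csv("sku,price\nA1,cheap\n"))        # ["line 2: non-numeric price 'cheap'"]
```

A check like this can go into the task specification itself, so that "done" means the validator returns an empty list rather than the output merely looking plausible.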

Note: Long runs consume real money. Fable 5 is priced at $10/$50 per million tokens, and a multi-day session can consume tens of millions of tokens. The task budgets beta (output_config.task_budget) gives the model a running token countdown that it self-moderates against - worth wiring in before your first overnight run. See pricing explained for the full cost math.
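A back-of-the-envelope estimate is worth running before launch. The sketch below assumes the $10/$50 figures are the input and output rates respectively, which is our reading of the price pair; the 30-million/5-million token split is a made-up example, not a measured run.

```python
# Assumption: $10 per million input tokens, $50 per million output tokens.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a run in US dollars from its token counts."""
    return ((input_tokens / 1_000_000) * INPUT_USD_PER_MTOK
            + (output_tokens / 1_000_000) * OUTPUT_USD_PER_MTOK)


# Hypothetical overnight run: 30M input tokens, 5M output tokens.
# 30 * $10 + 5 * $50 = $300 + $250
print(estimate_cost_usd(30_000_000, 5_000_000))  # 550.0
```

The estimate ignores any caching or batch discounts, so treat it as an upper-level sanity check rather than a bill.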

Where the limits still are

Long-horizon autonomy does not mean unsupervised deployment. Anthropic shipped Fable 5 under ASL-3 protections with real-time classifiers that can refuse mid-run (details in our safety overview), and an under-resourced agent - missing a credential, a data mount, or a tool - will still discover the gap mid-run and stall. The discipline that pays off is the same one human teams use: reconcile the task against the available resources before kicking off, then let the model run.
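Reconciling the task against its resources can be as simple as a preflight script run before launch. The following is a minimal sketch using only the Python standard library; the function name `preflight` and the example variable, path, and tool names are hypothetical placeholders for whatever your own task needs.

```python
import os
import shutil
from pathlib import Path


def preflight(env_vars: list[str], paths: list[str], tools: list[str]) -> list[str]:
    """Return a description of every missing resource; an empty list means ready."""
    missing = []
    # Credentials: the variable must be set and non-empty.
    missing += [f"env var not set: {name}" for name in env_vars if not os.environ.get(name)]
    # Data mounts and working directories.
    missing += [f"path not found: {p}" for p in paths if not Path(p).exists()]
    # Command-line tools the agent will call.
    missing += [f"tool not on PATH: {t}" for t in tools if shutil.which(t) is None]
    return missing


# Hypothetical requirements for a migration run.
gaps = preflight(env_vars=["MIGRATION_DB_URL"],
                 paths=["./data"],
                 tools=["git"])
for gap in gaps:
    print(gap)
print("ready to launch" if not gaps else f"{len(gaps)} gap(s) to fix before launch")
```

Finding a missing credential here costs seconds; finding it at hour six of a run costs the tokens spent up to that point.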

The honest summary: Fable 5 doesn't make agents magical. It makes them boring - in the way that reliable infrastructure is boring. You specify, you launch, and the work is usually done when you come back.

Related reading