Claude Fable 5 Benchmarks: The Complete Picture

Anthropic published an unusually broad evaluation suite with the Fable 5 launch, spanning software engineering, knowledge work, computer use, and domain-specific tests in law, health, biology, and cybersecurity. Fable 5 sets the state of the art on nearly all of them - on several by more than 10 points over Opus 4.8 - while GPT-5.5 keeps two notable crowns. Here is every published number, organized by category, with analysis and the caveats that should accompany any launch-day scorecard.

Unless noted, comparisons are Claude Fable 5 / Claude Opus 4.8 / GPT-5.5 / Gemini 3.1 Pro. A dash means no published score.

Coding and agentic engineering

Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro
SWE-Bench Pro | 80.3 | 69.2 | 58.6 | 54.2
FrontierCode Diamond | 29.3 | 13.4 | 5.7 | -
Terminal-Bench 2.1 | 88.0 | 82.7 | 83.4 | 70.7

SWE-Bench Pro is the headline: an 11.1-point jump over Opus 4.8 and a 21.7-point lead over GPT-5.5 on realistic software-engineering tasks. But FrontierCode Diamond is the more interesting result. It is designed to be brutally hard - problems at the edge of what working engineers can do - and Fable 5's 29.3 is 2.2x Opus 4.8's score and more than 5x GPT-5.5's. On Terminal-Bench, note that GPT-5.5 actually edged Opus 4.8 (83.4 vs 82.7); Fable 5 retakes the lead with room to spare.
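The gaps and multiples quoted above follow directly from the table. A minimal Python sketch reproduces them, using the scores copied from the rows above; the dictionary layout and function name are illustrative choices of ours, not an official data format.

```python
# Scores copied from the coding table: (Fable 5, Opus 4.8, GPT-5.5).
# The structure and names here are illustrative assumptions.
coding = {
    "SWE-Bench Pro": (80.3, 69.2, 58.6),
    "FrontierCode Diamond": (29.3, 13.4, 5.7),
    "Terminal-Bench 2.1": (88.0, 82.7, 83.4),
}


def margins(fable, opus, gpt):
    """Return point gaps and score ratios of Fable 5 against the other two models."""
    return {
        "gap_vs_opus": round(fable - opus, 1),
        "gap_vs_gpt": round(fable - gpt, 1),
        "ratio_vs_opus": round(fable / opus, 1),
        "ratio_vs_gpt": round(fable / gpt, 1),
    }


results = {name: margins(*scores) for name, scores in coding.items()}
for name, m in results.items():
    print(name, m)
```

Point gaps and ratios tell different stories: the FrontierCode Diamond lead looks moderate in points but large as a multiple, because every score on that benchmark is low.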

Knowledge work and reasoning

Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro
GDPval-AA (Elo) | 1932 | 1890 | 1769 | 1314
Humanity's Last Exam (no tools) | 59.0 | 49.8 | 41.4 | 44.4
Humanity's Last Exam (with tools) | 64.5 | 57.9 | 52.2 | 51.4
GDP.pdf | 29.8 | 22.5 | 24.9 | 16.7

GDPval-AA deserves a word of explanation, because Elo is not a percentage. The benchmark pits models against each other on real economically valuable knowledge work - the kind of deliverables professionals are paid to produce - and scores them like chess players, by win rate in head-to-head comparisons. Assuming the conventional 400-point Elo scale, a 163-point gap (Fable 5 vs GPT-5.5) implies Fable 5's output is preferred roughly 72% of the time; the 618-point gap to Gemini 3.1 Pro, which works out to about 97%, is a different league entirely. For anyone evaluating these models for report-writing, analysis, or document work, this is arguably the most decision-relevant number on the page.
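To make the Elo-to-preference conversion concrete, here is a minimal sketch using the standard logistic Elo formula. It assumes GDPval-AA follows the conventional 400-point scale, which the published numbers do not confirm; the function name is ours.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumes the conventional 400-point scale; a tie counts as half a win.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings copied from the GDPval-AA row above.
fable, gpt, gemini = 1932, 1769, 1314

vs_gpt = elo_expected_score(fable, gpt)
vs_gemini = elo_expected_score(fable, gemini)
print(f"Fable 5 vs GPT-5.5: {vs_gpt:.1%}")
print(f"Fable 5 vs Gemini 3.1 Pro: {vs_gemini:.1%}")
```

Under that assumption the two gaps come out at roughly 72% and 97%. The relationship is nonlinear, so a gap four times as large does not mean a preference rate four times as high.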

The Humanity's Last Exam jump (59.0 no-tools, up 9.2 points) is also notable because it cannot be attributed to tooling - that is raw knowledge and reasoning.

Computer use and vision

Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro
OSWorld-Verified | 85.0 | 83.4 | 78.7 | 76.2
Blueprint-Bench 2 | 38.6 | 14.5 | 36.2 | 26.5

OSWorld-Verified (driving a real computer through its GUI) shows a modest 1.6-point gain - this was already a Claude strength. Blueprint-Bench 2, which tests interpretation of technical drawings, is the opposite story: Opus 4.8 was weak at 14.5, GPT-5.5 led the field, and Fable 5 jumps 2.7x to take the top spot. Anecdotally, the vision gains track with launch demos: Fable 5 completed Pokemon FireRed using vision alone and rebuilt a working web app from nothing but screenshots.

Domain-specific: legal, health, biology, cyber

Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro
Legal Agent Benchmark | 13.3 | 10.4 | 2.1 | 0.0
HealthBench Professional | 66.0 | 56.9 | 51.8 | -
BioMysteryBench (hard) | 46.1 | 40.0 | - | -
ExploitBench | 78.0 | 40.0 | 34.0 | -

The Legal Agent Benchmark numbers look tiny, but read them as a difficulty signal: Gemini 3.1 Pro scores zero and GPT-5.5 barely registers, while Fable 5's 13.3 is supported by qualitative evidence - a blind review by Davis Polk lawyers called its work "materially different." ExploitBench is the eye-opener: 78.0 against Opus 4.8's 40.0, nearly doubling the previous Claude score. That result is precisely why the unrestricted Mythos 5 configuration is limited to vetted cyberdefense partners.

Where GPT-5.5 still leads

OpenAI keeps two significant leads: ARC-AGI-2 (85.0) and GPQA Diamond (94.4). The pattern is coherent. Both reward abstract, puzzle-like reasoning - novel pattern induction and graduate-level science questions answered in a single sitting. Fable 5's wins cluster around agentic execution: long tasks, tools, real environments, and self-verification across many steps. The frontier, in other words, has split into specializations: GPT-5.5 remains the strongest pure abstract reasoner on these tests, while Fable 5 is the strongest at sustained, goal-directed work. Which matters more depends entirely on your workload.

Honest caveats

Read launch benchmarks skeptically. These are vendor-reported numbers, published by Anthropic on launch day, with harness details and effort settings under Anthropic's control. Competitor scores may not reflect those vendors' best configurations.
  • Saturation debate. Several classic benchmarks are near ceiling for all frontier models, which is why newer, harsher tests (FrontierCode Diamond, GDP.pdf, Legal Agent Benchmark) dominate this cycle - and those have shorter track records and less independent scrutiny.
  • Missing cells. Gemini 3.1 Pro has no published score on four rows and GPT-5.5 on one; absence of a number is not a zero (except where 0.0 is the actual score).
  • Benchmarks are not workloads. Early customer reports - Cursor calling it "state of the art on CursorBench," Hebbia's "first to break 90% on our core analytics benchmark" - are encouraging precisely because they are independent evals, but your own tasks remain the test that matters.
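The missing-cells caveat is easy to get wrong once these tables are loaded into code, where a blank can quietly become a zero. A minimal sketch of one way to keep the two apart, using None for unpublished scores; the data layout and helper names are our own assumptions, and the scores are copied from the domain-specific table above.

```python
from typing import Optional

MODELS = ("Fable 5", "Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro")

# None marks a cell with no published score; 0.0 is a real, reported zero.
domain: dict[str, tuple[Optional[float], ...]] = {
    "Legal Agent Benchmark": (13.3, 10.4, 2.1, 0.0),
    "HealthBench Professional": (66.0, 56.9, 51.8, None),
    "BioMysteryBench (hard)": (46.1, 40.0, None, None),
    "ExploitBench": (78.0, 40.0, 34.0, None),
}


def leader(scores):
    """Return (model, score) for the best published score, skipping missing cells."""
    published = [(m, s) for m, s in zip(MODELS, scores) if s is not None]
    return max(published, key=lambda pair: pair[1])


def published_count(scores):
    """Count how many models have a published score on this row."""
    return sum(s is not None for s in scores)


leaders = {name: leader(scores) for name, scores in domain.items()}
coverage = {name: published_count(scores) for name, scores in domain.items()}

for name in domain:
    model, score = leaders[name]
    print(f"{name}: {model} leads at {score} ({coverage[name]} of {len(MODELS)} published)")
```

Reporting coverage next to the leader makes it visible when a lead was established against only one or two competitors, as on BioMysteryBench (hard).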

With those caveats logged, the overall shape is hard to argue with: across the thirteen evaluations tabulated above, spanning four categories, Fable 5 leads on every one, often by the largest single-generation margins in recent memory. The exceptions are the two GPT-5.5 leads noted earlier, ARC-AGI-2 and GPQA Diamond. For what that means for your wallet, see our pricing breakdown.
