Claude Fable 5 Benchmarks: The Complete Picture

Anthropic published an unusually broad evaluation suite with the Fable 5 launch, spanning software engineering, knowledge work, computer use, and domain-specific tests in law, health, biology, and cybersecurity. Fable 5 sets the state of the art on nearly all of them - on several by more than 10 points over Opus 4.8 - while GPT-5.5 keeps two notable crowns. Here is every published number, organized by category, with analysis and the caveats that should accompany any launch-day scorecard.

Unless noted, comparisons are Claude Fable 5 / Claude Opus 4.8 / GPT-5.5 / Gemini 3.1 Pro. A dash means no published score.

Coding and agentic engineering

Benchmark	Fable 5	Opus 4.8	GPT-5.5	Gemini 3.1 Pro
SWE-Bench Pro	80.3	69.2	58.6	54.2
FrontierCode Diamond	29.3	13.4	5.7	-
Terminal-Bench 2.1	88.0	82.7	83.4	70.7

SWE-Bench Pro is the headline: an 11.1-point jump over Opus 4.8 and a 21.7-point lead over GPT-5.5 on realistic software-engineering tasks. But FrontierCode Diamond is the more interesting result. It is designed to be brutally hard - problems at the edge of what working engineers can do - and Fable 5's 29.3 is 2.2x Opus 4.8's score and more than 5x GPT-5.5's. On Terminal-Bench 2.1, note that GPT-5.5 actually edged Opus 4.8 (83.4 vs 82.7); Fable 5 retakes the lead at 88.0, 4.6 points clear of GPT-5.5.
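The margins and multiples quoted above are simple arithmetic on the table, and a short script makes them easy to re-check. This is a minimal sketch: the scores are copied from the table, and the helper name `margin_and_ratio` is just an illustrative choice.

```python
# Scores copied from the coding table: (Fable 5, Opus 4.8, GPT-5.5).
scores = {
    "SWE-Bench Pro": (80.3, 69.2, 58.6),
    "FrontierCode Diamond": (29.3, 13.4, 5.7),
    "Terminal-Bench 2.1": (88.0, 82.7, 83.4),
}


def margin_and_ratio(new: float, old: float) -> tuple[float, float]:
    """Return (gain in points, multiplicative ratio), each rounded to one decimal."""
    return round(new - old, 1), round(new / old, 1)


for name, (fable, opus, gpt) in scores.items():
    gain_opus, ratio_opus = margin_and_ratio(fable, opus)
    gain_gpt, ratio_gpt = margin_and_ratio(fable, gpt)
    print(f"{name}: +{gain_opus} vs Opus 4.8 ({ratio_opus}x), "
          f"+{gain_gpt} vs GPT-5.5 ({ratio_gpt}x)")
```

Reporting both the point gap and the ratio matters on low-scoring benchmarks such as FrontierCode Diamond, where a 15.9-point gain is also more than a doubling.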

Knowledge work and reasoning

Benchmark	Fable 5	Opus 4.8	GPT-5.5	Gemini 3.1 Pro
GDPval-AA (Elo)	1932	1890	1769	1314
Humanity's Last Exam (no tools)	59.0	49.8	41.4	44.4
Humanity's Last Exam (with tools)	64.5	57.9	52.2	51.4
GDP.pdf	29.8	22.5	24.9	16.7

GDPval-AA deserves a word of explanation, because Elo is not a percentage. The benchmark pits models against each other on real economically valuable knowledge work - the kind of deliverables professionals are paid to produce - and scores them like chess players, by win rate in head-to-head comparisons. Under the standard Elo formula, a 163-point gap (Fable 5 vs GPT-5.5) implies Fable 5's output is preferred roughly 72% of the time; the 618-point gap to Gemini 3.1 Pro implies about 97%, a different league entirely. For anyone evaluating these models for report-writing, analysis, or document work, this is arguably the most decision-relevant number on the page.
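To turn an Elo gap into an expected preference rate, the usual logistic Elo curve can be applied. The sketch below assumes GDPval-AA follows the conventional chess parameters (base 10, 400-point scale), which the published numbers do not confirm; the function name `elo_expected_score` is illustrative.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected head-to-head score for the higher-rated side.

    Assumes the conventional Elo curve (base 10, 400-point scale);
    the leaderboard's actual parameters may differ.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


# Elo ratings copied from the GDPval-AA row of the table.
fable = 1932
rivals = {"Opus 4.8": 1890, "GPT-5.5": 1769, "Gemini 3.1 Pro": 1314}

for name, rating in rivals.items():
    gap = fable - rating
    print(f"Fable 5 vs {name}: gap {gap}, "
          f"expected preference {elo_expected_score(gap):.0%}")
```

On these assumptions the 42-point gap to Opus 4.8 works out to only a modest edge, which is a useful reminder that Elo differences compress near the top of the table.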

The Humanity's Last Exam jump (59.0 no-tools, up 9.2 points) is also notable because it cannot be attributed to tooling - that is raw knowledge and reasoning.

Computer use and vision

Benchmark	Fable 5	Opus 4.8	GPT-5.5	Gemini 3.1 Pro
OSWorld-Verified	85.0	83.4	78.7	76.2
Blueprint-Bench 2	38.6	14.5	36.2	26.5

OSWorld-Verified (driving a real computer through its GUI) shows a modest 1.6-point gain - this was already a Claude strength. Blueprint-Bench 2, which tests interpretation of technical drawings, is the opposite story: Opus 4.8 was weak at 14.5, GPT-5.5 led the field, and Fable 5 jumps 2.7x to take the top spot. Anecdotally, the vision gains track with launch demos: Fable 5 completed Pokemon FireRed using vision alone and rebuilt a working web app from nothing but screenshots.

Domain-specific: legal, health, biology, cyber

Benchmark	Fable 5	Opus 4.8	GPT-5.5	Gemini 3.1 Pro
Legal Agent Benchmark	13.3	10.4	2.1	0.0
HealthBench Professional	66.0	56.9	51.8	-
BioMysteryBench (hard)	46.1	40.0	-	-
ExploitBench	78.0	40.0	34.0	-

The Legal Agent Benchmark numbers look tiny, but read them as a difficulty signal: Gemini 3.1 Pro scores zero and GPT-5.5 barely registers, while Fable 5's 13.3 is supported by qualitative evidence - a blind review by Davis Polk lawyers called its work "materially different." ExploitBench is the eye-opener: 78.0 against Opus 4.8's 40.0, nearly doubling the previous Claude score. That result is precisely why the unrestricted Mythos 5 configuration is limited to vetted cyberdefense partners.

Where GPT-5.5 still leads

OpenAI keeps two significant leads: ARC-AGI-2 (85.0) and GPQA Diamond (94.4). The pattern is coherent. Both reward abstract, puzzle-like reasoning - novel pattern induction and graduate-level science questions answered in a single sitting. Fable 5's wins cluster around agentic execution: long tasks, tools, real environments, and self-verification across many steps. The frontier, in other words, has split into specializations: GPT-5.5 remains the strongest pure abstract reasoner on these tests, while Fable 5 is the strongest at sustained, goal-directed work. Which matters more depends entirely on your workload.

Honest caveats

Read launch benchmarks skeptically. These are vendor-reported numbers, published by Anthropic on launch day, with harness details and effort settings under Anthropic's control. Competitor scores may not reflect those vendors' best configurations.

Saturation debate. Several classic benchmarks are near ceiling for all frontier models, which is why newer, harsher tests (FrontierCode Diamond, GDP.pdf, Legal Agent Benchmark) dominate this cycle - and those have shorter track records and less independent scrutiny.
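Missing cells. Gemini 3.1 Pro has no published score on four rows (FrontierCode Diamond, HealthBench Professional, BioMysteryBench, ExploitBench) and GPT-5.5 has none on BioMysteryBench; absence of a number is not a zero. The one true zero in the tables is Gemini 3.1 Pro's 0.0 on the Legal Agent Benchmark.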
Benchmarks are not workloads. Early customer reports - Cursor calling it "state of the art on CursorBench," Hebbia's "first to break 90% on our core analytics benchmark" - are encouraging precisely because they come from the customers' own evals rather than Anthropic's harness, but your own tasks remain the test that matters.
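The missing-cells caveat becomes concrete as soon as the tables are loaded into a script: a dash has to stay "unknown" rather than turn into 0.0. Here is a minimal sketch, assuming the rows are tab-separated as in the tables above; `parse_row` and `leader` are hypothetical helper names.

```python
from typing import Optional

MODELS = ["Fable 5", "Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro"]


def parse_row(line: str) -> tuple[str, dict[str, Optional[float]]]:
    """Split one tab-separated row; a dash becomes None, never 0.0."""
    name, *cells = line.split("\t")
    values = [None if cell.strip() == "-" else float(cell) for cell in cells]
    return name, dict(zip(MODELS, values))


def leader(scores: dict[str, Optional[float]]) -> str:
    """Return the top-scoring model among those with a published score."""
    published = {m: v for m, v in scores.items() if v is not None}
    return max(published, key=published.get)


rows = [
    "Legal Agent Benchmark\t13.3\t10.4\t2.1\t0.0",
    "HealthBench Professional\t66.0\t56.9\t51.8\t-",
]

for row in rows:
    name, scores = parse_row(row)
    missing = [m for m, v in scores.items() if v is None]
    print(f"{name}: leader {leader(scores)}, no published score for {missing}")
```

Keeping `None` distinct from `0.0` means a real zero (Gemini 3.1 Pro on the Legal Agent Benchmark) still counts in comparisons, while an unpublished score is simply excluded.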

With those caveats logged, the overall shape is hard to argue with: across the thirteen tabulated evaluations spanning four categories, Fable 5 posts the top score on every one, in several cases by double-digit margins over Opus 4.8. The exceptions sit outside the tables: ARC-AGI-2 and GPQA Diamond, where GPT-5.5 still leads. For what that means for your wallet, see our pricing breakdown.