Claude Fable 5 Benchmarks: The Complete Picture
Anthropic published an unusually broad evaluation suite with the Fable 5 launch, spanning software engineering, knowledge work, computer use, and domain-specific tests in law, health, biology, and cybersecurity. Fable 5 sets the state of the art on nearly all of them - on several by more than 10 points over Opus 4.8 - while GPT-5.5 keeps two notable crowns. Here is every published number, organized by category, with analysis and the caveats that should accompany any launch-day scorecard.
Unless noted, comparisons are Claude Fable 5 / Claude Opus 4.8 / GPT-5.5 / Gemini 3.1 Pro. A dash means no published score.
Coding and agentic engineering
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 80.3 | 69.2 | 58.6 | 54.2 |
| FrontierCode Diamond | 29.3 | 13.4 | 5.7 | - |
| Terminal-Bench 2.1 | 88.0 | 82.7 | 83.4 | 70.7 |
SWE-Bench Pro is the headline: an 11.1-point jump over Opus 4.8 and a 21.7-point lead over GPT-5.5 on realistic software-engineering tasks. But FrontierCode Diamond is the more interesting result. It is designed to be brutally hard - problems at the edge of what working engineers can do - and Fable 5's 29.3 is 2.2x Opus 4.8's score and more than 5x GPT-5.5's. On Terminal-Bench 2.1, note that GPT-5.5 actually edged Opus 4.8 (83.4 vs 82.7); Fable 5 retakes the lead at 88.0, a 4.6-point margin over GPT-5.5.
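The point gaps and multiples quoted above are simple arithmetic on the table, and a short script makes them easy to recheck. The following is a minimal sketch: the `SCORES` dictionary and the `margin` helper are illustrative names introduced here, not part of any published tooling, and the numbers are copied from the coding table.

```python
from typing import Optional, Tuple

# Scores copied from the coding table above; None stands for a dash (no published score).
SCORES = {
    "SWE-Bench Pro": {"Fable 5": 80.3, "Opus 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "FrontierCode Diamond": {"Fable 5": 29.3, "Opus 4.8": 13.4, "GPT-5.5": 5.7, "Gemini 3.1 Pro": None},
    "Terminal-Bench 2.1": {"Fable 5": 88.0, "Opus 4.8": 82.7, "GPT-5.5": 83.4, "Gemini 3.1 Pro": 70.7},
}


def margin(benchmark: str, baseline: str, model: str = "Fable 5") -> Optional[Tuple[float, float]]:
    """Return (point gap, ratio) of `model` over `baseline`, or None if a score is missing."""
    ours = SCORES[benchmark][model]
    theirs = SCORES[benchmark][baseline]
    if ours is None or theirs is None:
        return None  # a missing score is not a zero, so no margin is reported
    return round(ours - theirs, 1), round(ours / theirs, 2)


# Print every Fable 5 margin for the coding table (hypothetical helper, see lead-in).
for bench in SCORES:
    for baseline in ("Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro"):
        print(bench, "vs", baseline, "->", margin(bench, baseline))
```

For example, `margin("FrontierCode Diamond", "Opus 4.8")` gives a 15.9-point gap and a ratio of about 2.19, which the text rounds to 2.2x. Reporting both the gap and the ratio matters on low-scoring benchmarks, where a large multiple can correspond to a modest absolute difference.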
Knowledge work and reasoning
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| GDPval-AA (Elo) | 1932 | 1890 | 1769 | 1314 |
| Humanity's Last Exam (no tools) | 59.0 | 49.8 | 41.4 | 44.4 |
| Humanity's Last Exam (with tools) | 64.5 | 57.9 | 52.2 | 51.4 |
| GDP.pdf | 29.8 | 22.5 | 24.9 | 16.7 |
GDPval-AA deserves a word of explanation, because Elo is not a percentage. The benchmark pits models against each other on real economically valuable knowledge work - the kind of deliverables professionals are paid to produce - and scores them like chess players, by win rate in head-to-head comparisons. Under the standard Elo formula, a 163-point gap (Fable 5 vs GPT-5.5) implies Fable 5's output is preferred roughly 72% of the time; the 618-point gap to Gemini 3.1 Pro implies roughly 97%, a different league entirely. By the same formula, the 42-point gap to Opus 4.8 works out to only about 56%. For anyone evaluating these models for report-writing, analysis, or document work, this is arguably the most decision-relevant number on the page.
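To make the Elo-to-preference conversion concrete, here is a minimal sketch using the standard logistic Elo formula with a 400-point scale. That GDPval-AA uses exactly this scale, and how it treats ties, is an assumption; the source only says models are scored "like chess players". The function name `elo_expected_score` is illustrative.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    Assumes the conventional 400-point logistic scale; the benchmark's exact
    scoring rules are not confirmed by the article.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# Rating gaps taken from the GDPval-AA row of the table above.
gaps = {
    "Fable 5 vs Opus 4.8": 1932 - 1890,
    "Fable 5 vs GPT-5.5": 1932 - 1769,
    "Fable 5 vs Gemini 3.1 Pro": 1932 - 1314,
}

for label, gap in gaps.items():
    print(f"{label}: gap {gap}, expected preference {elo_expected_score(gap):.1%}")
```

Under these assumptions the 163-point gap comes out a little under 72% and the 618-point gap a little over 97%. An expected score counts a tie as half a win, so it is a preference rate only in that loose sense.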
The Humanity's Last Exam jump (59.0 no-tools, up 9.2 points from Opus 4.8's 49.8) is also notable because it cannot be attributed to tooling - that is raw knowledge and reasoning.
Computer use and vision
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| OSWorld-Verified | 85.0 | 83.4 | 78.7 | 76.2 |
| Blueprint-Bench 2 | 38.6 | 14.5 | 36.2 | 26.5 |
OSWorld-Verified (driving a real computer through its GUI) shows a modest 1.6-point gain - this was already a Claude strength. Blueprint-Bench 2, which tests interpretation of technical drawings, is the opposite story: Opus 4.8 was weak at 14.5, GPT-5.5 led the field, and Fable 5 jumps 2.7x to take the top spot. Anecdotally, the vision gains track with launch demos: Fable 5 completed Pokemon FireRed using vision alone and rebuilt a working web app from nothing but screenshots.
Domain-specific: legal, health, biology, cyber
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Legal Agent Benchmark | 13.3 | 10.4 | 2.1 | 0.0 |
| HealthBench Professional | 66.0 | 56.9 | 51.8 | - |
| BioMysteryBench (hard) | 46.1 | 40.0 | - | - |
| ExploitBench | 78.0 | 40.0 | 34.0 | - |
The Legal Agent Benchmark numbers look tiny, but read them as a difficulty signal: Gemini 3.1 Pro scores zero and GPT-5.5 barely registers, while Fable 5's 13.3 is supported by qualitative evidence - a blind review by Davis Polk lawyers called its work "materially different." ExploitBench is the eye-opener: 78.0 against Opus 4.8's 40.0, nearly doubling the previous Claude score. That result is precisely why the unrestricted Mythos 5 configuration is limited to vetted cyberdefense partners.
Where GPT-5.5 still leads
OpenAI keeps two significant leads: ARC-AGI-2 (85.0) and GPQA Diamond (94.4). The pattern is coherent. Both reward abstract, puzzle-like reasoning - novel pattern induction and graduate-level science questions answered in a single sitting. Fable 5's wins cluster around agentic execution: long tasks, tools, real environments, and self-verification across many steps. The frontier, in other words, has split into specializations: GPT-5.5 remains the strongest pure abstract reasoner on these tests, while Fable 5 is the strongest at sustained, goal-directed work. Which matters more depends entirely on your workload.
Honest caveats
- Saturation debate. Several classic benchmarks are near ceiling for all frontier models, which is why newer, harsher tests (FrontierCode Diamond, GDP.pdf, Legal Agent Benchmark) dominate this cycle - and those have shorter track records and less independent scrutiny.
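- Missing cells. Gemini 3.1 Pro has no published score on four rows (FrontierCode Diamond, HealthBench Professional, BioMysteryBench, ExploitBench), and GPT-5.5 has none on BioMysteryBench; absence of a number is not a zero (except where 0.0 is the actual score, as on the Legal Agent Benchmark).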
- Benchmarks are not workloads. Early customer reports - Cursor calling it "state of the art on CursorBench," Hebbia's "first to break 90% on our core analytics benchmark" - are encouraging precisely because they are independent evals, but your own tasks remain the test that matters.
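The missing-cells caveat has a practical side for anyone loading these tables into a script: a dash must stay distinct from a genuine 0.0. The sketch below shows one way to parse a table row while preserving that distinction; `parse_cell` and `parse_row` are hypothetical helpers written for this article's table layout, not part of any library.

```python
from typing import List, Optional, Tuple


def parse_cell(cell: str) -> Optional[float]:
    """Convert a table cell to a float, keeping a dash as None (no published score)."""
    text = cell.strip()
    return None if text == "-" else float(text)


def parse_row(row: str) -> Tuple[str, List[Optional[float]]]:
    """Split one markdown table row into (benchmark name, list of scores)."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    return cells[0], [parse_cell(c) for c in cells[1:]]


# Two rows from the tables above: one with a real 0.0, one with a missing score.
legal = parse_row("| Legal Agent Benchmark | 13.3 | 10.4 | 2.1 | 0.0 |")
frontier = parse_row("| FrontierCode Diamond | 29.3 | 13.4 | 5.7 | - |")

print(legal)
print(frontier)
```

Keeping `None` for dashes means any later average or ranking has to skip those cells explicitly, instead of silently counting an unpublished result as a failure.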
With those caveats logged, the overall shape is hard to argue with: Fable 5 leads on all thirteen evaluations tabulated above, spanning four categories, often by the largest single-generation margins in recent memory. The only published exceptions are ARC-AGI-2 and GPQA Diamond, where GPT-5.5 stays ahead. For what that means for your wallet, see our pricing breakdown.
Related reading
- Fable 5 vs Opus 4.8: is 2x the price worth it?
- Claude Fable 5 pricing explained
- What is Claude Mythos 5? The model behind Fable 5
- Getting started with the Fable 5 API
- Using Claude Fable 5 in Claude Code
- The Fable 5 effort parameter guide
- Migrating to Claude Fable 5
- Inside Fable 5's Mythos-class safety system
- Claude model comparison