OpenAI released o5 yesterday — the next evolution of its reasoning model line, replacing o3 and o4. On the GPQA Diamond benchmark (PhD-level physics, biology, and chemistry questions), o5 scores 73.2%, up from o4's 58% and beating Claude Opus 4.7's 71%.
More importantly: o3 just dropped to one-tenth of its previous price, and o5 debuts close to the premium tier o3 occupied when it launched eight months ago.
## The benchmarks that moved
On standard reasoning benchmarks:
- **GPQA Diamond**: 73.2% (was 58% on o4)
- **AIME 2025**: 96.8% (was 89% on o4)
- **Codeforces Elo**: 2,847 (was 2,420 on o4) — top 0.1% of human competitive programmers
- **FrontierMath**: 41.2% (was 24% on o4) — research-level math, where pre-reasoning models maxed out around 5%
The FrontierMath jump is the headline result. Tao, Gowers, and other Fields Medalists who contributed problems estimated the benchmark would resist AI for "years". o5 solves 41% of it in a single pass with chain-of-thought reasoning.
## Pricing structure
OpenAI restructured the entire reasoning tier:
- **o3**: now $0.50 input / $2.00 output per million tokens (was $5 / $20)
- **o4**: now $2.00 input / $8.00 output per million tokens (was $15 / $60)
- **o5**: $8.00 input / $32.00 output per million tokens (premium tier)
- **o5-mini**: $0.30 input / $1.20 output per million tokens
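What the restructuring means per request is easiest to see with a little arithmetic. A minimal sketch using the per-million-token prices from the table above (the token counts in the example are hypothetical):

```python
# Per-million-token (input, output) prices from the restructured tiers above.
PRICES = {
    "o3":      (0.50, 2.00),
    "o4":      (2.00, 8.00),
    "o5":      (8.00, 32.00),
    "o5-mini": (0.30, 1.20),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Example: a 2,000-token prompt with a 10,000-token reasoning response.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 10_000):.4f}")
```

At those hypothetical token counts, the same request costs about $0.021 on repriced o3 versus $0.336 on o5 — a 16x spread inside one provider's lineup.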
The o5-mini variant is the dark horse. It hits 65% on GPQA Diamond at under 4% of o5's per-token price — making advanced reasoning available for routine production workloads.
## What changed under the hood
In the launch livestream, Sam Altman highlighted three architectural improvements:
- **Compressed reasoning traces**: o5 produces 40% shorter chains for equivalent accuracy
- **Tool calling within reasoning**: o5 can invoke search, code execution, and file read mid-reasoning without breaking the chain
- **Self-correction loops**: when an early step is wrong, the model now backtracks rather than committing forward
The third point is the architectural shift. Previous reasoning models built monotonically — each step assumed prior steps were correct. o5 explicitly verifies intermediate conclusions and revises.
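To make the contrast concrete, here is a toy sketch of verify-and-backtrack reasoning. Nothing in it is OpenAI's actual implementation; `propose_step` and `verify` are stand-ins for model internals:

```python
def solve_with_backtracking(problem, propose_step, verify, max_steps=20):
    """Conceptual sketch: build a reasoning chain, but check each
    intermediate step and backtrack on failure, instead of committing
    forward monotonically as earlier reasoning models did."""
    chain = []
    for _ in range(max_steps):
        step = propose_step(problem, chain)
        if step is None:            # no further steps: chain is the answer
            return chain
        if verify(problem, chain, step):
            chain.append(step)      # step checks out: commit it
        elif chain:
            chain.pop()             # step fails: revise the previous step
    return chain

# Toy demo: propose 1, 2, 3, 4 in order; the verifier rejects anything
# above 3, which forces the loop to revise step 3 away before finishing.
steps = iter([1, 2, 3, 4, None])
chain = solve_with_backtracking(
    problem=None,
    propose_step=lambda p, c: next(steps),
    verify=lambda p, c, s: s <= 3,
)
print(chain)  # [1, 2] — the failed step 4 triggered a backtrack
```

A monotonic reasoner in the same situation would have kept the chain `[1, 2, 3]` and pressed on past the failed check — which is exactly the error-compounding behavior the self-correction loop is meant to avoid.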
## Why this matters
For research, science, and complex engineering: this is the first model that can reliably help solve graduate-level technical problems instead of requiring an expert to verify every step.
For everyone else: the pricing crash on o3 means anyone can route routine tasks to a model that scores 58% on GPQA Diamond. That's college-senior-level reasoning at $0.50/M input tokens — cheaper than GPT-3.5 was 18 months ago.
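In practice, "routing routine tasks" can be as simple as mapping a task-difficulty label to a model ID. A hedged sketch — the model IDs come from this article's tier table, and the difficulty labels are invented for illustration:

```python
# Cost-based router: send routine tasks to the cheap tier, hard ones to o5.
# Model IDs follow this article's naming; adjust to what your account exposes.
ROUTES = {
    "routine":  "o3",       # 58% GPQA Diamond at $0.50/M input
    "standard": "o5-mini",  # 65% GPQA Diamond at $0.30/M input
    "frontier": "o5",       # 73.2% GPQA Diamond, premium tier
}

def pick_model(difficulty: str) -> str:
    """Map a caller-supplied difficulty label to a model ID."""
    return ROUTES.get(difficulty, "o3")  # default to the cheap tier

# With the OpenAI Python SDK, usage would then look like:
#   client.chat.completions.create(model=pick_model("routine"), messages=...)
```

The interesting design question is who assigns the difficulty label — a static heuristic, the caller, or a cheap classifier model in front of the router.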
## Sources
- OpenAI Blog (April 27, 2026): Introducing OpenAI o5
- The Information (April 28, 2026): OpenAI's o5 cracks PhD-level physics
- OpenAI Pricing Update (April 27, 2026)