AI Fails Rigorous 'Last Exam' Benchmark
11 Jun
Summary
- New AI benchmark measures performance on economically valuable, long-horizon workflows.
- OpenAI's GPT-5.5 leads with a 24.0% pass rate; most models fail.
- Benchmark uses deterministic, code-based evaluation, reducing reliance on subjective LLM judgments.

Researchers at UC Berkeley's Center for Responsible, Decentralized Intelligence have launched Agents' Last Exam (ALE), a new benchmark that evaluates how well AI agents execute long-horizon professional workflows. Developed with more than 300 domain experts, ALE aims to bridge the gap between AI hype and real-world economic impact.
OpenAI's GPT-5.5 achieved the highest pass rate, 24.0%, ahead of Anthropic's Claude Fable 5. Even so, the results point to substantial limitations: most AI models failed the demanding tasks, which span 55 U.S. occupational domains.
The ALE benchmark is built on a Generalist Computer-Use Agent (GCUA) framework, which assesses an agent's reasoning, perception, orchestration, tool invocation, and runtime substrate. It uses deterministic, code-based evaluations for artifacts such as 3D models or parsed filings, minimizing reliance on subjective LLM judgments.
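To illustrate what a deterministic, code-based check could look like, here is a minimal Python sketch for a parsed-filing artifact. It is not ALE's actual grader: the function names (`check_parsed_filing`, `pass_rate`), the JSON artifact format, the field names, and the numeric tolerance are all assumptions made for illustration. The point is only that the same artifact always yields the same verdict, with no model in the loop.

```python
import json
import math


def check_parsed_filing(artifact_json: str, expected: dict, rel_tol: float = 1e-6) -> bool:
    """Hypothetical grader: True only if every expected field is present and matches.

    Numbers are compared with a relative tolerance; everything else must be equal.
    """
    try:
        artifact = json.loads(artifact_json)
    except json.JSONDecodeError:
        return False  # an unparseable artifact is a deterministic fail
    if not isinstance(artifact, dict):
        return False

    for key, want in expected.items():
        if key not in artifact:
            return False
        got = artifact[key]
        want_is_number = isinstance(want, (int, float)) and not isinstance(want, bool)
        if want_is_number:
            got_is_number = isinstance(got, (int, float)) and not isinstance(got, bool)
            if not got_is_number or not math.isclose(got, want, rel_tol=rel_tol):
                return False
        elif got != want:
            return False
    return True


def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty result list."""
    return sum(results) / len(results) if results else 0.0


# Illustrative data only (not taken from the benchmark).
expected = {"company": "Acme Corp", "revenue": 1250000}
good = '{"company": "Acme Corp", "revenue": 1250000.0, "notes": "extra fields are ignored"}'
bad = '{"company": "Acme Corp", "revenue": 1200000}'
results = [check_parsed_filing(good, expected), check_parsed_filing(bad, expected)]
print(results, pass_rate(results))
```

A pass rate such as the 24.0% reported for the leading model would, under this kind of scheme, simply be the share of tasks whose checks return `True`.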
To combat benchmark contamination, ALE employs a dual-use strategy, releasing only a portion of its tasks publicly and keeping the rest private. The aim is to ensure that AI models are genuinely solving problems rather than reciting memorized test answers, which gives a more accurate measure of capability for enterprise deployment.
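The article does not say how ALE decides which tasks are released, so the following Python sketch shows just one common way to make such a split reproducible: hashing each task ID into a bucket. The function names, the salt, the task IDs, and the 50% public fraction are all hypothetical.

```python
import hashlib


def is_public(task_id: str, public_fraction: float = 0.5, salt: str = "demo-salt") -> bool:
    """Hypothetical rule: a task is public if its salted hash falls below a threshold."""
    digest = hashlib.sha256(f"{salt}:{task_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    threshold = int(public_fraction * 2**64)     # integer comparison avoids float rounding
    return bucket < threshold


def split_tasks(task_ids, public_fraction: float = 0.5, salt: str = "demo-salt"):
    """Partition task IDs into (public, held_out) lists, preserving input order."""
    public, held_out = [], []
    for task_id in task_ids:
        target = public if is_public(task_id, public_fraction, salt) else held_out
        target.append(task_id)
    return public, held_out


# Illustrative task IDs only.
tasks = [f"task-{i:03d}" for i in range(20)]
public, held_out = split_tasks(tasks)
print(len(public), "public,", len(held_out), "held out")
```

Because the assignment depends only on the task ID and the salt, rerunning the split gives the same partition, and held-out tasks never need to appear in any public release.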