AI Fails Rigorous 'Last Exam' Benchmark

11 Jun

Summary

New AI benchmark measures economically valuable, long-horizon workflows.
OpenAI's GPT-5.5 leads with a 24.0% pass rate; most models fail.
Benchmark uses deterministic evaluation, avoiding LLM judgment issues.

Researchers at UC Berkeley's Center for Responsible, Decentralized Intelligence have launched Agents' Last Exam (ALE), a new benchmark that evaluates how well AI systems carry out long-horizon professional workflows. Developed with more than 300 domain experts, ALE aims to bridge the gap between AI hype and real-world economic impact.

OpenAI's GPT-5.5 achieved the highest pass rate at 24.0%, ahead of Anthropic's Claude Fable 5. Even so, the overall results highlight substantial limitations: most AI models failed the demanding tasks, which span 55 U.S. occupational domains.

The ALE benchmark employs a Generalist Computer-Use Agent (GCUA) framework, which assesses reasoning, perception, orchestration, tool invocation, and runtime substrate. It uses deterministic, code-based evaluations for artifacts such as 3D models or parsed filings, minimizing reliance on subjective LLM judgments.
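
To illustrate what a deterministic, code-based evaluation might look like, here is a minimal Python sketch. It is not ALE's actual grader: the field names, the expected values, and the `check_filing` and `pass_rate` helpers are all hypothetical, chosen only to show how a parsed filing could be scored by exact comparison rather than by an LLM's opinion.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical expected values for a single task; not taken from ALE.
EXPECTED = {
    "company": "Example Corp",
    "fiscal_year": 2023,
    "total_revenue": "1250000.00",
}


def check_filing(artifact_json: str) -> bool:
    """Return True only if every expected field matches exactly."""
    try:
        artifact = json.loads(artifact_json)
    except json.JSONDecodeError:
        return False  # an unparseable artifact is a failure, not a judgment call
    if not isinstance(artifact, dict):
        return False
    if artifact.get("company") != EXPECTED["company"]:
        return False
    if artifact.get("fiscal_year") != EXPECTED["fiscal_year"]:
        return False
    try:
        # Compare money as decimals so "1250000.0" and "1250000.00" are equal.
        revenue = Decimal(str(artifact.get("total_revenue")))
    except InvalidOperation:
        return False
    return revenue == Decimal(EXPECTED["total_revenue"])


def pass_rate(results: list[bool]) -> float:
    """Percentage of tasks that passed; 0.0 for an empty result list."""
    return 100.0 * sum(results) / len(results) if results else 0.0
```

The same input always produces the same verdict, which is the property the benchmark is described as relying on; a pass rate such as the reported 24.0% would then simply be the share of tasks whose checks return true.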

To combat benchmark contamination, ALE employs a dual-use strategy, releasing only a portion of its tasks publicly. The goal is for scores to reflect genuine problem solving rather than memorized test answers, giving a more accurate measure of capability for enterprise deployment.
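
One way to picture a partial public release is a reproducible split keyed on task identifiers. The sketch below is an assumption for illustration only: the article does not say how ALE chooses which tasks to publish, and the `is_public` function, the task ID format, and the 25% fraction are all hypothetical.

```python
import hashlib


def is_public(task_id: str, public_fraction: float = 0.25) -> bool:
    """Deterministically assign a task to the public or held-out set.

    Hashing the ID maps each task to a stable value in [0, 1), so the
    same task always lands on the same side of the split.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < public_fraction


# Hypothetical task IDs, used only to demonstrate the split.
task_ids = [f"task-{n:04d}" for n in range(1000)]
public = [t for t in task_ids if is_public(t)]
held_out = [t for t in task_ids if not is_public(t)]
```

Tasks in the held-out set would never be published, so a model could not have seen their answers during training.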

Disclaimer: This story has been auto-aggregated and auto-summarised by a computer program. This story has not been edited or created by the Feedzop team.
