AI Fails White-Collar Job Test: New Benchmark Reveals Flaws
23 Jan
Summary
- New Apex-Agents benchmark shows AI models failing white-collar tasks.
- Models struggle most with multi-domain information retrieval.
- The top-performing model achieved only 24% accuracy on complex professional queries.

Despite predictions of AI replacing knowledge work, recent research suggests the shift is happening more slowly than expected. Mercor's new Apex-Agents benchmark, designed to mimic real professional tasks in consulting, investment banking, and law, found that leading AI models currently fall short. Even the most advanced models failed to answer more than a quarter of the complex queries correctly.
The models' primary weakness is multi-domain information retrieval, a core aspect of human knowledge work that requires pulling together data from multiple platforms such as Slack and Google Drive. This limitation was most evident in queries that demanded in-depth analysis of company policies alongside relevant laws.
While OpenAI's GDPVal benchmark assesses general knowledge, Apex-Agents focuses on sustained task performance in specific high-value professions. Even the top performer, Gemini 3 Flash at 24% accuracy, shows that AI is not yet ready to automate these roles. Still, the rapid pace of AI development suggests this benchmark will not stand for long.