AI Still Struggles with White-Collar Realities, New Benchmark Reveals
7
What is the Viqus Verdict?
We evaluate each news story based on its real impact versus its media hype to offer a clear and objective perspective.
AI Analysis:
While AI is undoubtedly advancing rapidly, the Apex-Agents benchmark’s findings offer a critical reality check that tempers inflated expectations. The benchmark score is a key indicator for the field, and this level of performance suggests significant development is still needed before AI can truly dominate white-collar professions.
Article Summary
New research from Mercor reveals a persistent gap between the capabilities of leading AI models and the demands of professional white-collar work. The Apex-Agents benchmark, designed to mirror the intricate processes of consulting, investment banking, and law, consistently shows models struggling with tasks that require multi-domain reasoning, information tracking across multiple tools, and a nuanced understanding of professional workflows. While models like Gemini 3 Flash and GPT-5.2 achieved some success, at around 24% accuracy, the vast majority of queries still resulted in incorrect answers or no response. This is largely attributed to the models' difficulty in replicating the way humans operate across diverse tools and information sources such as Slack and Google Drive, which highlights a key limitation in AI’s current ability to replace complex knowledge work. Because the benchmark focuses on sustained, high-value tasks rather than broad general knowledge, it holds AI systems to a standard on which their performance remains significantly below that of human professionals.
Key Points
- AI models consistently score poorly (around 24% accuracy) on complex tasks mimicking professional white-collar work.
- The Apex-Agents benchmark, designed to reflect real-world professional workflows, reveals a significant challenge for current AI in replicating multi-domain reasoning and information tracking.
- Despite recent advancements, AI’s inability to seamlessly operate across diverse tools and information sources – like Slack and Google Drive – remains a substantial hurdle to automating sophisticated knowledge work.