SWE-bench Verified No Longer Reliable: A Critical Update
7
AI Analysis:
A critical reassessment of a prominent AI benchmark has drawn significant media attention, but the underlying issue, the inherent limitation of relying on metrics that models may have been exposed to during training, is a long-standing one. This is not a dramatic shift but a necessary correction that will shape the future of AI evaluation.
Article Summary
OpenAI has stopped reporting SWE-bench Verified scores. The benchmark, released in 2024, was designed to track the progress of AI models on autonomous software engineering tasks, but it has become increasingly unreliable because of two fundamental flaws. First, a recent audit found that 59.4% of the problems it examined had material issues in test design and/or problem description, making them practically impossible for even advanced models to solve. Some tests were overly specific, enforcing implementation details not present in the original problem description; others were overly wide, checking for additional functionality that was never specified. Second, models were found to have been exposed to the benchmark during training, which artificially inflates scores so that they no longer reflect real-world development ability. These findings have significant implications for how AI model capabilities are assessed and compared. OpenAI recommends using SWE-bench Pro instead and is building new, uncontaminated evaluations to track coding skills more accurately.
Key Points
- SWE-bench Verified is no longer a reliable metric for measuring AI coding capabilities.
- The benchmark is contaminated: models were exposed to its problems and tests during training, which inflates their scores.
- Over 59% of audited problems contained flaws in test design or problem descriptions, making them effectively unsolvable for models.

