The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data
Summary
OpenAI’s eval leaders argue SWE-Bench Verified is now saturated and contaminated, so it no longer tracks real coding progress; the field should move to harder, less contaminated benchmarks (e.g., SWE-Bench Pro).

Actionable takeaways for builders: do not over-index on tiny gains on saturated evals; test for contamination and for overly narrow tests; and measure longer-horizon, open-ended tasks (design quality, maintainability, multi-hour tasks) rather than patch correctness alone. For researchers: invest in human-verified rubrics when you need judgment-based scoring, and combine human data with automated grading to scale.

Companies mentioned: OpenAI (Frontier Evals, Codex, human data), Scale (SWE-Bench Pro). Career advice: benchmark and evaluation work is increasingly high impact; engineers who can design reliable, human-validated evals and encode real-world task complexity will be in demand.
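To make the "test for contamination" advice concrete, here is a minimal sketch of one common heuristic: checking how much verbatim n-gram overlap a model's output has with benchmark material it was never shown. The function names and the choice of n are illustrative assumptions, not the speakers' actual methodology.

```python
# Hedged sketch: n-gram overlap as a rough contamination signal.
# High overlap between a model's unprompted output and held-out benchmark
# text suggests memorization rather than genuine problem solving.
# All names and thresholds here are illustrative, not OpenAI's method.

def ngrams(tokens, n=8):
    """Return the set of all n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_text, model_output, n=8):
    """Fraction of benchmark n-grams reproduced verbatim by the model."""
    bench = ngrams(benchmark_text.split(), n)
    out = ngrams(model_output.split(), n)
    if not bench:
        return 0.0
    return len(bench & out) / len(bench)
```

In practice one would run this over many benchmark instances and flag suspiciously high ratios for manual review; a single high score is weak evidence on its own.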
Chapter Summaries
- Chapter 1: Why SWE-Bench Verified is “over.” The benchmark is saturated and contaminated, so it’s no longer a reliable North Star.
- Chapter 2: How SWE-Bench Verified was built. OpenAI hired roughly 100 engineers to curate 500 tasks, with each task triple-reviewed for fairness.
- Chapter 3: New audit findings. Many model failures trace back to narrow or unfair tests, and frontier models show evidence of contamination.
- Chapter 4: Moving to SWE-Bench Pro. It is harder, more diverse, and shows less contamination, leaving more headroom for measuring progress.
- Chapter 5: What future benchmarks should measure. Longer tasks, open-ended design choices, code quality, and maintainability.
- Chapter 6: Human vs automated grading. Human rubrics are costly but crucial for judgment-heavy evals; automation scales once rubrics exist.
- Chapter 7: Preparedness framework. Evals track dual-use risks (bio, cyber, autonomy) and should evolve toward real-world impact metrics.
- Chapter 8: Community call. Build and share tougher, well-scored evals; track real-world usage and labor impact.