The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data
Summary
OpenAI’s eval leaders argue SWE-Bench Verified is now saturated and contaminated, so it no longer tracks real coding progress; the field should move to harder, less contaminated benchmarks (e.g., SWE-Bench Pro).

Actionable takeaways for builders: do not over-index on tiny gains on saturated evals; test for contamination and for overly narrow tests; and measure longer-horizon, open-ended tasks (design quality, maintainability, multi-hour tasks) rather than patch correctness alone. For researchers: invest in human-verified rubrics when you need judgment-based scoring, and combine human data with automated grading to scale.

Companies mentioned: OpenAI (Frontier Evals, Codex, human data), Scale (SWE-Bench Pro). Career advice: benchmark and evaluation work is increasingly high impact; engineers who can design reliable, human-validated evals and encode real-world task complexity will be in demand.
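To make the "test for contamination" advice concrete, here is a minimal sketch of one common heuristic: checking how much verbatim n-gram overlap a model's output has with benchmark material it was never shown. The function names and the choice of n are illustrative assumptions, not the speakers' actual methodology.

```python
# Hedged sketch: n-gram overlap as a rough contamination signal.
# High overlap between a model's unprompted output and held-out benchmark
# text suggests memorization rather than genuine problem solving.
# All names and thresholds here are illustrative, not OpenAI's method.

def ngrams(tokens, n=8):
    """Return the set of all n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_text, model_output, n=8):
    """Fraction of benchmark n-grams reproduced verbatim by the model."""
    bench = ngrams(benchmark_text.split(), n)
    out = ngrams(model_output.split(), n)
    if not bench:
        return 0.0
    return len(bench & out) / len(bench)
```

In practice one would run this over many benchmark instances and flag suspiciously high ratios for manual review; a single high score is weak evidence on its own.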
Chapter Summaries
- Chapter 1: Why SWE-Bench Verified is “over.” The benchmark is saturated and contaminated, so it’s no longer a reliable North Star.
- Chapter 2: How SWE-Bench Verified was built. OpenAI hired roughly 100 engineers to curate 500 tasks, with each task triple-reviewed for fairness.
- Chapter 3: New audit findings. Many model failures trace back to narrow or unfair tests, and frontier models show evidence of contamination.
- Chapter 4: Moving to SWE-Bench Pro. It is harder, more diverse, and shows less contamination, leaving more headroom for measuring progress.
- Chapter 5: What future benchmarks should measure. Longer tasks, open-ended design choices, code quality, and maintainability.
- Chapter 6: Human vs automated grading. Human rubrics are costly but crucial for judgment-heavy evals; automation scales once rubrics exist.
- Chapter 7: Preparedness framework. Evals track dual-use risks (bio, cyber, autonomy) and should evolve toward real-world impact metrics.
- Chapter 8: Community call. Build and share tougher, well-scored evals; track real-world usage and labor impact.