Why SWE-bench Verified no longer measures frontier coding capabilities

OpenAI·@HashtagPLUS·about 1 month ago

Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks. After its release, SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases. Tracking and forecasting progress of these capabilities is also an important part of OpenAI’s Preparedness Framework. When we initially created the Verified benchmark, we attempted to address issues in the original SWE-bench dataset that made certain tasks impossible to accomplish. After initial leaps, state-of-the-art progress on SWE-bench Verified has slowed, improving from 74.9% to 80.9% in the last 6 months. This raises the question: do the remaining failures reflect model limitations or properties of the dataset itself?…
