How a $0.02/Call Model Scored 78.2% on SWE-bench Verified — Beating Every Model on the Leaderboard

DEV Community·Hoyin kyoma·24 days ago
#ai #llm #agent #xanther #model #context

TL;DR: We added architectural context to AI coding agents via MCP and tested on SWE-bench Verified (500 real bugs). MiniMax M2.5, a model that costs $0.02 per call, scored 78.2%, surpassing every model on the official mini-SWE-agent leaderboard, including Claude Opus 4.5 (76.8%), which costs 37x more per call. The improvement comes entirely from better context, not a better model.

Full benchmark results and interactive dashboard: xanther.ai/benchmarks

Try it free: xanther.ai

The Official Leaderboard (as of February 2026)

The SWE-bench Verified leaderboard uses mini-SWE-agent as a standardized harness to evaluate models on 500 human-verified bug instances from real open-source Python repositories.…
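To put the headline numbers from the TL;DR in concrete terms, here is a minimal sketch that converts the quoted scores into resolved-instance counts and works out the per-call cost implied by the "37x" claim. It assumes each score is simply resolved instances divided by the 500-instance benchmark, and that the 37x multiplier applies directly to the quoted $0.02 figure; the function and variable names are illustrative and do not come from the post.

```python
TOTAL_INSTANCES = 500  # size of SWE-bench Verified, as stated in the post


def resolved_count(score_percent: float, total: int = TOTAL_INSTANCES) -> int:
    """Convert a percentage score into a whole number of resolved instances.

    Assumption: the score is resolved / total, rounded to one decimal place.
    """
    return round(score_percent / 100 * total)


minimax_resolved = resolved_count(78.2)  # MiniMax M2.5 with added context
opus_resolved = resolved_count(76.8)     # Claude Opus 4.5 on the leaderboard
gap = minimax_resolved - opus_resolved

# Assumption: "37x more per call" is taken relative to the quoted $0.02.
# This derived figure is not stated in the post itself.
implied_opus_cost = round(0.02 * 37, 2)

print(f"MiniMax M2.5 resolved: {minimax_resolved} / {TOTAL_INSTANCES}")
print(f"Claude Opus 4.5 resolved: {opus_resolved} / {TOTAL_INSTANCES}")
print(f"Gap: {gap} instances")
print(f"Implied Opus cost per call: ${implied_opus_cost:.2f}")
```

Under these assumptions, the 1.4-point gap corresponds to 7 of the 500 instances, and the implied comparison price is about $0.74 per call.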
