Benchmark Results: SmolLM3 3B, Phi-4-mini, DeepSeek V4, Grok 4.20 — Agent Coding Tested


DEV Community·Vilius·20 days ago


#ai #benchmark #model #models #coding #score


The second round of the Works With Agents agent coding benchmark is in: 32 models tested this time, up from 10. And the results are not what anyone expected.

The headline: tiny models won

| Rank | Model | Score |
|------|-------|-------|
| 🥇 | SmolLM3 3B | 93.3 |
| 🥈 | Phi-4-mini | 90.0 |
| 🥉 | Claude Sonnet 4 | 85.0 |
| 4 | Qwen2.5 1.5B | 85.0 |
| 5 | Qwen2.5 3B | 85.0 |
| 6 | Granite 3.2 2B | 82.5 |
| 7 | Ministral 3B | 81.7 |
| 8 | Mistral Large 3 | 79.6 |
| 9 | Gemma 4 31B | 78.3 |
| 10 | Gemma 4 26B A4B | 78.3 |

A 3-billion-parameter model from Hugging Face scored 93.3, eight points ahead of Claude Sonnet 4. Phi-4-mini (also a tiny model) took second at 90.0. Qwen2.5's 1.5B and 3B variants tied Claude at 85.0.

Frontier model results

| Model | Score |
|-------|-------|
| Claude Sonnet 4 | 85.0 |
| Gemini 2.5 Flash | 76.4 |
| GPT-5.4 | 76.6 |
| Kimi K2.6 | 75.0 |
| Grok 4.20 | 75.0 |
| MiniMax M2.7 | 69.9 |
| DeepSeek V4 Flash | 60.0 |
| GPT-5.5 | 60.0 |
| GPT-5.4 Pro | 51.6 |
| GPT-5.5 Pro | 43.3 |
| DeepSeek V4 Pro | 38.3 |

Grok 4.20 debuted at 75.0, tied with Kimi K2.6 and ahead of its Fast sibling (74.9). DeepSeek V4 Pro scored 38.3, well below its Flash variant.…
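For readers who want to check the gaps and ties quoted above, here is a minimal sketch. It uses only scores stated in this article; the `SCORES` dictionary and the `gap` and `tied_with` helpers are hypothetical names introduced for illustration, not part of the benchmark's own tooling.

```python
# Scores copied from the article; only a subset of models is included.
# SCORES, gap, and tied_with are illustrative names, not benchmark code.
SCORES = {
    "SmolLM3 3B": 93.3,
    "Phi-4-mini": 90.0,
    "Claude Sonnet 4": 85.0,
    "Qwen2.5 1.5B": 85.0,
    "Qwen2.5 3B": 85.0,
    "Kimi K2.6": 75.0,
    "Grok 4.20": 75.0,
    "DeepSeek V4 Flash": 60.0,
    "DeepSeek V4 Pro": 38.3,
}


def gap(a: str, b: str) -> float:
    """Score of model a minus score of model b, rounded to one decimal."""
    return round(SCORES[a] - SCORES[b], 1)


def tied_with(model: str) -> list[str]:
    """Other models in SCORES with exactly the same score as `model`."""
    target = SCORES[model]
    return sorted(m for m, s in SCORES.items() if m != model and s == target)


if __name__ == "__main__":
    print(gap("SmolLM3 3B", "Claude Sonnet 4"))         # 8.3
    print(gap("DeepSeek V4 Flash", "DeepSeek V4 Pro"))  # 21.7
    print(tied_with("Claude Sonnet 4"))  # ['Qwen2.5 1.5B', 'Qwen2.5 3B']
    print(tied_with("Grok 4.20"))        # ['Kimi K2.6']
```

The "eight points ahead" figure corresponds to a gap of 8.3, and DeepSeek V4 Pro trails its Flash variant by 21.7 points.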
