Benchmarking Local Coding LLMs: 11 Realistic Tasks, 232 Runs, and the Bugs My Bench Found in My Agent

DEV Community·kuroko·about 1 month ago
#llm #rust #ollama #model #qwen3 #file

What can a 16GB GPU and a local LLM actually do for everyday coding work? I built an 11-task benchmark to find out and ran four open-weight models through it (9B to 35B; the 35B is an MoE with 3B active parameters per token). That came to 232 runs in total, all on a single RTX 5060 Ti with 16GB of VRAM.

Headline: the biggest, newest model (Qwen3.6-35B-A3B) won with a 100% pass rate (29/29 runs, 11/11 tasks) after some tuning. The previous-gen qwen3.5:9b, older and smaller, passed 9/11 tasks at an average of 24s per run, roughly one third of the 35B's wall time.

So the more interesting question turns out not to be "which model wins" but "do you actually need the latest, biggest model": The benchmark found three bugs in my own agent before it surfaced anything interesting about the models. Picking the right quantization (UD-Q3_K_M instead of Q4_K_M) was worth ~33% on average and saved one model from CPU offload entirely, but the same quant under FP16 KV cache blew up on two tasks specifically.…
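The headline figures count passes two ways, per run (29/29) and per task (11/11), alongside an average wall time per run. A minimal sketch of how such a tally could be computed is shown below. The record shape `(task_id, passed, seconds)`, the `summarize` helper, and the rule that a task passes only if all of its runs pass are assumptions made for illustration, not the author's actual harness, and the sample data is made up.

```python
from collections import defaultdict


def summarize(runs):
    """Tally run-level and task-level pass counts plus mean wall time.

    `runs` is an iterable of (task_id, passed, seconds) tuples.
    This record shape is hypothetical, not taken from the article.
    """
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one run")

    # Group pass/fail outcomes by task.
    by_task = defaultdict(list)
    for task_id, passed, _seconds in runs:
        by_task[task_id].append(passed)

    # Assumed rule: a task counts as passed only if every run of it passed.
    tasks_passed = sum(all(results) for results in by_task.values())

    return {
        "runs_passed": sum(passed for _, passed, _ in runs),
        "runs_total": len(runs),
        "tasks_passed": tasks_passed,
        "tasks_total": len(by_task),
        "mean_seconds": sum(s for _, _, s in runs) / len(runs),
    }


# Made-up illustrative data, not results from the benchmark.
example_runs = [
    ("task_a", True, 20.0),
    ("task_a", True, 30.0),
    ("task_b", True, 10.0),
    ("task_b", False, 40.0),
]
summary = summarize(example_runs)
```

Keeping the two counts separate matters because a model can pass most runs while still failing a task outright, which a single blended percentage would hide.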
