The SDK You Pick Matters More Than the Model — A 13-LLM Benchmark on the Same Agentic Task

DEV Community·Thomas Landgraf·about 1 month ago

#ai #agentskills #model #openai #harness #anthropic

If you have ever built an agent that walks a codebase, calls tools, and writes structured output, you have hit the same wall I kept hitting: the same model produces wildly different results on the same task depending on what harness you wrap it in. Swap Claude for GPT behind a single OPENAI_BASE_URL and you lose half your output quality. Everyone blames the model. The model is rarely the variable.

I ran an experiment to put a number on it. Thirteen LLMs ran the same real agentic task: Claude Opus 4.7, Sonnet 4.6, Haiku 4.5, GPT 5.4, GPT 5.4 Mini, two Gemini 3.1 previews, and six open-weights locals (Qwen 3.6 35B A3B, Gemma 4 at three sizes, GPT-OSS 20B, Nemotron 3 Nano). Same codebase (excalidraw), same MCP tools, same system prompt. Only the model changes. The output is a specification tree: goal → feature → requirement hierarchies of Markdown files.

What if the SDK is doing more of the work than anyone admits?

Every provider ships an SDK. Most teams assume the SDK is a thin wire-protocol wrapper.…
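To make the output format concrete, here is a minimal Python sketch of a goal → feature → requirement tree that maps to Markdown files. The preview does not show the actual file layout, so everything here is an assumption for illustration: the `SpecNode` class, the `spec/` root directory, the slug-based file names, and the nesting of child files under a directory named after their parent are all hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class SpecNode:
    """One node of the specification tree (hypothetical structure)."""
    kind: str   # "goal", "feature", or "requirement"
    slug: str   # assumed file-name stem, e.g. "g1"
    title: str
    children: list["SpecNode"] = field(default_factory=list)


# Which kind of child each level may contain: goal -> feature -> requirement.
CHILD_KIND = {"goal": "feature", "feature": "requirement", "requirement": None}


def validate(node: SpecNode) -> None:
    """Raise ValueError if a child sits at the wrong level of the hierarchy."""
    expected = CHILD_KIND[node.kind]  # KeyError for an unknown kind
    for child in node.children:
        if child.kind != expected:
            raise ValueError(
                f"{node.kind} '{node.slug}' cannot contain "
                f"{child.kind} '{child.slug}'"
            )
        validate(child)


def markdown_paths(node: SpecNode,
                   parent: PurePosixPath = PurePosixPath("spec")) -> list[str]:
    """Return one Markdown path per node, parents listed before children."""
    here = parent / node.slug
    paths = [f"{here}.md"]
    for child in node.children:
        paths.extend(markdown_paths(child, here))
    return paths
```

A tree with one goal `g1`, one feature `f1`, and one requirement `r1` would pass `validate` and map to three files under this assumed layout: `spec/g1.md`, `spec/g1/f1.md`, and `spec/g1/f1/r1.md`. A shape check like this is one way to compare structured output across models without judging the prose inside each file.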
