In this blog post, we will use NVIDIA AIPerf to expose a hidden performance problem that most LLM deployments never catch until real users start complaining. I ran three simple tests against a local model, and the results tell a story that every performance engineer should see.

The Setup

For this experiment, I used:

Model: granite4:350m running locally via Ollama
Endpoint: http://localhost:11434
Tool: NVIDIA AIPerf (the official successor to GenAI-Perf)

Head to https://github.com/ai-dynamo/aiperf to install AIPerf. It is a single pip install:

    pip install aiperf

Granite 4 350M is a small, fast model, well suited to local testing on a MacBook or a dev machine without a powerful GPU. The principles you will see here apply equally to larger models in cloud deployments.

Run 1: The Baseline That Lies

I started with the most common mistake in LLM performance testing: a single-user baseline.…
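To make "single-user baseline" concrete, here is a minimal sketch of a hand-rolled latency probe against the endpoint from the setup. It is not how AIPerf measures anything. It assumes Ollama's native /api/generate route in non-streaming mode, and the helper names (build_request, timed_call, summarize) and the prompt are made up for illustration.

```python
import json
import statistics
import time
import urllib.request

# Endpoint and model from the setup; the /api/generate path is an assumption
# about which Ollama route you want to probe.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "granite4:350m"


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request (hypothetical helper)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_call(prompt: str, timeout: float = 60.0) -> float:
    """Send one request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        resp.read()  # drain the full body so the timing covers generation
    return time.perf_counter() - start


def summarize(latencies: list[float]) -> dict[str, float]:
    """Reduce a list of latencies to min, median, and max."""
    ordered = sorted(latencies)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Sequential requests: one simulated user, never more than one in flight.
    samples = [timed_call("Say hello in one sentence.") for _ in range(5)]
    print(summarize(samples))
```

Because the requests run strictly one after another, the numbers this prints describe an idle server with a single caller and say nothing about behaviour once several users arrive at the same time.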