When deploying large language models to production, measuring performance accurately is critical. Whether you're using vLLM, SGLang, TensorRT-LLM, or a custom inference stack, you need to understand:

- Throughput: How many requests per second can your system handle?
- Latency metrics: Time to First Token (TTFT), Inter-Token Latency (ITL), and end-to-end latency
- Token generation speed: Tokens per second under different concurrency levels
- Tail latency: P95 and P99 values that affect user experience

In this post, I'll walk through the key metrics for benchmarking language models and share why I built llmperf-rs, a Rust-based benchmarking tool that takes a different approach to measuring these metrics.

The Problem with Existing Tools

While working with ray-project/llmperf (now archived), I noticed that Inter-Token Latency (ITL) was calculated by averaging per-request first, then aggregating those averages.…
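
To make the aggregation-order issue concrete, here is a minimal Python sketch. It is an illustration only, not code from ray-project/llmperf or llmperf-rs: the sample gaps and the helper names (`mean_of_request_means`, `pooled_mean`, `percentile`) are hypothetical. It compares averaging ITL per request first against pooling every inter-token gap across all requests.

```python
import math
from statistics import mean

# Hypothetical inter-token gaps in milliseconds, one list per request.
itl_per_request = [
    [10.0] * 8,      # long request that streamed smoothly
    [50.0, 150.0],   # short request that stalled
]


def mean_of_request_means(requests):
    """Average each request's gaps first, then average those averages."""
    return mean(mean(gaps) for gaps in requests if gaps)


def pooled_mean(requests):
    """Average every inter-token gap across all requests directly."""
    return mean(gap for gaps in requests for gap in gaps)


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


request_means = [mean(gaps) for gaps in itl_per_request if gaps]
pooled_gaps = [gap for gaps in itl_per_request for gap in gaps]

print("mean of per-request means:", mean_of_request_means(itl_per_request))  # 55.0
print("pooled mean:", pooled_mean(itl_per_request))                          # 28.0
print("P99 over per-request means:", percentile(request_means, 99))          # 100.0
print("P99 over pooled gaps:", percentile(pooled_gaps, 99))                  # 150.0
```

In this toy data the two approaches disagree on both the mean and the tail. Averaging per request first gives every request equal weight regardless of how many tokens it produced, and it smooths individual stalls into a request-level average before any percentile is taken, so the 150 ms gap never appears in the P99.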