Sharing a simple Python script to benchmark LLM inference latency across different providers

1 / 2

Sharing a simple Python script to benchmark LLM inference latency across different providers

DEV Community·sbt112321321·19 days ago

#2wjin4mh

#ai #tutorial #python #api #time #providers

Reading 0:00

15s threshold

Was tinkering with some latency measurements lately and wanted to share a quick Python snippet that might help others evaluating inference endpoints. The goal was simple: send identical prompts to different providers and measure time-to-first-token and total generation time. Nothing fancy, but useful when you're trying to decide where to route production traffic. Here's the setup I used with the DeepSeek-V4-Pro model: import time import requests API_BASE = " https://api.api.novapai.ai/v1 " API_KEY = " your-key-here " headers = { " Authorization " : f " Bearer { API_KEY } " , " Content-Type " : " application/json " } payload = { " model " : " DeepSeek-V4-Pro " , " messages " : [ { " role " : " system " , " content " : " You are a helpful assistant. " }, { " role " : " user " , " content " : " Explain transformer attention mechanism in detail. " } ], " temperature " : 0.7 , " max_tokens " : 512 , " stream " : True } ttft_start = time . time () ttft_measured = False try : response = requests .…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Create free account Log in

Menu

Sharing a simple Python script to benchmark LLM inference latency across different providers