Menu

Post image 1
Post image 2
1 / 2
0

Sharing a simple Python script to benchmark LLM inference latency across different providers

DEV Community·sbt112321321·19 days ago
#2wjin4mh
#ai#tutorial#python#api#time#providers
Reading 0:00
15s threshold

Was tinkering with some latency measurements lately and wanted to share a quick Python snippet that might help others evaluating inference endpoints. The goal was simple: send identical prompts to different providers and measure time-to-first-token and total generation time. Nothing fancy, but useful when you're trying to decide where to route production traffic. Here's the setup I used with the DeepSeek-V4-Pro model: import time import requests API_BASE = " https://api.api.novapai.ai/v1 " API_KEY = " your-key-here " headers = { " Authorization " : f " Bearer { API_KEY } " , " Content-Type " : " application/json " } payload = { " model " : " DeepSeek-V4-Pro " , " messages " : [ { " role " : " system " , " content " : " You are a helpful assistant. " }, { " role " : " user " , " content " : " Explain transformer attention mechanism in detail. " } ], " temperature " : 0.7 , " max_tokens " : 512 , " stream " : True } ttft_start = time . time () ttft_measured = False try : response = requests .…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More