Full-Stack Optimizations for Agentic Inference with NVIDIA Dynamo

NVIDIA Technical Blog·Ishan Dhanani·about 1 month ago

Coding agents are starting to write production code at scale. Stripe’s agents generate 1,300+ PRs per week. Ramp attributes 30% of merged PRs to agents. Spotify reports 650+ agent-generated PRs per month. Tools like Claude Code and Codex make hundreds of API calls per coding session, each carrying the full conversation history. Behind every one of these workflows is an inference stack under significant KV cache pressure.

Figure 1. Cumulative KV cache reads outpace writes in agentic inference due to repeated reuse of prompt and context across sequential requests.

Let's take Claude Code as an example. After the first API call writes the conversation prefix to KV cache, every subsequent call to the same worker sees an 85-97% cache hit rate. Agent teams (or swarms) push this further, with a 97.2% aggregate cache hit rate across 4 Opus teammates. An 11.7x read/write ratio means the system reads nearly 12 tokens from cache for every token it writes.…
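To make the read/write arithmetic above concrete, here is a minimal Python sketch of a single-agent session. It assumes a simplified model that is not taken from the article: the first call writes the whole prefix to KV cache, and every later call reads everything cached so far and writes only its new tokens. The names `CallUsage`, `simulate_session`, and `summarize`, and the token counts used, are hypothetical, and the definitions of hit rate and read/write ratio here may differ from how the article measures them.

```python
from dataclasses import dataclass


@dataclass
class CallUsage:
    cache_read_tokens: int   # prompt tokens served from existing KV cache
    cache_write_tokens: int  # prompt tokens newly written to KV cache


def simulate_session(prefix_tokens: int, new_tokens_per_call: int, num_calls: int) -> list[CallUsage]:
    """Simplified model (an assumption): each call resends the full history to the same worker."""
    calls = []
    cached = 0
    for i in range(num_calls):
        # The first call writes the prefix; later calls write only what is new.
        writes = prefix_tokens if i == 0 else new_tokens_per_call
        calls.append(CallUsage(cache_read_tokens=cached, cache_write_tokens=writes))
        cached += writes
    return calls


def summarize(calls: list[CallUsage]) -> tuple[float, float]:
    """Return (aggregate cache hit rate, read/write ratio) over a session."""
    reads = sum(c.cache_read_tokens for c in calls)
    writes = sum(c.cache_write_tokens for c in calls)
    total = reads + writes
    hit_rate = reads / total if total else 0.0
    ratio = reads / writes if writes else float("inf")
    return hit_rate, ratio


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the article.
    session = simulate_session(prefix_tokens=10_000, new_tokens_per_call=500, num_calls=20)
    hit_rate, ratio = summarize(session)
    print(f"aggregate hit rate: {hit_rate:.1%}, read/write ratio: {ratio:.1f}x")
```

Even in this toy model, cumulative reads grow roughly quadratically with the number of calls while writes grow linearly, which is the shape Figure 1 describes.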
