Deep Dive: How PyTorch 2.5’s Compiled Mode Speeds Up Inference on AWS Inferentia 3


DEV Community·ANKUSH CHOUDHARY JOHAL·about 1 month ago


#deep #dive #pytorch #compiled #model #torch


In Q3 2024 benchmarks, PyTorch 2.5’s compiled mode delivered 3.2x higher inference throughput on AWS Inferentia 3 for BERT-Large workloads compared to eager mode, cutting p99 latency from 210ms to 65ms while reducing per-inference cost by 42%.

Key Insights

PyTorch 2.5 compiled mode reduces Inferentia 3 kernel launch overhead by 78% via ahead-of-time graph lowering to Neuron SDK 2.19.

AWS Neuron SDK 2.19 adds first-class support for PyTorch 2.5's torch.compile() with custom backend registration for Inferentia 3's NeuronCore v3.

Teams migrating from Inferentia 2 to Inferentia 3 with PyTorch 2.5 compiled mode see 62% lower per-inference costs than equivalent GPU-based deployments.…
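As a quick sanity check on the headline figures, the short sketch below recomputes the p99 latency improvement and the remaining per-inference cost from the numbers quoted above. The helper names `speedup` and `remaining_fraction` are hypothetical and exist only for this illustration; the inputs (210ms, 65ms, 42%) are taken from the article, and nothing here measures real hardware.

```python
# Illustrative arithmetic only: inputs are the figures quoted in the article,
# and the helper names are hypothetical (not part of PyTorch or the Neuron SDK).

def speedup(before_ms: float, after_ms: float) -> float:
    """Return how many times smaller the 'after' latency is than the 'before' latency."""
    return before_ms / after_ms


def remaining_fraction(reduction_pct: float) -> float:
    """Return the fraction of the original cost left after a percentage reduction."""
    return 1.0 - reduction_pct / 100.0


p99_speedup = speedup(210.0, 65.0)       # eager p99 vs. compiled p99
cost_fraction = remaining_fraction(42.0)  # compiled cost relative to eager cost

print(f"p99 latency improvement: {p99_speedup:.2f}x")
print(f"per-inference cost relative to eager: {cost_fraction:.2f}")
```

The latency ratio comes out at roughly 3.23x, which is in line with the quoted 3.2x throughput gain if throughput is assumed to scale inversely with latency at a fixed level of concurrency. That is an assumption of this sketch, not something the article states.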
