Menu

Post image 1
Post image 2
Post image 3
Post image 4
Post image 5
Post image 6
Post image 7
Post image 8
Post image 9
Post image 10
Post image 11
Post image 12
Post image 13
Post image 14
Post image 15
Post image 16
1 / 16
0

The Counterintuitive Networking Decisions Behind OpenAI’s 131,000-GPU Training Fabric | Towards Data Science

Towards Data Science·Gokul Chandra Purnachandra Reddy·18 days ago
#f8FVxIBr
Reading 0:00
15s threshold

. Accept packet loss on purpose. Spray each transfer across hundreds of random paths. If someone handed you this list of design decisions for a network connecting 131,000 GPUs, you would assume it was written by someone who had never operated a production network. A consortium of OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA built exactly this — and quietly inverted three decades of consensus about how high-performance data center networks should work. The protocol is called MRC, short for Multipath Reliable Connection. It was released on May 5, 2026 through the Open Compute Project . The accompanying research paper (Araujo et al., 2026) details its deployment across OpenAI’s largest NVIDIA GB200 supercomputers, including the Stargate site with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers. MRC has been used to train the latest frontier models behind ChatGPT and Codex.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More