Architecture Teardown: How Meta Trains LLMs for Code Generation on 100k GPU Clusters

DEV Community·ANKUSH CHOUDHARY JOHAL·about 1 month ago

#code #how #architecture #teardown #training #meta

In Q3 2024, Meta trained a 70B-parameter code-specialized LLM on 100,000 NVIDIA H100 GPUs, achieving 214 TFLOPS per GPU and 92% cluster utilization, a 3x improvement over its 2023 runs on a 16k A100 cluster. The total training cost was $17.4M for 21 days of continuous operation.

Key Insights

- Meta’s custom collective communication library (C3) achieves 98.7% bandwidth utilization on 100k H100 nodes, vs 89% for NCCL 2.18.3
- PyTorch 2.3.0 with FSDP2 and custom activation checkpointing reduces memory footprint by 41% vs vanilla FSDP for 70B code models
- Total training cost for the 70B model on 100k GPUs is $17.4M, 22% cheaper than equivalent 16k A100 cluster runs when accounting for the H100’s 3.2x throughput
- By 2026, Meta plans to deploy 250k GPU clusters with custom silicon,…
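The headline figures can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it uses the numbers quoted above, and it assumes that every GPU runs for the full 21 days and sustains the quoted 214 TFLOPS throughout, which the article does not state. The function names are hypothetical helpers, not part of any Meta tooling.

```python
# Figures quoted in the article.
NUM_GPUS = 100_000
TRAINING_DAYS = 21
TOTAL_COST_USD = 17.4e6
ACHIEVED_TFLOPS_PER_GPU = 214


def gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU-hours, assuming every GPU runs for the whole period."""
    return num_gpus * days * 24


def cost_per_gpu_hour(total_cost: float, num_gpus: int, days: int) -> float:
    """Implied cost of a single GPU-hour in USD."""
    return total_cost / gpu_hours(num_gpus, days)


def total_flops(tflops_per_gpu: float, num_gpus: int, days: int) -> float:
    """Total floating-point operations, assuming the per-GPU throughput
    is sustained on every GPU for the entire run (an assumption)."""
    seconds = days * 24 * 3600
    return tflops_per_gpu * 1e12 * num_gpus * seconds


if __name__ == "__main__":
    hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
    rate = cost_per_gpu_hour(TOTAL_COST_USD, NUM_GPUS, TRAINING_DAYS)
    flops = total_flops(ACHIEVED_TFLOPS_PER_GPU, NUM_GPUS, TRAINING_DAYS)
    print(f"GPU-hours:         {hours:,}")
    print(f"Cost per GPU-hour: ${rate:.3f}")
    print(f"Total FLOPs:       {flops:.3e}")
```

Under these assumptions the run comes to 50.4 million GPU-hours, or roughly $0.35 per GPU-hour, and on the order of 3.9e25 floating-point operations in total.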
