In Q3 2024, Meta trained a 70B parameter code-specialized LLM on 100,000 Nvidia H100 GPUs, achieving 214 TFLOPS per GPU and 92% cluster utilization – a 3x improvement over their 2023 16k A100 cluster runs, with total training cost of $17.4M for 21 days of continuous operation. 📡 Hacker News Top Stories Right Now Ghostty is leaving GitHub (2250 points) Bugs Rust won't catch (158 points) How ChatGPT serves ads (265 points) Before GitHub (389 points) Show HN: Auto-Architecture: Karpathy's Loop, pointed at a CPU (88 points) Key Insights Meta’s custom collective communication library (C3) achieves 98.7% bandwidth utilization on 100k H100 nodes, vs 89% for NCCL 2.18.3 PyTorch 2.3.0 with FSDP2 and custom activation checkpointing reduces memory footprint by 41% vs vanilla FSDP for 70B code models Total training cost for 70B model on 100k GPUs is $17.4M, 22% cheaper than equivalent 16k A100 cluster runs when accounting for H100’s 3.2x throughput By 2026, Meta plans to deploy 250k GPU clusters with custom silicon,…