Training a 70B-parameter model costs millions of dollars. Scaling laws exist so you don't have to guess how to spend that budget. Here's what I learned reproducing them on a free GPU.

Introduction

Scaling laws are empirical rules that describe how model performance improves as you increase model size, dataset size, and compute. Instead of guessing "bigger models = better", scaling laws give a mathematical relationship between:

- model size (N, number of parameters)
- dataset size (D, number of tokens)
- compute (C, number of training FLOPs)
- loss (L, how wrong the model is)

The core idea

$$L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E$$

This looks intimidating but it's simple:

- increasing N (model size) -> loss goes down
- increasing D (data) -> loss goes down
- but both have diminishing returns because of the scaling exponents (α, β)

Here A and B are constants, and E is the irreducible error (the entropy term): the floor that the loss approaches no matter how large N and D become.

The relationship between the loss and these quantities is not linear, it…