Reproducing Chinchilla Scaling on a Budget


DEV Community·Thokozani Buthelezi·about 1 month ago


#ai #deeplearning #python #model #compute #loss


Training a 70B-parameter model costs millions of dollars. Scaling laws exist so you don't have to guess how to spend that budget. Here's what I learned reproducing them on a free GPU.

Introduction

Scaling laws are rules that tell us how model performance improves as you increase quantities such as model size, dataset size, and compute. Instead of guessing "bigger models = better", scaling laws give a mathematical relationship between:

- model size (N, number of parameters)
- dataset size (D, number of tokens)
- compute (C, number of training FLOPs)
- loss (L, how wrong the model is)

The core idea

$$L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E$$

This looks intimidating, but it's simple:

- increasing N (model size) -> loss goes down
- increasing D (data) -> loss goes down
- both have diminishing returns, governed by the scaling exponents (α, β)

Here E is the irreducible error: the entropy of the data itself, a floor on the loss that no amount of parameters or tokens can remove.

The relationship between the loss and these quantities is not linear, it…
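
To make the formula concrete, here is a minimal Python sketch that evaluates L(N, D) for a few model sizes at a fixed token count. The function name `parametric_loss` and the default constants are assumptions, not values from this article: they are roughly the fits reported in the Chinchilla paper (Hoffmann et al., 2022) and are used only for illustration. A small-budget reproduction would fit its own A, B, α, β, and E.

```python
def parametric_loss(n_params, n_tokens,
                    A=406.4, B=410.7, alpha=0.34, beta=0.28, E=1.69):
    """Evaluate L(N, D) = A / N^alpha + B / D^beta + E.

    The default constants are illustrative assumptions (approximately the
    published Chinchilla fits), not results from this article.
    """
    model_term = A / n_params ** alpha   # shrinks as the model grows
    data_term = B / n_tokens ** beta     # shrinks as the dataset grows
    return model_term + data_term + E    # E is the floor that never goes away


if __name__ == "__main__":
    n_tokens = 1e9  # hypothetical fixed dataset size
    previous = None
    for n_params in (1e7, 1e8, 1e9):
        loss = parametric_loss(n_params, n_tokens)
        gain = "" if previous is None else f"  (improvement: {previous - loss:.3f})"
        print(f"N = {n_params:.0e}  ->  L = {loss:.3f}{gain}")
        previous = loss
```

Each tenfold increase in N buys a smaller improvement than the one before it, which is the diminishing-returns behaviour the exponents encode, and the loss never drops below E.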
