I am responsible for a research project that aims to train a GPT-like model (Transformer decoder) in 100M, 250M, and 500M parameter variants.

Training dataset
- 750M tokens
- vocabulary of ~15k to ~100k tokens (depending on tokenizer settings)
- ~3% of the vocabulary accounts for ~50% of the training tokens (similar to natural language, where most of the vocabulary is used very sparsely)

Training hyper-params
- optimizer = AdamW
- lr = 1e-3 (works best compared to 1e-2 and 1e-4)
- betas = [0.9, 0.95]
- effective batch size = 4M tokens
- epochs = 16
- warmup steps ~200 (approx. 1 epoch)

Model hyper-params
- 16 layers (variants with up to 48 layers were also tested)
- embedding size = adjusted to yield the 100M, 250M, and 500M models
- MLP size = 4*n_embd
- 16 attention heads
- context window = 1000

Issue

The model seems to fail to learn basic auto-regressive behavior. It often gets stuck generating a single token (no repetition penalty, no sampling yet).

Is training GPT-like models still black magic? Is there some trick to this?…
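
For reference, here is a rough sketch of the arithmetic implied by the settings above: how many optimizer steps one epoch and the full run amount to, and roughly which embedding width lands near a target parameter count. The function names are hypothetical. The parameter estimate assumes the common approximation of 12 * n_layer * n_embd^2 for the Transformer blocks (attention plus a 4x MLP) and a single tied token-embedding matrix, and it ignores biases, layer norms, and positional embeddings. The vocabulary size of 15k is simply the low end of the range quoted above, not a measured value.

```python
import math

# Values taken from the settings listed above.
DATASET_TOKENS = 750_000_000
BATCH_TOKENS = 4_000_000
EPOCHS = 16
N_LAYER = 16
N_HEAD = 16


def steps_per_epoch(dataset_tokens=DATASET_TOKENS, batch_tokens=BATCH_TOKENS):
    """Optimizer steps needed to see the dataset once (last partial batch counted)."""
    return math.ceil(dataset_tokens / batch_tokens)


def approx_params(n_layer, n_embd, vocab_size):
    """Rough parameter count (assumption: tied embeddings, no biases or norms).

    Per layer: ~4*d^2 for attention projections + ~8*d^2 for a 4x MLP = 12*d^2.
    """
    return 12 * n_layer * n_embd ** 2 + vocab_size * n_embd


def embd_for_target(target_params, n_layer=N_LAYER, vocab_size=15_000, n_head=N_HEAD):
    """Solve 12*L*d^2 + V*d = target for d, rounded to a multiple of n_head."""
    a = 12 * n_layer
    d = (-vocab_size + math.sqrt(vocab_size ** 2 + 4 * a * target_params)) / (2 * a)
    return max(n_head, round(d / n_head) * n_head)


if __name__ == "__main__":
    per_epoch = steps_per_epoch()
    print(f"steps per epoch: {per_epoch}, total steps: {per_epoch * EPOCHS}")
    for target in (100e6, 250e6, 500e6):
        d = embd_for_target(target)
        n = approx_params(N_LAYER, d, 15_000)
        print(f"target {target / 1e6:.0f}M -> n_embd {d}, approx {n / 1e6:.1f}M params")
```

With a 4M-token batch, one pass over 750M tokens is fewer than 200 steps, which is why the ~200 warmup steps correspond to roughly one epoch. The whole 16-epoch run is only about 3,000 optimizer steps.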