Recent methods for RL from verifiable rewards now beat GRPO baselines by a comfortable margin, and the advantage comes from assigning credit at far finer granularity than whole-response scores. By turning verification into token- and subproblem-level signals, the newest methods extract learning from partial progress that would otherwise be discarded.

Before these works, reinforcement learning for reasoning relied on a single scalar reward per generated answer. GRPO and similar RLHF-style pipelines treated the whole response as the unit of credit, which made credit assignment noisy and left hard problems stuck in “gradient dead zones”: when every sampled response to a problem receives the same reward, the group-normalized advantage is zero and the update carries no signal. No mechanism existed to reward partial solves or to isolate the effect of a single token on the final verdict.

DelTA’s discriminative token credit assignment reshapes the RL update into a linear discriminator over token-gradient vectors, amplifying side-specific directions while suppressing shared noise.…
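To make the response-level baseline concrete, here is a minimal sketch of GRPO-style group-normalized advantages, in which one scalar per response is copied to every token of that response. It illustrates the baseline described above, not DelTA or any of the newer methods. The function name `grpo_token_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for this example; real implementations differ in these details.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Response-level advantages, broadcast to every token of each response.

    rewards: one scalar verifier reward per sampled response in the group.
    lengths: number of tokens in each response.
    eps:     small constant to avoid division by zero (assumed value).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    per_response = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a response inherits the same scalar: no token-level credit.
    return [[a] * n for a, n in zip(per_response, lengths)]


# A group with mixed outcomes: correct responses get positive advantages,
# incorrect ones negative, but all tokens within a response are treated alike.
mixed = grpo_token_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 2])

# A hard problem where every sample fails: all advantages are zero,
# so the policy-gradient update for this group vanishes.
dead = grpo_token_advantages([0.0, 0.0, 0.0, 0.0], [3, 2, 4, 2])
```

In the `dead` group a response that got most of the way to the answer is indistinguishable from one that made no progress, which is the gap that token- and subproblem-level signals aim to close.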