Post‑training tricks cut LLM cost without losing ability

DEV Community·Papers Mache·26 days ago

Recent work shows that aligning synthetic data with a student’s style can recover reasoning ability lost during fine‑tuning, and that key‑value (KV) cache tricks can slash the FLOP and memory budget by orders of magnitude with negligible accuracy loss. The surprise is that these savings come without the dramatic drops that typically accompany aggressive compression.

Fine‑tuning a weaker model on teacher‑generated code often harms the very capabilities it seeks to inherit. Standard practice replaces the student’s data entirely with the teacher’s output, assuming raw reasoning power will transfer. Likewise, on‑policy distillation usually ingests every token from a rollout, and inference caches retain every KV pair, inflating both compute and GPU memory. The field has long accepted these inefficiencies as the price of performance.

TESSY interleaves a teacher and its student while generating data, forcing the teacher to emit “style” tokens that match the student’s distribution.…
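To make the point about inference caches retaining every KV pair concrete, here is a minimal back-of-the-envelope sketch of how a full KV cache grows with sequence length. It uses the standard accounting for a decoder-only transformer (one key and one value vector per token, per layer, per KV head). The function name `kv_cache_bytes` and the model dimensions below are hypothetical, chosen only for illustration; they do not come from the article or from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to keep keys and values for every token in every layer.

    bytes_per_value=2 assumes 16-bit storage (fp16 or bf16).
    """
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration (an assumption, not taken from the article).
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (4096, 32768):
    size = kv_cache_bytes(seq_len=seq_len, **config)
    print(f"{seq_len:>6} tokens: {size / 2**30:.1f} GiB")
```

Under these assumed dimensions the cache costs 0.5 MiB per token, so it scales linearly with context length and reaches 16 GiB for a single 32,768-token sequence. That linear growth is the memory pressure that cache-reduction techniques target.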
