Recent work shows that aligning synthetic data with a student’s style can recover reasoning ability lost during fine‑tuning, and that key‑value (KV) cache tricks can slash the FLOP and memory budget by orders of magnitude with negligible accuracy loss. The surprise is that these savings come without the dramatic drops that typically accompany aggressive compression. Fine‑tuning a weaker model on teacher‑generated code often harms the very capabilities it seeks to inherit. Standard practice replaces the student’s data entirely with the teacher’s output, assuming raw reasoning power will transfer. Likewise, on‑policy distillation usually ingests every token from a rollout, and inference caches retain every KV pair, inflating both compute and GPU memory. The field has long accepted these inefficiencies as the price of performance. TESSY interleaves a teacher and its student while generating data, forcing the teacher to emit “style” tokens that match the student’s distribution.…