Semantic and Adaptive Evaluation of LLMs

Recent work moves past word-overlap scores toward semantic, uncertainty-aware testing. TRACER trains tiny classifiers on live model traces and only accepts outputs that pass an agreement check; it reaches full coverage on intent classification benchmarks while avoiding costly LLM judges [1]. A complementary line adds a test-time "zoom-in" step that refines predictions for GUI grounding whenever the model's confidence drops, improving accuracy by 13.4% without extra training data [2]. Together these approaches expose reasoning fragility (accuracies fall by more than 50% under systematic perturbations), suggesting that future benchmarks must reflect downstream utility rather than static lexical overlap [3].

Diffusion and Flow Matching across Language, Vision, and 3D

Diffusion models are no longer confined to image generation.…