Multimodal LLMs show 10-20 point accuracy drops from benchmarks to real hospital cases

GPT-4.1 falls from 42.25% to 24.65%.

GPT-4.1 scored 24.65% top-3 diagnostic accuracy on 5,811 real hospital dermatology cases, down from 42.25% on public benchmarks. The 5,811-case multi-site study tested five multimodal LLMs and found that benchmark performance substantially overestimates clinical capability.

Key facts

- GPT-4.1 top-3 accuracy: 42.25% on benchmarks, 24.65% on real cases.
- Open-weight models: 1.50%-13.35% real-world top-3 accuracy.
- Clinical context boosted GPT-4.1 to 38.93%, but the improvement was brittle to errors.
- 5,811 cases and 46,405 images from a multi-site hospital cohort.
- Triage sensitivity exceeded 60% but remained insufficient for clinical deployment.

A new study from researchers including Roy Jiang, Hyunjae Kim, and Zhenyue Qin quantifies the gap between multimodal LLM benchmark performance and real-world clinical dermatology.…
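
The size of the gap can be checked with simple arithmetic on the GPT-4.1 figures quoted in this article. The following is a minimal sketch in Python; the constant names and the `gap` helper are illustrative assumptions and do not come from the study.

```python
# GPT-4.1 top-3 accuracy figures as quoted in the article (percent).
BENCHMARK = 42.25      # public benchmarks
REAL_WORLD = 24.65     # 5,811 real hospital cases
WITH_CONTEXT = 38.93   # real cases with clinical context added


def gap(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration only.
    """
    points = before - after
    relative = 100.0 * points / before
    return round(points, 2), round(relative, 1)


drop_points, drop_relative = gap(BENCHMARK, REAL_WORLD)
context_gain = round(WITH_CONTEXT - REAL_WORLD, 2)

print(f"Benchmark to real-world drop: {drop_points} points ({drop_relative}% relative)")
print(f"Gain from clinical context: {context_gain} points")
```

On these figures the drop is 17.6 percentage points, which falls inside the 10-20 point range in the headline, and adding clinical context recovers 14.28 of those points.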