GPT-4.1 Hits 24.65% Derm Accuracy on Real Cases vs 42.25% Benchmarks


DEV Community·gentic news·25 days ago


#ai #machinelearning #research #deeplearning #clinical #real


Multimodal LLMs show 10-20 point accuracy drops from benchmarks to real hospital cases. GPT-4.1 scored 24.65% top-3 diagnostic accuracy on 5,811 real hospital dermatology cases, down from 42.25% on public benchmarks. The multi-site study tested five multimodal LLMs and found that benchmark performance substantially overestimates clinical capability.

Key facts:
- GPT-4.1 top-3 accuracy: 42.25% on benchmarks, 24.65% on real cases.
- Open-weight models: 1.50%-13.35% real-world top-3 accuracy.
- Clinical context raised GPT-4.1 to 38.93%, but the gain was brittle to errors.
- Dataset: 5,811 cases and 46,405 images from a multi-site hospital cohort.
- Triage sensitivity was above 60%, but that is insufficient for clinical deployment.

A new study from researchers including Roy Jiang, Hyunjae Kim, and Zhenyue Qin quantifies the gap between multimodal LLM benchmark performance and real-world clinical dermatology.…
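To make the headline numbers concrete, here is a minimal Python sketch of how a top-3 accuracy and the benchmark-to-clinic gap can be computed. It assumes a simple exact-match scoring rule, and the function names and the toy diagnosis lists are hypothetical. The summary does not describe the study's actual scoring procedure, which may differ. Only the percentages are taken from the article.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=3):
    """Fraction of cases whose true label appears among the first k predictions.

    Assumes exact string matching; the study may have used a different rule.
    """
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, true_labels)
        if truth in preds[:k]
    )
    return hits / len(true_labels)


def accuracy_gap(benchmark_pct, real_pct):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = benchmark_pct - real_pct
    relative = absolute / benchmark_pct * 100
    return absolute, relative


# Toy, made-up cases purely to illustrate the metric (not study data).
toy_predictions = [
    ["eczema", "psoriasis", "tinea"],
    ["melanoma", "nevus", "basal cell carcinoma"],
    ["acne", "rosacea", "folliculitis"],
    ["urticaria", "eczema", "scabies"],
]
toy_truth = ["psoriasis", "squamous cell carcinoma", "acne", "scabies"]
toy_top3 = top_k_accuracy(toy_predictions, toy_truth, k=3)

# Figures reported in the article for GPT-4.1.
abs_drop, rel_drop = accuracy_gap(42.25, 24.65)

# Gain from adding clinical context on real cases (38.93% vs 24.65%).
context_gain = 38.93 - 24.65

print(f"toy top-3 accuracy: {toy_top3:.2f}")
print(f"benchmark-to-clinic drop: {abs_drop:.2f} points ({rel_drop:.2f}% relative)")
print(f"clinical-context gain: {context_gain:.2f} points")
```

Under these assumptions, the reported GPT-4.1 figures correspond to a drop of about 17.60 percentage points, which sits inside the 10-20 point range the article cites, and clinical context recovers about 14.28 of those points.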
