Matching frontier LLMs at 22× lower latency: a 184M-parameter intent classifier for healthcare text

DEV Community·Raihan·24 days ago

#ai #machinelearning #python #model #cost #patient

Healthcare practices drown in inbound patient text. Email, contact forms, live chat, SMS, and voicemail transcripts all deliver messages that need to be routed: to scheduling, to billing, to clinical, to the front desk. It's a high-volume, deterministic, latency-sensitive task.

The obvious answer in 2026 is to throw a frontier LLM at it. Claude Haiku 4.5 will give you 95% accuracy on this kind of classification. GPT-4o will too. But every call costs real money, adds about a second of network round-trip time, and sends patient text to a third party that may not have a business associate agreement (BAA) with you.

I built a small alternative, a 184M-parameter fine-tune of DeBERTa-v3-base, and benchmarked it against Claude Haiku 4.5, Claude Sonnet 4.6, and GPT-4o on a 1,154-example test set. The fine-tuned model lands within 4 percentage points of the best frontier model's accuracy, runs 22× faster on a CPU, and costs effectively $0 per inference after training. Total cost to build it: under $3.…
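To make the comparison concrete, here is a minimal sketch of how accuracy and per-message latency could be measured for any classifier and then compared. It is not the benchmark code behind the numbers above: `benchmark`, `compare`, and the toy `keyword_router` are hypothetical names introduced for illustration, and the keyword rules only stand in for a real model such as the fine-tuned DeBERTa classifier or an LLM API call. The four route labels are the ones named above.

```python
import statistics
import time
from typing import Callable, Sequence

# Any function mapping a message to a route label can be benchmarked.
Classifier = Callable[[str], str]


def benchmark(classify: Classifier, texts: Sequence[str], labels: Sequence[str]) -> dict:
    """Return accuracy and median per-message latency (seconds) for one classifier."""
    latencies = []
    correct = 0
    for text, label in zip(texts, labels):
        start = time.perf_counter()
        predicted = classify(text)
        latencies.append(time.perf_counter() - start)
        correct += predicted == label
    return {
        "accuracy": correct / len(texts),
        "median_latency_s": statistics.median(latencies),
    }


def compare(small: dict, frontier: dict) -> dict:
    """Accuracy gap in percentage points and latency speedup of `small` over `frontier`."""
    return {
        "accuracy_gap_pp": (frontier["accuracy"] - small["accuracy"]) * 100,
        "speedup": frontier["median_latency_s"] / small["median_latency_s"],
    }


def keyword_router(text: str) -> str:
    """Hypothetical stand-in classifier; a real setup would call a model here."""
    lowered = text.lower()
    if "appointment" in lowered or "reschedule" in lowered:
        return "scheduling"
    if "bill" in lowered or "invoice" in lowered:
        return "billing"
    if "pain" in lowered or "medication" in lowered:
        return "clinical"
    return "front_desk"


# Tiny illustrative test set (made up for this sketch, not the 1,154-example set).
texts = [
    "Can I reschedule my appointment to Friday?",
    "I have a question about my bill.",
    "My medication is causing headaches.",
    "What are your opening hours?",
]
labels = ["scheduling", "billing", "clinical", "front_desk"]

result = benchmark(keyword_router, texts, labels)
print(result["accuracy"])
```

Running `benchmark` once per model on the same test set and passing the two results to `compare` yields the two headline quantities: the accuracy gap in percentage points and the latency ratio. Using the median rather than the mean keeps one slow network call from dominating the latency figure.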
