Everything You Know About Scaling Web Apps Breaks When You Serve an LLM


DEV Community·Pawan Kumar·19 days ago




This is Part 1 of a practical series on hosting large LLMs on Kubernetes.

Most platform engineers already know how to scale a web app. Put it in a container. Deploy it on Kubernetes. Add CPU and memory requests. Put a Service or Ingress in front. Configure HPA. Watch p95 latency, error rate, CPU, memory, and request throughput. Add replicas when traffic goes up.

That playbook works for a lot of services. Then you try to serve a large language model, and suddenly the old mental model starts cracking.

A request is no longer just a request. Memory does not just mean RAM. Latency is not one number. Scaling a pod does not mean capacity appears instantly. One "replica" may need one GPU, eight GPUs, or several machines working together. And the bottleneck may not be CPU at all.

The first mental shift is simple: LLM serving is not normal web serving. The real unit of work is the token.

A request is no longer a request

In a normal web app, request count is often a useful planning signal. Not perfect, obviously.…
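To make the "unit of work is the token" idea concrete, here is a minimal sketch that compares two traffic mixes with the same request count. Everything in it is an illustrative assumption rather than something from the article: the `Request` class, the `token_load` helper, and all of the token counts are hypothetical, and simply summing prompt and output tokens is a deliberate simplification of real serving cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical model of one LLM request, measured in tokens."""
    prompt_tokens: int   # tokens the model has to read
    output_tokens: int   # tokens the model has to generate

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.output_tokens


def token_load(requests: list[Request]) -> int:
    """Total tokens across a batch of requests (a rough proxy for work)."""
    return sum(r.total_tokens for r in requests)


# Assumed, made-up traffic mixes: same request count, very different work.
short_chats = [Request(prompt_tokens=50, output_tokens=100)] * 10
long_documents = [Request(prompt_tokens=8000, output_tokens=1000)] * 10

print("requests:", len(short_chats), "tokens:", token_load(short_chats))
print("requests:", len(long_documents), "tokens:", token_load(long_documents))
print("load ratio:", token_load(long_documents) / token_load(short_chats))
```

Both mixes would look identical on a requests-per-second dashboard, yet under these assumed numbers the second one carries 60 times the token volume, which is why request count alone is a weak planning signal here.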
