Serving 40 LoRA adapters on one base model: the throughput we got


DEV Community: pytorch·Marcus Chen·3 days ago


#dev #lora #adapter #adapters #base #customer


TL;DR: We fine-tune one LoRA adapter per enterprise customer on top of a single Llama 3.1 8B base. Running them as 40 separate deployments would have cost roughly $24k/month in mostly-idle GPU. Multi-LoRA serving in vLLM put all 40 on two A100s. Numbers and the parts that broke below.

At Nexus Labs we run the fine-tuning and eval team for agent automation. Each enterprise customer gets its own adapter because each has a different tool schema and a different house style for responses. Right now that's 40 customers in production. Each adapter is a rank-16 LoRA, about 42MB on disk, trained with PEFT and TRL on that customer's own trace data.

The obvious setup is one model server per customer. That's 40 copies of an 8B base. In bf16 the base is around 16GB of weights before KV cache. Forty of those do not fit on anything we can afford, and most customers send fewer than 5 requests a minute. So you're paying for a GPU to sit at 3% utilization. We priced it at about $24k/month across the fleet on reserved A100s. No.…
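To make the memory argument concrete, here is a rough back-of-envelope sketch using only the figures quoted above (40 customers, about 16GB of bf16 base weights, about 42MB per adapter, about $24k/month). The function names are made up for illustration. It assumes 1GB = 1000MB and that an adapter takes roughly its on-disk size once loaded. It ignores KV cache and activations, so treat the output as an estimate of weight memory only, not a capacity plan.

```python
# Back-of-envelope weight memory, using only figures quoted in the post.
# Assumptions (not from the post): 1 GB = 1000 MB, and a loaded adapter
# takes roughly its on-disk size. KV cache and activations are ignored.

N_CUSTOMERS = 40
BASE_WEIGHTS_GB = 16.0    # Llama 3.1 8B in bf16, "around 16GB"
ADAPTER_MB = 42.0         # rank-16 LoRA, "about 42MB" on disk
FLEET_COST_USD = 24_000   # quoted monthly price for 40 separate deployments


def separate_deployments_gb(n: int = N_CUSTOMERS) -> float:
    """Weight memory if every customer gets its own copy of the base."""
    return n * BASE_WEIGHTS_GB


def shared_base_gb(n: int = N_CUSTOMERS) -> float:
    """Weight memory for one copy of the base plus n adapters."""
    return BASE_WEIGHTS_GB + n * ADAPTER_MB / 1000.0


def cost_per_deployment_usd(n: int = N_CUSTOMERS) -> float:
    """Quoted fleet price split evenly across deployments."""
    return FLEET_COST_USD / n


if __name__ == "__main__":
    print(f"40 separate bases:    {separate_deployments_gb():.1f} GB of weights")
    print(f"1 base + 40 adapters: {shared_base_gb():.2f} GB of weights")
    print(f"per-deployment cost:  ${cost_per_deployment_usd():.0f}/month")
```

Under these assumptions the adapters add under 2GB on top of a single base copy, which is why sharing the base is the lever that matters; how much headroom is left for KV cache depends on the GPU and serving configuration, which this sketch does not model.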
