Why Merged LoRA Barely Changes Inference Time


DEV Community·Natnael Alemseged·27 days ago


#machinelearning #llm #model #lora #merged #base


While my peer was benchmarking a sales conversion classifier fine-tuned on Qwen1.5-0.5B-Chat, the merged LoRA version of the model took 14,228 ms per task and the bare base model took 14,045 ms. That 183 ms gap is only about 1.3%. Why doesn't merging in extra trained weights make inference slower? And if the adapter is not the thing driving latency, what actually is?

The short answer: once LoRA is merged, the model is no longer doing "base model plus adapter" at inference time. It is just doing the base model computation with a different set of weight values. The tensor shapes do not change, the number of layers does not change, and the number of bytes that must be moved for each generated token is almost the same. On modern GPUs, that last point matters most.

One caution upfront: with only one timing run per system on a shared Colab T4, you cannot prove that 183 ms is "real." A 1.3% gap is plausibly noise, not evidence that merged LoRA adds meaningful latency.…
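To make the "same shapes, different values" point concrete, here is a minimal sketch in pure Python. It is not the benchmark code from the post. The toy sizes and the names `W`, `A`, `B`, `matmul`, and `add` are assumptions chosen for illustration, and the usual LoRA scaling factor is assumed to be folded into `B`. It merges a low-rank update into a weight matrix and checks that the merged matrix has the same shape as the original and gives the same output as the unmerged "base plus adapter" path.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices with identical shapes."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def shape(m):
    """Return (rows, cols) of a list-of-rows matrix."""
    return (len(m), len(m[0]))

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_out, d_in, r = 6, 4, 2  # toy sizes (hypothetical); real layers are far larger

W = rand_matrix(d_out, d_in, rng)  # frozen base weight
B = rand_matrix(d_out, r, rng)     # LoRA "up" factor (scaling assumed folded in)
A = rand_matrix(r, d_in, rng)      # LoRA "down" factor
x = rand_matrix(d_in, 1, rng)      # one input column vector

# Unmerged adapter: two extra small matmuls on every forward pass.
y_unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged adapter: fold B @ A into W once, then run a single matmul as usual.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

max_diff = max(abs(u[0] - v[0]) for u, v in zip(y_unmerged, y_merged))
print("base shape:  ", shape(W))
print("merged shape:", shape(W_merged))
print("outputs match:", max_diff < 1e-9)
```

After the merge, the forward pass reads one matrix of the same size as before, so the per-token work and memory traffic are the same as for the base model; only the stored values differ.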
