Counterintuitive: WSL2 + vllm cannot fit Qwen2.5-7B-1M on 6GB VRAM where Windows transformers can

DEV Community·tomohiro takada·21 days ago

TL;DR: I tried to run Qwen2.5-7B-Instruct-1M on a consumer laptop (RTX 3050 Laptop, 6GB VRAM) and mapped the literal feasibility frontier. All evidence is in JSON, drift-CI enforced. Three honest findings:

1. 4k context is the hard ceiling on Windows transformers + bitsandbytes int4 NF4. 5k, 6k, and 8k all OOM at the first attention forward pass. The 4k cell passes only because Windows kernel shared-memory PCIe spillover (WDDM overcommit) lets allocations spill to system RAM at a ~10x latency tax; peak measured usage was 10.8GB on a 6GB GPU.
2. WSL2 + vllm cannot even fit the model. The vllm 0.7.3 memory profile log reads, literally: "model weights take 5.43GiB; PyTorch activation peak memory takes 1.42GiB; the rest of the memory reserved for KV Cache is -0.94GiB". That means 0 GPU cache blocks allocated and 0.00x concurrency at 4200 tokens. The Linux nvidia driver does not provide an equivalent shared-memory fallback, so vllm sees only the physical 6GB and refuses.…
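The negative KV cache figure in the vllm log can be reproduced with plain subtraction. The sketch below assumes the budget is computed as usable GPU memory minus model weights minus activation peak, which is what the wording of the log line suggests; the function names are hypothetical, and the usable-memory figure is back-derived from the three logged numbers rather than reported in the post.

```python
# Minimal sketch of the memory arithmetic implied by the vllm log line.
# Assumption: kv_cache = usable_memory - weights - activation_peak.
# Function names are illustrative, not part of vllm's API.

def kv_cache_budget_gib(usable_gib: float, weights_gib: float,
                        activation_peak_gib: float) -> float:
    """Memory left for the KV cache; a negative value means the model does not fit."""
    return usable_gib - weights_gib - activation_peak_gib


def implied_usable_gib(weights_gib: float, activation_peak_gib: float,
                       kv_cache_gib: float) -> float:
    """Back out the usable memory that the logged numbers imply."""
    return weights_gib + activation_peak_gib + kv_cache_gib


# Values quoted in the log: weights 5.43 GiB, activations 1.42 GiB, KV cache -0.94 GiB.
WEIGHTS_GIB = 5.43
ACTIVATION_PEAK_GIB = 1.42
LOGGED_KV_CACHE_GIB = -0.94

usable = implied_usable_gib(WEIGHTS_GIB, ACTIVATION_PEAK_GIB, LOGGED_KV_CACHE_GIB)
budget = kv_cache_budget_gib(usable, WEIGHTS_GIB, ACTIVATION_PEAK_GIB)

print(f"implied usable memory: {usable:.2f} GiB")
print(f"KV cache budget:       {budget:.2f} GiB")
print(f"model fits:            {budget > 0}")
```

Under that assumption, weights plus activations come to 6.85 GiB against an implied usable budget of about 5.91 GiB, so nothing is left for the KV cache regardless of context length.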
