I wrote a custom CUDA inference engine to run Qwen3.5-27B on $130 mining cards

DEV Community·Haru-neo·30 days ago

#ai #llm #opensource #software #qengine #korean

I bought four NVIDIA CMP 100-210 cards off the secondhand market for about $130 each. They are ex-mining cards based on the Volta GV100 die (the same silicon as the V100), with 16 GB of HBM2 each. On paper, four of them give me 64 GB of HBM2 for the price of a single used 3090. In practice, NVIDIA has crippled them in hardware.

The throttle

The CMP 100-210 has its tensor cores throttled 64×. HMMA latency is stretched from 8 cycles to 512. cuBLAS WMMA caps out at about 5 TFLOPS per card. PCIe is locked to Gen1 x1, with no P2P and no NVLink. CUPTI is blocked, so you can't even use NVIDIA's own profiler.

The throttle is enforced by an e-fuse + PMU bootrom double-lock on the die. This isn't a firmware switch; it's blown into the silicon. There is no software unlock. (Yes, I tried.)

The result: anything that goes through cuBLAS tensor cores runs at 1/64 speed or fails outright. That's vLLM, llama.cpp's default cuBLAS path, FlashAttention, bitsandbytes, PyTorch's default matmul.…
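As a quick sanity check on the figures quoted above, here is a minimal Python sketch of the arithmetic. The card count, price, memory size, and HMMA cycle counts come from the article. The `weights_gb` helper is hypothetical: it assumes roughly 27 billion parameters (read off the model name) and illustrative bit widths, and it counts weights only, ignoring KV cache and activations.

```python
# Figures quoted in the article.
CARDS = 4
PRICE_USD_PER_CARD = 130
HBM2_GB_PER_CARD = 16
HMMA_CYCLES_STOCK = 8
HMMA_CYCLES_THROTTLED = 512


def rig_summary(cards=CARDS, price=PRICE_USD_PER_CARD, gb=HBM2_GB_PER_CARD):
    """Total cost and total HBM2 for the whole rig."""
    return {"total_usd": cards * price, "total_gb": cards * gb}


def throttle_factor(stock=HMMA_CYCLES_STOCK, throttled=HMMA_CYCLES_THROTTLED):
    """How many times longer an HMMA instruction takes when throttled."""
    return throttled // stock


def weights_gb(params_billion, bits_per_weight):
    """Hypothetical helper: weight storage in GB (1 GB = 1e9 bytes).

    Assumption, not from the article: weights only, no KV cache,
    no activations, no per-block quantization overhead.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    print(rig_summary())
    print("throttle factor:", throttle_factor())
    # Assumed parameter count (27e9) and bit widths, for illustration only.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weights_gb(27, bits)} GB")
```

Under these assumptions, the 512-cycle latency is exactly the 64× slowdown the article states, and the weight footprint at each bit width can be compared against the 64 GB total, or the 16 GB per card if the model has to be split across cards.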
