I got GPT-2 running on a $3 Arduino. No cloud. No subscription. Just quantization.
The Problem:
Local LLMs are great until you try to run them on real hardware. GPT-2 takes 500MB+ just for the KV cache. On an embedded device? Forget it.
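
For a rough sense of where KV cache memory goes, here is a back-of-the-envelope estimator. It is only a sketch: the post does not say which GPT-2 variant, context length, or batch size the 500MB figure refers to, so the configuration below (GPT-2 small's published 12 layers, 12 heads, head dimension 64, 1024-token context, batch size 1) is an assumption, and `kv_cache_bytes` is a hypothetical helper, not part of kvquant.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len,
                   batch_size=1, bits_per_value=16):
    """Estimate KV cache size in bytes.

    Each layer stores two tensors (keys and values), each shaped
    [batch, heads, seq_len, head_dim].
    """
    n_values = 2 * n_layers * batch_size * n_heads * seq_len * head_dim
    return n_values * bits_per_value // 8


# Assumed configuration: GPT-2 small at its full 1024-token context.
fp16_bytes = kv_cache_bytes(12, 12, 64, 1024, bits_per_value=16)
one_bit_bytes = kv_cache_bytes(12, 12, 64, 1024, bits_per_value=1)

print(f"FP16 cache:  {fp16_bytes / 2**20:.2f} MiB")
print(f"1-bit cache: {one_bit_bytes / 2**20:.2f} MiB")
print(f"Ratio:       {fp16_bytes / one_bit_bytes:.0f}x")
```

The cache grows linearly with layers, context length, and batch size, so larger variants or batched inference push the number up quickly.
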
The Solution: KVQuant
I compressed the KV cache from full precision down to 1 bit per value using per-channel symmetric quantization, with INT8 kept for the attention scores, where precision matters more.
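
To make the idea concrete, here is a minimal, dependency-free sketch of per-channel symmetric 1-bit quantization plus a symmetric INT8 quantizer. This is not kvquant's actual code: every function name below is hypothetical, the choice of mean absolute value as the per-channel scale is an assumption (it is the least-squares optimum for a sign-only code), and a real implementation would use vectorized kernels rather than Python lists.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are tokens, columns are channels


def quantize_1bit_per_channel(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Keep only the sign of each value, plus one scale per channel (column)."""
    n_rows, n_cols = len(x), len(x[0])
    # Assumed scale: mean |x| over the channel.
    scales = [sum(abs(row[c]) for row in x) / n_rows for c in range(n_cols)]
    signs = [[1 if v >= 0 else -1 for v in row] for row in x]
    return signs, scales


def dequantize_1bit(signs: List[List[int]], scales: List[float]) -> Matrix:
    """Reconstruct each value as sign * channel scale."""
    return [[s * scales[c] for c, s in enumerate(row)] for row in signs]


def pack_bits(flat_signs: List[int]) -> bytes:
    """Pack +1/-1 signs into bytes, 8 values per byte (bit set means +1)."""
    out = bytearray((len(flat_signs) + 7) // 8)
    for i, s in enumerate(flat_signs):
        if s > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def quantize_int8_symmetric(values: List[float]) -> Tuple[List[int], float]:
    """Map [-max_abs, max_abs] onto the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


# Toy "key" matrix: 3 tokens, 2 channels.
keys = [[0.5, -2.0], [-1.5, 4.0], [1.0, -3.0]]
signs, scales = quantize_1bit_per_channel(keys)
approx = dequantize_1bit(signs, scales)
packed = pack_bits([s for row in signs for s in row])

# Toy attention scores kept at INT8.
scores_q, scores_scale = quantize_int8_symmetric([1.27, -0.64, 0.0])

print(scales)    # one scale per channel
print(approx)    # lossy reconstruction of keys
print(len(packed), "byte(s) for", len(keys) * len(keys[0]), "values")
print(scores_q)
```

The 1-bit path stores one bit per value plus a single float per channel, which is where the memory saving comes from. The INT8 path keeps 255 levels, so it loses far less precision.
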
Results:
- 3.2x faster inference (2.1s down to 0.65s)
- 73% memory reduction (520MB down to 140MB)
- Runs on ESP32-class hardware
Code:

    # Load GPT-2 with 1-bit quantization
    from kvquant import QuantizedModel
    model = QuantizedModel("gpt2", bits=1)
    # Generate a continuation of the prompt
    model.generate("Hello world")

Benchmark:
| Model | Memory | Latency |
|-------|--------|---------|
| FP16 GPT-2 | 520MB | 2.1s |
| KVQuant-1b | 140MB | 0.65s |
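
The headline numbers follow directly from the two table rows. A quick check, using only the figures in the table (the variable names are just for illustration):

```python
# Figures copied from the benchmark table.
fp16_mem_mb, quant_mem_mb = 520, 140
fp16_latency_s, quant_latency_s = 2.1, 0.65

memory_reduction = 1 - quant_mem_mb / fp16_mem_mb
speedup = fp16_latency_s / quant_latency_s

print(f"Memory reduction: {memory_reduction:.0%}")  # 73%
print(f"Speedup: {speedup:.1f}x")                   # 3.2x
```
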
GitHub: https://github.com/AmSach/kvquant
This isn't a demo — it's a real quantization library with INT8 kernels and hardware-aware optimizations. Pull the repo, run the examples, see for yourself.

