---
title: "KV Cache Quantization for On-Device Android LLM Inference"
published: true
description: "A hands-on guide to fitting a 7B LLM into 4GB Android RAM using INT4 KV cache quantization, sliding window eviction, and ashmem memory mapping."
tags: android, kotlin, mobile, architecture
canonical_url: https://mvpfactory.co/blog/kv-cache-quantization-on-device-android-llm-inference
---

## What We Are Building

By the end of this tutorial, you will understand how to run a 7B parameter LLM on a 4GB Android device without getting OOM-killed. We will walk through three techniques that work together: quantizing attention key-value caches from FP16 to INT4, implementing a sliding window eviction policy with anchor tokens, and using Android-specific `ashmem` memory mapping with `madvise` hints to keep your app's memory footprint safe.

Let me show you a pattern I use in every project that involves on-device inference.…
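
To see why the KV cache matters on a 4GB device, a rough size estimate helps. The sketch below is not taken from the tutorial's own code. It assumes a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128) and a 4096-token context, and `KvShape` and `kvCacheBytes` are hypothetical names used only for illustration.

```kotlin
// Hypothetical model shape (assumption: a Llama-2-7B-style configuration).
data class KvShape(val layers: Int, val kvHeads: Int, val headDim: Int)

// Bytes needed to cache keys and values for `tokens` positions
// when each cached element takes `bitsPerElement` bits.
fun kvCacheBytes(shape: KvShape, tokens: Int, bitsPerElement: Int): Long {
    // The factor 2 accounts for storing both keys and values.
    val elementsPerToken = 2L * shape.layers * shape.kvHeads * shape.headDim
    return elementsPerToken * tokens * bitsPerElement / 8
}

fun main() {
    val shape = KvShape(layers = 32, kvHeads = 32, headDim = 128)
    val tokens = 4096
    val mib = 1024L * 1024L
    println("FP16: ${kvCacheBytes(shape, tokens, 16) / mib} MiB") // FP16: 2048 MiB
    println("INT4: ${kvCacheBytes(shape, tokens, 4) / mib} MiB")  // INT4: 512 MiB
}
```

Under these assumptions, the FP16 cache alone would take half of a 4GB device's RAM, while INT4 cuts it to a quarter of that. Real INT4 schemes also store per-group scale factors, so the actual figure lands somewhat above 512 MiB.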