Every GPU inference container has the same problem: by default, the Kubernetes Horizontal Pod Autoscaler (HPA) can't see the GPU. You scale on CPU and memory while your GPU sits at 95% utilization, completely invisible to the autoscaler. Or worse: your GPU is idle and you're paying $3/hour for an instance doing nothing.

I built keda-gpu-scaler to fix this. It's a KEDA external scaler that reads real GPU metrics via NVIDIA NVML and drives Kubernetes autoscaling decisions, including scale-to-zero.

This post covers the Docker-specific parts: how GPU metrics flow from the NVIDIA Container Toolkit through Docker to KEDA, and how to build GPU-aware containers that actually scale.…
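To make the idea of scaling on GPU utilization concrete, here is a minimal sketch of the kind of decision an autoscaler makes once it can see the GPU. It is not keda-gpu-scaler's actual code: the function name `desired_replicas`, its parameters, and the 70% target are all hypothetical, and the math just follows the standard HPA rule of `ceil(current_replicas * current_metric / target_metric)` with an added scale-to-zero threshold. In a real setup the utilization samples would come from NVML (for example via `nvmlDeviceGetUtilizationRates`); here they are plain numbers so the sketch runs anywhere.

```python
import math


def desired_replicas(current_replicas, gpu_util_percent, target_util_percent,
                     max_replicas, activation_util_percent=0.0):
    """Hypothetical helper: pick a replica count from per-pod GPU utilization.

    gpu_util_percent is a list of utilization samples (0-100), one per pod.
    Everything here is an illustrative assumption, not a real KEDA API.
    """
    if not gpu_util_percent:
        # No samples (e.g. nothing is running): keep the current count.
        return current_replicas

    avg_util = sum(gpu_util_percent) / len(gpu_util_percent)

    # Scale to zero when utilization is at or below the activation threshold.
    if avg_util <= activation_util_percent:
        return 0

    # Standard HPA-style ratio: ceil(current * metric / target).
    base = max(current_replicas, 1)
    wanted = math.ceil(base * avg_util / target_util_percent)

    # Clamp to at least one replica and at most max_replicas.
    return max(1, min(wanted, max_replicas))


if __name__ == "__main__":
    # Two pods pinned at 95% with a 70% target: ceil(2 * 95 / 70) = 3.
    print(desired_replicas(2, [95, 95], 70, max_replicas=10))  # 3
    # Three idle pods: utilization is at the activation threshold, so 0.
    print(desired_replicas(3, [0, 0, 0], 70, max_replicas=10))  # 0
```

The point of the sketch is the contrast with CPU-based scaling: the first call adds a replica even if CPU and memory look quiet, and the second releases the GPU instances entirely instead of paying for idle hardware.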