Every GPU Container Bug I've Hit on OKE (and How I Fixed Them)



DEV Community·Pavan Madduri·20 days ago




Running GPU containers on Kubernetes is one of those things that works perfectly in tutorials and then breaks in confusing ways on real clusters. I've been deploying GPU workloads on OKE for a few months now, and I've built up a decent collection of debugging war stories. This isn't a getting-started guide. This is the post I wish existed the first time I saw CrashLoopBackOff on a GPU pod with zero useful logs.

Bug 1: Pod Stuck in Pending - "0/3 nodes are available"

This was my first GPU deployment on OKE. Created a pod requesting nvidia.com/gpu: 1, and it just sat in Pending forever.

    $ kubectl describe pod vllm-inference-0
    Events:
      Warning  FailedScheduling  0/3 nodes are available: 3 Insufficient nvidia.com/gpu

Three nodes, but none had GPUs. Turns out I created the GPU node pool but it hadn't finished scaling up yet. OKE provisions GPU nodes on-demand when you create the node pool, and it takes 3-5 minutes for the instances to come up.

Fix: Just wait.…
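While waiting, it can help to confirm whether any node is actually advertising nvidia.com/gpu yet. The following is a minimal sketch, not the author's method: it assumes you feed it the JSON produced by `kubectl get nodes -o json`, and the function names `gpu_allocatable_by_node` and `total_allocatable_gpus` are hypothetical helpers made up for illustration.

```python
import json
from typing import Dict

# Extended resource name the scheduler complained about in the event above.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_allocatable_by_node(nodes_json: str) -> Dict[str, int]:
    """Map node name -> allocatable GPU count.

    Assumes `nodes_json` is the text output of `kubectl get nodes -o json`
    (a List object whose "items" are Node objects).
    """
    items = json.loads(nodes_json).get("items", [])
    counts: Dict[str, int] = {}
    for node in items:
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Resource quantities arrive as strings; a node that does not
        # advertise the resource at all is treated as having 0 GPUs.
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def total_allocatable_gpus(nodes_json: str) -> int:
    """Sum of allocatable GPUs across all nodes in the listing."""
    return sum(gpu_allocatable_by_node(nodes_json).values())
```

If the total is 0, a pod requesting nvidia.com/gpu: 1 has nowhere to land, which matches the "Insufficient nvidia.com/gpu" event. Once a GPU node reports a non-zero count, the scheduler has a candidate node for the pending pod.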
