Setting up Ray on GKE: How I spent a week optimising Docker pulls

DEV Community·Emin Mammadov·23 days ago
#llmops #ray #kubernetes #gcp #cluster #node

I spent a week debugging slow Ray cluster starts on GKE. The fix was a region mismatch that is not very obvious from the docs.

We've been running Ray on GKE (with Anyscale) for over a year on the AI Platform team at Geotab. As self-hosted LLM workloads grow, Ray is one of the tools that makes scaling them practical. Introducing Ray and making it a go-to platform for multiple teams has been a rewarding but challenging path.

One issue I kept running into: slow Ray cluster spawn times. Here's where the time actually went, and what helped.

1. GKE node provisioning: 2-3 minutes

When Ray's autoscaler asks for a new node, GKE has to allocate a VM, boot the OS, register the kubelet, and join the cluster. GPU nodes add another 30-50 seconds for driver install. We treated this as a baseline cost - no point optimizing anything else until the node exists. That recently changed a bit as GCP introduced GKE Active Buffer that aims to minimize that time. I haven't tested it yet, but it's on the list.

2.…
