How I finally stopped rewriting deployment configs every time I switched GPU providers

DEV Community·yukixing6-star·29 days ago

#gpu #devops #machinelearning #cloudcomputing #provider #workload

I’ve been running GPU inference workloads for about two years now, and for most of that time I had the same problem: every time I wanted to move a workload to a different provider, I was essentially starting from scratch on the deployment config. Not because the actual workload changed. The code was the same, the container was the same. But all the infrastructure glue (the scheduling constraints, the node selectors, the provider-specific API calls, the health check logic) was baked into the config in ways that assumed a specific provider’s environment. Moving meant unpicking all of that and rebuilding it for wherever we were going.

I tried a few things to fix this. Terraform helped with provisioning but didn’t solve the actual problem. I could terraform my way to nodes on a different provider, but I still had to tell each workload where to run and update that when things changed. I tried writing an abstraction layer that sat between our deployment scripts and the provider APIs. That worked for a while.…
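
For illustration only, here is a minimal Python sketch of what an abstraction layer between deployment scripts and provider APIs might look like. It is not taken from the post: `WorkloadSpec`, `ProviderA`, `ProviderB`, `render_for`, the `example.com/gpu-memory-gb` label key, and the instance-type naming scheme are all hypothetical. The idea is that the workload is described once, and each adapter translates that description into one provider's config shape.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class WorkloadSpec:
    """Provider-neutral description of a workload (hypothetical schema)."""
    name: str
    image: str
    gpu_count: int
    gpu_memory_gb: int
    health_path: str = "/healthz"


class ProviderAdapter(Protocol):
    """Anything that can turn a WorkloadSpec into a provider-specific config."""
    def render(self, spec: WorkloadSpec) -> dict: ...


class ProviderA:
    """Hypothetical provider that places workloads via node labels."""
    def render(self, spec: WorkloadSpec) -> dict:
        return {
            "name": spec.name,
            "image": spec.image,
            # Assumed label key, made up for this example.
            "nodeSelector": {"example.com/gpu-memory-gb": str(spec.gpu_memory_gb)},
            "resources": {"gpu": spec.gpu_count},
            "healthCheck": {"httpGet": spec.health_path},
        }


class ProviderB:
    """Hypothetical provider that expects a named instance type instead."""
    def render(self, spec: WorkloadSpec) -> dict:
        return {
            "service": spec.name,
            "container": spec.image,
            # Assumed naming scheme, made up for this example.
            "instance_type": f"gpu-{spec.gpu_memory_gb}gb-x{spec.gpu_count}",
            "probe": {"path": spec.health_path},
        }


# Registry of adapters; switching providers means changing one key,
# not rewriting the workload description.
ADAPTERS: dict[str, ProviderAdapter] = {
    "provider-a": ProviderA(),
    "provider-b": ProviderB(),
}


def render_for(provider: str, spec: WorkloadSpec) -> dict:
    """Render the same spec for the named provider."""
    try:
        adapter = ADAPTERS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return adapter.render(spec)
```

With this shape, the spec for a workload is written once and `render_for("provider-a", spec)` or `render_for("provider-b", spec)` produces the provider-specific config. The trade-off is that every provider quirk has to be expressible through the shared spec, and each adapter must be maintained as the provider's API changes.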
