ran this into the ground before finding something that works at production volume. writing it up because the standard recommendations don’t account for what happens when Chinese models are doing real inference work at real scale. the problem: running DeepSeek V3 for cost-sensitive tasks, Qwen 2.5 for multilingual, GPT-4o for the rest. three providers, three sets of credentials, three rate limit systems, three integrations that break on independent schedules when providers push updates. the “just use an API aggregator” answer works for the western model side. for DeepSeek and Qwen specifically the latency is higher than acceptable because aggregators are proxying API calls rather than handling compute at the infrastructure level. the per-token pricing at production volume also compounds in ways the headline rates don’t communicate. the DIY routing layer approach worked until DeepSeek pushed an API update on a Friday. spent the weekend fixing an integration that had nothing to do with our actual product.…