Executive summary

In AI infrastructure, the bottleneck no longer lies in raw compute but in inference placement. As models scale, a unified three-layer architecture of hyperscale cloud, regional data centers, and edge nodes is replacing the traditional “cloud vs. edge” debate. Because preprocessing and embedding are now primary bottlenecks, compute must sit near the data source to reduce bandwidth costs. Distributed architectures also ease power, cooling, and water-use limits by spreading thermal loads across smaller facilities. Success depends on “placement flexibility”: the ability to route each workload based on payload size, hardware needs, and traffic spikes. Ultimately, keeping an AI system viable requires a flexible control plane that can adapt as bottlenecks migrate across the infrastructure stack.
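To make “placement flexibility” concrete, here is a minimal sketch of a routing decision across the three layers. It is an illustration under stated assumptions, not an implementation described in the text: the names `Workload`, `TierState`, and `place`, the 50 MB payload threshold, and the 85% utilization ceiling are all hypothetical. The idea is simply that large payloads prefer the tier closest to the data, hardware needs filter out tiers that cannot run the model, and a traffic spike on one tier pushes work to the next.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class Tier(Enum):
    EDGE = "edge"
    REGIONAL = "regional"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Workload:
    payload_mb: float        # input size that would have to cross the network
    needs_accelerator: bool  # True if the model needs hardware some tiers lack


@dataclass(frozen=True)
class TierState:
    has_accelerator: bool
    utilization: float       # current load, from 0.0 (idle) to 1.0 (saturated)


# Hypothetical thresholds, chosen only for illustration.
LARGE_PAYLOAD_MB = 50.0
MAX_UTILIZATION = 0.85


def place(workload: Workload, tiers: Mapping[Tier, TierState]) -> Tier:
    """Pick a tier from payload size, hardware needs, and current load."""
    # Large payloads prefer tiers near the data source to save bandwidth;
    # small payloads can travel cheaply to the hyperscale cloud.
    if workload.payload_mb >= LARGE_PAYLOAD_MB:
        preference = [Tier.EDGE, Tier.REGIONAL, Tier.CLOUD]
    else:
        preference = [Tier.CLOUD, Tier.REGIONAL, Tier.EDGE]

    # Drop tiers that cannot satisfy the hardware requirement.
    capable = [
        t for t in preference
        if tiers[t].has_accelerator or not workload.needs_accelerator
    ]
    if not capable:
        raise ValueError("no tier satisfies the hardware requirement")

    # Take the most preferred tier that is not in a traffic spike.
    for tier in capable:
        if tiers[tier].utilization < MAX_UTILIZATION:
            return tier

    # Every capable tier is saturated: fall back to the least loaded one.
    return min(capable, key=lambda t: tiers[t].utilization)


tiers = {
    Tier.EDGE: TierState(has_accelerator=False, utilization=0.30),
    Tier.REGIONAL: TierState(has_accelerator=True, utilization=0.50),
    Tier.CLOUD: TierState(has_accelerator=True, utilization=0.40),
}

print(place(Workload(payload_mb=200, needs_accelerator=False), tiers))  # Tier.EDGE
print(place(Workload(payload_mb=200, needs_accelerator=True), tiers))   # Tier.REGIONAL
```

In a real control plane the thresholds and preference order would come from measured bandwidth costs and capacity data rather than constants, so the same routing function can be retuned as the bottleneck moves.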