Built this for the AMD x lablab hackathon. One English sentence becomes a 30-second cinematic reel with characters, story, music, and per-shot voice-over. ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.

Code: github.com/bladedevoff/studiomi300

Architecture

The Director also doubles as the Vision Critic: the same Qwen3.5-35B checkpoint, reloaded with a different system prompt, covers both roles. Saves 70 GB of VRAM.

Why a single MI300X

192 GB HBM3 lets four very different architectures share one card sequentially: 35B MoE director, 4B diffusion, 14B I2V MoE, 3.5B music, 82M TTS. On 24 GB consumer hardware this stack needs 4-5 separate machines wired together.

Models unload between phases via gc.collect() + torch.cuda.empty_cache(). The Director runs in a subprocess so its full memory frees on exit before Wan2.2 loads, otherwise OOM.

What the Vision Critic does

Most generative video pipelines render once and pray.…
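For reference, the unload-between-phases idea from "Why a single MI300X" can be sketched roughly as below. This is a minimal illustration, not the code from the repo: `DIRECTOR_SNIPPET`, `run_in_subprocess`, and `release_gpu_memory` are hypothetical names, and the Director is replaced by a stub that only builds a fake shot list. The point is the mechanism. A phase that runs in a child interpreter returns all of its memory to the system when that process exits, while in-process phases fall back to `gc.collect()` plus `torch.cuda.empty_cache()`.

```python
import gc
import json
import subprocess
import sys

# Hypothetical stand-in for the Director phase. A real pipeline would load
# the 35B checkpoint inside this child process and generate the shot list.
DIRECTOR_SNIPPET = """
import json, sys
prompt = json.loads(sys.stdin.read())["prompt"]
shots = [{"shot": i, "description": f"{prompt} (shot {i})"} for i in range(1, 4)]
json.dump(shots, sys.stdout)
"""


def run_in_subprocess(snippet: str, payload: dict) -> object:
    """Run one phase in a child interpreter and return its JSON result.

    Everything the child allocated is released when the process exits,
    so nothing lingers before the next model loads.
    """
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise if the phase crashed instead of returning junk
    )
    return json.loads(result.stdout)


def release_gpu_memory() -> None:
    """In-process cleanup between phases that share the parent process."""
    gc.collect()  # drop unreachable Python objects that still hold tensors
    try:
        import torch
    except ImportError:
        return  # torch not installed, nothing GPU-side to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
```

Usage would look something like `shots = run_in_subprocess(DIRECTOR_SNIPPET, {"prompt": "a fox crosses a frozen lake"})`, followed by `release_gpu_memory()` before the next model loads. Passing data over stdin/stdout as JSON keeps the parent free of any reference to the child's tensors.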