When someone uploads an hour-long podcast or a Twitch VOD to LumiClip, they expect ten short, vertical, ready-to-post clips back. Two pipelines do the heavy lifting: a highlight finder that decides what's worth clipping, and a reframer that turns landscape footage into something that looks native on a phone screen. Here's how each one actually works under the hood.

The core problem with asking one model to do everything

The first thing we tried was the obvious thing: prompt a capable LLM with the transcript and ask it to find the best clips. The signal-to-noise was terrible. A model looking at a raw hour-long transcript has no spatial sense of the video, no understanding of energy or pacing, and no way to know that two candidate clips are basically the same moment from different angles.

So we scrapped that and built a small assembly line instead. Each step is cheap, focused, and only passes its survivors to the next stage.…
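
To make the "assembly line" idea concrete, here is a minimal Python sketch of a staged filter, where each stage takes the surviving candidates and returns a smaller list. It is an illustration of the pattern only, not LumiClip's actual code: the `Candidate` shape, the stage names (`drop_short`, `drop_overlapping`, `keep_top`), the scores, and the thresholds are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate clip; the fields are assumptions for this sketch."""
    start: float  # seconds into the source video
    end: float    # seconds into the source video
    score: float  # placeholder "how interesting is this" score


# A stage is any function that takes candidates and returns the survivors.
Stage = Callable[[List[Candidate]], List[Candidate]]


def drop_short(min_len: float) -> Stage:
    """Build a stage that removes candidates shorter than min_len seconds."""
    def stage(cands: List[Candidate]) -> List[Candidate]:
        return [c for c in cands if c.end - c.start >= min_len]
    return stage


def drop_overlapping(cands: List[Candidate]) -> List[Candidate]:
    """Keep the best-scoring candidate among any that overlap in time."""
    kept: List[Candidate] = []
    for c in sorted(cands, key=lambda c: c.score, reverse=True):
        # Keep c only if it does not overlap anything already kept.
        if all(c.end <= k.start or c.start >= k.end for k in kept):
            kept.append(c)
    return kept


def keep_top(n: int) -> Stage:
    """Build a stage that keeps the n highest-scoring candidates."""
    def stage(cands: List[Candidate]) -> List[Candidate]:
        return sorted(cands, key=lambda c: c.score, reverse=True)[:n]
    return stage


def run_pipeline(cands: List[Candidate], stages: List[Stage]) -> List[Candidate]:
    """Run each stage in order; only survivors reach the next stage."""
    survivors = list(cands)
    for stage in stages:
        survivors = stage(survivors)
    return survivors


candidates = [
    Candidate(10, 40, 0.9),
    Candidate(12, 42, 0.8),     # nearly the same moment as the first
    Candidate(100, 103, 0.95),  # high score but only three seconds long
    Candidate(200, 245, 0.7),
]
clips = run_pipeline(candidates, [drop_short(15.0), drop_overlapping, keep_top(10)])
```

Ordering the stages from cheapest to most expensive is the point of the pattern: anything a trivial check can reject never reaches a costlier step.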