Most video large language models still operate on pre-recorded clips and stop after each inference. The emerging expectation that a model can watch a live feed and answer questions instantly remained out of reach until AURA demonstrated continuous processing on a streaming pipeline. Earlier streaming attempts treated the visual front-end and the language back-end as separate stages, and often limited interaction to caption-style narration or required an explicit trigger before each response. Those designs struggled with open-ended question answering and with maintaining context over long horizons. AURA unifies a video encoder with an LLM and adds a sliding-window history that reuses prefix key-value caches, which keeps per-step latency bounded. In practice the framework “supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators” [1].…
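To make the sliding-window idea concrete, the following is a minimal toy sketch, not AURA's actual implementation. It assumes one plausible reading of the description above: a fixed prefix (for example a system prompt) is encoded once and its cache is reused, while only a bounded number of recent chunks stay in the window. The class name `SlidingWindowHistory`, its fields, and the use of token lists as stand-ins for key-value cache entries are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SlidingWindowHistory:
    """Toy bounded history with a prefix that is encoded only once.

    Token lists stand in for key-value cache entries; no real model is used.
    All names here are illustrative assumptions, not AURA's API.
    """
    prefix: list             # fixed prefix whose cache is reused every step
    max_chunks: int          # maximum number of recent chunks kept in the window
    chunks: deque = field(default_factory=deque)
    prefix_encodes: int = 0  # how many times the prefix was encoded

    def __post_init__(self):
        # The prefix is "encoded" once at construction and never again.
        self.prefix_encodes += 1

    def append(self, chunk):
        """Add a new chunk (frames or dialogue tokens); evict the oldest if full."""
        self.chunks.append(list(chunk))
        while len(self.chunks) > self.max_chunks:
            self.chunks.popleft()

    def context(self):
        """Return the tokens visible to the model at the current step."""
        tokens = list(self.prefix)
        for chunk in self.chunks:
            tokens.extend(chunk)
        return tokens


history = SlidingWindowHistory(prefix=["<sys>"], max_chunks=2)
for i in range(5):
    history.append([f"frame{i}"])

# The context never grows beyond the prefix plus max_chunks chunks.
print(history.context())
```

Because the window size is capped, the amount of context the model attends to at each step stays constant no matter how long the stream runs, which is the intuition behind the bounded-latency claim. How a real system keeps cached entries valid once older chunks are evicted is not covered by this sketch.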