MotionStream: Real-Time Video Generation with Interactive Motion Controls

Summary

The central idea of MotionStream is to turn motion-controlled video generation from a slow, offline "render-and-wait" process into a real-time, interactive experience. The authors first train a high-quality but slow bidirectional "teacher" model that understands text and motion, then distill its knowledge into a lightweight, fast causal "student" model. The student generates video frames sequentially (autoregressively) at interactive speeds (up to 29.5 FPS on a single H100 GPU), letting users guide the creation of arbitrarily long videos on the fly with controls such as mouse dragging, motion transfer, and camera adjustments.

Key Points

  • Problem: Existing motion-controlled video generation models are too slow (taking minutes per video), non-causal (requiring the entire motion path upfront), and limited to short clips. This prevents any form of real-time creative interaction.

Two-Stage "Teacher-Student" Solution:

  • Stage 1: Bidirectional Teacher Model: They start with a powerful text-to-video diffusion model and add motion control to it. Instead of a computationally heavy ControlNet, they attach a lightweight track-encoding head built on sinusoidal embeddings, which is far cheaper to run. This "teacher" produces high-quality video that follows motion and text prompts but is too slow for real-time use.
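The track-conditioning idea can be sketched roughly as follows. This is a hypothetical minimal version (the function name, feature dimensions, and normalization are assumptions, not the paper's exact head): 2D point-track coordinates are lifted into sinusoidal features, in the spirit of standard positional encodings, before being projected into the backbone:

```python
import numpy as np

def sinusoidal_embed(coords: np.ndarray, dim: int = 64) -> np.ndarray:
    """Encode (x, y) track coordinates with sin/cos features (sketch).

    coords: (..., 2) array of point positions normalized to [0, 1].
    Returns an (..., 2 * dim) feature vector (dim features per axis).
    """
    half = dim // 2
    # Geometric frequency ladder, as in standard positional encodings.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = coords[..., None] * freqs                      # (..., 2, half)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*feats.shape[:-2], -1)             # (..., 2 * dim)

# One track point per frame -> one conditioning vector per frame.
emb = sinusoidal_embed(np.array([[0.25, 0.75]]))
```

Compared with a ControlNet, which duplicates a large part of the backbone, a head like this adds only a per-point embedding plus a small projection, which is why it is so much cheaper.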

  • Stage 2: Causal Student Distillation: The slow teacher is used to train a fast, causal "student" model using a technique called Self Forcing with Distribution Matching Distillation. The student learns to generate video chunk-by-chunk, making it suitable for streaming.
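The chunk-by-chunk streaming loop the student enables can be illustrated with a toy stand-in (the class and method names below are illustrative, not the paper's API): each chunk is generated conditioned only on cached past state, so frames can be emitted as soon as the user's motion input for that chunk arrives.

```python
class ToyCausalStudent:
    """Stand-in for the distilled causal model: its 'generation' is just
    an accumulating state, enough to show the streaming control flow."""

    def init_cache(self):
        return []

    def generate_chunk(self, tracks, cache):
        cache = cache + [tracks]      # causal: only past chunks are cached
        return sum(cache), cache      # toy 'latent' = running sum

    def decode(self, latent):
        return f"frame({latent})"

def stream_video(student, track_stream):
    """Yield decoded frames chunk by chunk as motion input arrives."""
    cache = student.init_cache()
    for tracks in track_stream:
        latent, cache = student.generate_chunk(tracks, cache)
        yield student.decode(latent)  # shown to the user immediately

frames = list(stream_video(ToyCausalStudent(), [1, 2, 3]))
```

The key structural point is that the loop never needs future motion input: a bidirectional teacher cannot run this way, which is why the causal distillation step is what makes streaming possible.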

Enabling Infinite-Length Video:

Generating long videos autoregressively often leads to quality degradation and "drift" over time. The paper adapts two techniques inspired by large language models to solve this:

  • Attention Sinks: The model is forced to always pay attention to the initial frame's information (the "sink"). This acts as a constant anchor, preventing the model from losing context or drifting during long generation.

  • Sliding Window & Rolling KV Cache: To maintain constant speed, the model only attends to a small, fixed-size window of recent frames in addition to the initial "sink" frame. This ensures that latency and computational cost do not increase as the video gets longer.
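A minimal sketch of this cache policy (using a list of chunk indices as a stand-in for real key/value tensors; the sizes are illustrative, not the paper's settings): on each step, evict middle entries so the cache always holds the initial sink chunk(s) plus a fixed window of the most recent chunks, keeping per-step attention cost constant no matter how long the video runs.

```python
def update_kv_cache(cache, new_kv, sink_size=1, window_size=4):
    """Append a chunk's KV entry, then evict middle entries so only the
    first `sink_size` (attention sink) and the last `window_size`
    (sliding window) entries remain."""
    cache = cache + [new_kv]
    if len(cache) > sink_size + window_size:
        cache = cache[:sink_size] + cache[-window_size:]
    return cache

cache = []
for chunk in range(8):
    cache = update_kv_cache(cache, chunk)
# cache is now [0, 4, 5, 6, 7]: the sink chunk plus the 4 most recent.
```

Because the sink entry never leaves the cache, every generation step still attends to the initial context, which is what counters drift; because the window is fixed-size, the cost per step stays flat.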

Performance and Applications:

  • Speed: MotionStream achieves state-of-the-art results while being two orders of magnitude faster than previous methods, reaching up to 29.5 FPS with an optimized VAE decoder on a single H100 GPU.

  • Interactivity: This speed enables true real-time applications, such as interactive drag control (pulling an object in a scene), live motion transfer from online trackers, and dynamic camera control, where the user sees the results unfold instantly.