Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

1UCLA 2ByteDance Seed 3University of Central Florida Project Lead

Minute-Scale (1 Min 40 Sec) Video Generated by Our Model.
(This webpage contains a lot of videos. We kindly ask for your patience and suggest using Chrome for the best experience.)
SkyReels
MAGI-1
CausVid
Self Forcing

A vibrant tropical fish glides gracefully through colorful ocean reefs, surrounded by swaying coral, shimmering schools of tiny fish, and beams of sunlight filtering down from the water’s surface. The scene feels alive with movement, as bubbles rise gently and the reef glows in vivid shades of blue, orange, and pink, creating a tranquil yet dynamic underwater atmosphere.


Minute-Scale (3 Min 44 Sec) Video Generated by Our Model.
SkyReels
MAGI-1
CausVid
Self Forcing

Drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff’s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff’s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.


Minute-Scale (4 Min 15 Sec) Video Generated by Our Model.

Note that 4 minutes 15 seconds corresponds to 99.9% of the longest video the base model's positional embedding can support.

SkyReels
MAGI-1
CausVid
Self Forcing

A massive elephant walks slowly across a sunlit savannah, dust rising around its feet, the warm glow of sunset illuminating the horizon; the camera moves steadily forward alongside, emphasizing the grandeur of its stride.


Abstract

Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, because teacher models cannot synthesize long videos, extrapolating student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to guide the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20X beyond the teacher's capability, avoiding common issues such as over-exposure and error accumulation without recomputing overlapping frames as previous methods do. When scaling up computation, our method can generate videos up to 4 minutes and 15 seconds long, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50X longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency.


1X: More Samples for 5 Seconds


Cinematic closeup and detailed portrait of a reindeer in a snowy forest at sunset. The lighting is cinematic and gorgeous and soft and sun-kissed, with golden backlight and dreamy bokeh and lens flares. The color grade is cinematic and magical.

Miniature adorable monsters made out of wool and felt, dancing with each other, 3d render, octane, soft lighting, dreamy bokeh, cinematic.

Bionic prosthetic hand on dark background.

Macro cinematography, slow motion shot: A sculptor's hands shape wet clay on a wheel as it spins. The camera captures the tactile quality of the clay and the fluid motion of the sculptor's hands.


Panda bear wearing gold-plated stiletto shoes strutting with a sassy demeanor through a haute couture runway.


Intuition Behind Our Method ✨

Bi-directional diffusion can be seen as a process of gradually restoring a degraded target. We adapt it to autoregressive generation by having a short-horizon teacher refine the student's outputs and then distilling this correction knowledge back into the student model.
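The sketch below illustrates this intuition only: the student's own clip is degraded with forward-diffusion noise, the short-horizon teacher restores it, and the restored clip serves as the correction signal for distillation. Names such as `teacher.denoise` and `scheduler.add_noise` are illustrative stand-ins for the corresponding Self Forcing components, not the actual repository API.

```python
import torch

def correction_target(teacher, scheduler, student_clip, prompts, t):
    """Toy sketch: degrade the student's clip, let the teacher restore it."""
    noise = torch.randn_like(student_clip)
    degraded = scheduler.add_noise(student_clip, noise, t)  # forward diffusion
    with torch.no_grad():
        restored = teacher.denoise(degraded, t, prompts)    # teacher's refinement
    # Distillation then pulls the student's output toward `restored`.
    return restored
```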


10X: More Samples for 50 Seconds


A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.


An astronaut runs on the surface of the moon, the low angle shot shows the vast background of the moon, the movement is smooth and appears lightweight.


A stop motion animation of a flower growing out of the windowsill of a suburban house.


Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.


A gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.


20X: More Samples for 1 Minute 40 Seconds (~100 Seconds)


A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast, the view showcases historic and magnificent architectural details and tiered pathways and patios, waves are seen crashing against the rocks below as the view overlooks the horizon of the coastal waters and hilly landscapes of the Amalfi Coast Italy, several distant people are seen walking and enjoying vistas on patios of the dramatic ocean views, the warm glow of the afternoon sun creates a magical and romantic feeling to the scene, the view is stunning captured with beautiful photography.



An extreme close-up of a gray-haired man with a beard in his 60s, he is deep in thought pondering the history of the universe as he sits at a cafe in Paris, his eyes focus on people offscreen as they walk as he sits mostly motionless, he is dressed in a wool coat suit coat with a button-down shirt, he wears a brown beret and glasses and has a very professorial appearance, and at the end he offers a subtle closed-mouth smile as if he found the answer to the mystery of life, the lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film.



Visual: A night scene in a city with wet streets reflecting city lights. The camera starts on the reflection in a puddle and pulls up to reveal the source of the reflection—a glowing neon sign—then continues to pull back to show the rain-soaked streets. Camera Movement: Start focused on a close-up of the puddle's reflection, then pull up and back in one fluid motion to reveal the full context of the rainy cityscape.


50X: More Samples for 4 Minutes 15 Seconds

Both Rolling Forcing and LongLive, as well as Self-Forcing++ (Ours), are able to generate high-quality videos up to multiple minutes long, which marks a significant advance in autoregressive long video generation compared to previous methods.

LongLive

Ours

SkyReels
MAGI-1
CausVid
Self Forcing

A cinematic third-person shot of a wingsuit flyer racing through a narrow mountain valley. The flyer dives downwards, weaving smoothly between jagged cliffs as snow-capped peaks tower in the background. The wingsuit’s fabric ripples in the wind while the camera tracks from behind, emphasizing speed and freedom.


LongLive

Ours

SkyReels
MAGI-1
CausVid
Self Forcing

Flying through meadows of early flowering plants and mossy tundra with fallen logs, pollen and dust.


LongLive

Ours



A pod of dolphins leaps out of the sparkling ocean in graceful arcs, splashing back into the water as the horizon glows with sunset; the camera follows from the side, keeping a continuous rhythm with their motion.

LongLive

Ours



A timelapse of night falling in the desert, stars igniting brilliantly above.


LongLive

Ours



A volcano erupts in the distance, glowing lava rivers flowing against a darkened sky.


LongLive

Ours


Cinematic FPV aerial shot flying forward over snow-capped mountains at golden hour, skimming along a razor ridgeline then dipping into a glacier valley; wide-angle 24 mm feel, gentle spindrift, volumetric light, crisp atmosphere, 4K, 60 fps, continuous take, no cuts, smooth trajectory.


Reproduce Our Results

Our work builds directly on Self Forcing and can be readily reproduced by following the algorithm below.

In short, it can be implemented on top of Self Forcing by making the following 3 changes:

  1. Self roll out videos beyond the teacher's horizon with a rolling KV cache.
  2. Uniformly sample continuous latent frames from the long video and apply DMD with backward noise initialization.
  3. [Optional] Post-train the model with GRPO using your preferred rewards.

Our method uses exactly the same inference code as Self Forcing.
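For concreteness, here is a minimal, hedged sketch of one training step implementing the first two changes above. All names (`student`, `teacher`, `critic`, `init_kv_cache`, `generate_frame`, `denoise`, `scheduler.add_noise`) and the hyperparameter values are illustrative assumptions, not the actual Self Forcing codebase API; please consult the released code for the real implementation.

```python
import torch
import torch.nn.functional as F

def training_step(student, teacher, critic, prompts,
                  long_horizon_frames=240,  # roll out well beyond the teacher's 5 s horizon
                  window_size=21,           # latent frames handed to the distillation loss
                  kv_cache_size=21):        # rolling cache keeps memory bounded
    # (1) Self roll-out with a rolling KV cache: autoregressively generate a
    #     long video while evicting the oldest cache entries.
    cache = student.init_kv_cache(max_size=kv_cache_size)
    frames = []
    for _ in range(long_horizon_frames):
        frame = student.generate_frame(prompts, kv_cache=cache)
        cache.roll(frame)                    # drop oldest entries, append newest
        frames.append(frame)
    latents = torch.stack(frames, dim=1)     # [B, T, C, H, W]

    # (2) Uniformly sample latent frames from the self-generated video and
    #     apply DMD with backward noise initialization: re-noise the student's
    #     own output rather than starting from pure Gaussian noise.
    start = torch.randint(0, long_horizon_frames - window_size + 1, (1,)).item()
    window = latents[:, start:start + window_size]
    t = torch.randint(0, student.num_train_timesteps, (1,)).item()
    noisy = student.scheduler.add_noise(window, torch.randn_like(window), t)

    # Distribution Matching Distillation: the gradient is the gap between the
    # frozen teacher's ("real") prediction and the critic's ("fake") prediction.
    with torch.no_grad():
        real_pred = teacher.denoise(noisy, t, prompts)
        fake_pred = critic.denoise(noisy, t, prompts)
        grad = fake_pred - real_pred
    # Standard DMD stop-gradient trick: move the window along -grad.
    loss = 0.5 * F.mse_loss(window, (window - grad).detach())
    return loss
```

Step 3, GRPO post-training with a reward model, is optional and omitted from this sketch.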

Training Dynamics

Here we show that as we scale up the training compute, both image quality and motion quality improve.


A massive elephant walks slowly across a sunlit savannah, dust rising around its feet, the warm glow of sunset illuminating the horizon; the camera moves steadily forward alongside, emphasizing the grandeur of its stride.


Limitations & Future Work

Despite being able to generate videos lasting multiple minutes, our model relies solely on a 5-second short-horizon teacher and has never been trained on real data. However, our method does have the following limitations, which we plan to address in future work:

  • The model lacks long-term memory, so occluded objects may change after being blocked for an extended period of time.
  • For extremely long videos, unoccluded objects may also gradually change due to underlying continuous value drift.
  • Currently, the model cannot perform multi-event generation with high quality.
  • The maximum length our model can generate is 4 minutes 15 seconds without modifying the positional embedding.