Everything is KL minimization: The Current Autoregressive Video Generation Landscape
Author
Mark Ogata
AI and Robotics Undergraduate Researcher
Special thanks to Haven for guidance on my video generation journey!
Haven Feng
Also shoutout to Jameson for being a comrade on this quest!
Jameson Crate
Two months ago, I was a clueless undergrad diving into autoregressive video generation. Every paper I read claimed to be SOTA, but I couldn’t answer basic questions: Are these methods actually the same thing? Which is better in certain situations and why?
This post organizes my aha moments. It guides you through the space as if you’re discovering it for the first time, giving motivation for the why and how things relate to each other.
I first cover key papers and discoveries in image generation, then present a unified view of the autoregressive video generation landscape. Hopefully I can give a bird’s-eye view of the video generation space so it is easier for other beginners to get into it!
Before diving into video generation, we need to understand how image diffusion models work. If you’re already familiar with DDPM and DDIM, skip to video generation.
TL;DR: DDPM learns to reverse a noising process. The clever part: maximizing ELBO (minimizing KL divergences at each noise level) reduces to simple L2 loss on noise prediction. This makes diffusion models practical to train.
Denoising Diffusion Probabilistic Models (DDPM) introduced a principled framework for generating images by learning to reverse a gradual noising process. The key insight: if we can learn to denoise images at every noise level, we can start from pure noise and iteratively denoise to generate novel images.
▶Forward Process (Diffusion)
The forward process gradually corrupts data x0∼q(x0) by adding Gaussian noise over T timesteps:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
where β1,…,βT is a variance schedule controlling how much noise to add at each step.
The beauty of Gaussian noise is that we can sample xt directly from x0 in closed form:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$$
where αt=1−βt and
$$\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$$
This means $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.
As t→T, αˉt→0 and xt approaches pure Gaussian noise.
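As a concrete illustration, here is a minimal NumPy sketch of the closed-form forward sampling (the linear β schedule and array shapes are just illustrative choices):

```python
import numpy as np

def forward_sample(x0, t, betas, rng):
    """Sample x_t directly from x_0 via the closed-form marginal
    q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).
    Timesteps are 1-indexed to match the text."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t - 1]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)      # linear schedule, an illustrative choice
x0 = rng.standard_normal((8, 8))           # stand-in for an image
xT = forward_sample(x0, 1000, betas, rng)  # near pure noise, since alpha_bar_T ~ 1e-5
```

Note that no loop over 1000 steps is needed: the Gaussian closed form lets us jump to any noise level in one shot.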
▶Backward Process (Denoising)
The backward process learns to reverse the diffusion:
pθ(xt−1∣xt)=N(xt−1;μθ(xt,t),Σθ(xt,t))
Starting from xT∼N(0,I), we iteratively denoise: xT→xT−1→⋯→x0.
The full generative process is:
$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
▶Training Objective via ELBO
We want to maximize the log-likelihood logpθ(x0), but this is intractable. Instead, we maximize the Evidence Lower Bound (ELBO) as a surrogate loss:
$$\text{ELBO} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0 \mid x_1)]}_{\text{very last step}} - \underbrace{D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{\approx\, 0} - \sum_{t=1}^{T-1} \underbrace{\mathbb{E}_q\big[D_{KL}\big(q(x_t \mid x_{t+1}, x_0)\,\|\,p_\theta(x_t \mid x_{t+1})\big)\big]}_{L_t}$$
Intuition: Maximizing ELBO means:
Learn to reconstruct x0 from slightly noised x1
Match q(xT∣x0) to prior p(xT) (automatically satisfied if our forward process terminates at the unit Gaussian and our reverse process starts with a unit Gaussian)
At each timestep t, match the learned denoising distribution pθ(xt−1∣xt) to the true posterior q(xt−1∣xt,x0)
The key insight: we’re minimizing a weighted sum of KL divergences across noise levels.
▶From KL to L2 Loss
Both q(xt−1∣xt,x0) and pθ(xt−1∣xt) are Gaussian. For Gaussians, minimizing KL divergence reduces to matching means (when variances are fixed).
The true posterior mean can be derived using Bayes' rule:
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t$$
DDPM parameterizes the model to predict the noise $\epsilon$:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$$
Plugging this into the KL term and simplifying yields the simple L2 loss:
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
where $t \sim \mathrm{Uniform}(1, T)$, $x_0 \sim q(x_0)$, $\epsilon \sim \mathcal{N}(0, I)$, and $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$.
Key takeaway: Maximizing the ELBO (minimizing KL divergences at each noise level) reduces to minimizing L2 error in predicting the added noise. This simple loss is what makes diffusion models trainable in practice!
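To make the training loop concrete, here is a minimal sketch of one Monte-Carlo sample of $L_{\text{simple}}$. The predictor here is a hypothetical stand-in (an identity map); a real model would be a trained network:

```python
import numpy as np

def ddpm_loss(eps_pred_fn, x0, betas, rng):
    """One Monte-Carlo sample of L_simple = E ||eps - eps_theta(x_t, t)||^2."""
    T = len(betas)
    t = int(rng.integers(1, T + 1))              # t ~ Uniform(1, T)
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]
    eps = rng.standard_normal(x0.shape)          # eps ~ N(0, I)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return np.mean((eps - eps_pred_fn(xt, t)) ** 2)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = rng.standard_normal(16)
# Hypothetical (untrained) predictor: just return x_t as the "noise" guess.
loss = ddpm_loss(lambda xt, t: xt, x0, betas, rng)
```

In a real trainer this scalar would be backpropagated through the network; here it just illustrates how simple the objective is.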
▶Alternative parametrizations: predicting x0, ϵ, score, etc. are equivalent
You can equivalently train the model to predict the clean image x0 instead of noise ϵ:
$$L_{\text{simple}}^{x_0} = \mathbb{E}_{t, x_0, \epsilon}\left[\|x_0 - \hat{x}_\theta(x_t, t)\|^2\right]$$
Since $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, we have $x_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon}{\sqrt{\bar\alpha_t}}$, so these are algebraically equivalent up to different noise level weightings. The score can be written in terms of the noise, and the flow can be written in terms of the noise and $x_0$, so these are equivalent as well (check out the diffusionflow blog for why).
This lays the foundation for the argument that we should actually eliminate time from the noise schedule and just use the signal to noise ratio (I agree btw).
See diffusionflow blog and EDM paper for a unified view and more on the argument for using signal to noise ratio instead of time.
We use x0 prediction throughout this post for consistency.
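As a quick sanity check of the algebra above, a few lines of NumPy confirm that $x_0$ is exactly recoverable from $(x_t, \epsilon)$ (the noise level is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_bar = 0.7                  # some intermediate noise level
x0 = rng.standard_normal(10)
eps = rng.standard_normal(10)
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# Invert the forward process: x0 = (x_t - sqrt(1 - alpha_bar) * eps) / sqrt(alpha_bar)
x0_rec = (xt - np.sqrt(1 - alpha_bar) * eps) / np.sqrt(alpha_bar)
assert np.allclose(x0_rec, x0)
```

So a network predicting $\epsilon$ and a network predicting $x_0$ carry the same information at any given noise level; only the implicit loss weighting differs.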
TL;DR: DDIM’s breakthrough: diffusion can be modeled as an ODE, not an SDE. Same training as DDPM, but 10-50x faster sampling by removing stochasticity. This deterministic view is the foundation for modern samplers.
Denoising Diffusion Implicit Models (DDIM) made a crucial discovery: DDPM’s reverse process doesn’t need to be stochastic. By reparameterizing the reverse process, we can prove deterministic sampling creates the same marginal distributions (images over noise levels) as DDPM.
DDPM vs DDIM sampling paths in 1D. Notice how the randomness of the paths changes but the probability density at every noise level remains the same regardless of the sampling method. code ↗
▶The Key Innovation: Controlling Stochasticity
DDPM’s reverse process (backward step) is:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right)$$
where the mean is deterministic but there’s always stochastic noise β~t.
DDIM generalizes this by introducing a parameter σt that controls how much stochasticity to add:
$$q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \mu_\sigma(x_t, x_0),\ \sigma_t^2 I\right)$$
where the mean is:
$$\mu_\sigma(x_t, x_0) = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sqrt{1-\bar\alpha_t}}$$
The magic: the marginal distributions $q_\sigma(x_t)$ are identical for any choice of $\sigma_t$, as long as $\sigma_t^2 \le 1-\bar\alpha_{t-1}$.
▶Why do the marginals stay the same?
The proof involves showing that the non-Markovian forward process qσ(x1:T∣x0) can be constructed such that each qσ(xt∣x0) matches DDPM’s q(xt∣x0)=N(αˉtx0,(1−αˉt)I), regardless of σt.
The key is that the forward process becomes non-Markovian: xt depends on x0 in a way that compensates for the reduced stochasticity in the reverse process. See DDIM Appendix B or Lilian Weng’s blog for the detailed proof.
▶From Stochastic to Deterministic
The critical insight: we can set $\sigma_t = 0$ to get a deterministic reverse process:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\underbrace{\left(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right)}_{\text{predicted } x_0} + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t)$$
This is an ODE (ordinary differential equation) instead of an SDE (stochastic differential equation)! Given the same starting noise xT, we always get the same output x0.
Special cases:
σt=β~t: Recovers DDPM (fully stochastic)
σt=0: Pure deterministic sampling (DDIM)
0<σt<β~t: Interpolates between the two
▶Why This Matters
Faster sampling: ODEs can be solved with fewer steps than SDEs. DDIM can generate high-quality images in 10-50 steps vs DDPM’s 1000 steps.
Deterministic trajectories: Same noise → same image. This enables:
Consistent interpolation between images
Reproducible generation
Easier theoretical analysis
Semantic interpolation: We can interpolate in latent space xT and get meaningful interpolations in image space x0.
Foundation for future work: The deterministic ODE view opened the door for continuous-time formulations (score matching, flow matching, rectified flows) and better distillation methods.
▶DDPM vs DDIM: Training and Sampling
Training: Both use the exact same training objective! You train one model with the standard L2 loss.
Sampling: The difference is only at inference time:
DDPM: Add stochastic noise at each step (σt=β~t)
DDIM: Remove stochasticity (σt=0) and use larger timesteps
You can use either method at sampling time, no need to retrain!
Key takeaway: DDIM showed that diffusion models are solving an ODE, not fundamentally requiring stochasticity. This deterministic view is conceptually cleaner and practically faster, making it the foundation for modern diffusion samplers.
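A small NumPy sketch of the generalized reverse step makes the determinism tangible. Here `ab_t` and `ab_prev` stand for $\bar\alpha_t$ and $\bar\alpha_{t-1}$, and `eps_hat` is a stand-in for the network's noise prediction:

```python
import numpy as np

def ddim_step(xt, eps_hat, ab_t, ab_prev, sigma, rng=None):
    """One DDIM-style reverse step, generalizing DDPM via sigma.
    sigma = 0 gives a deterministic ODE step; larger sigma adds stochasticity."""
    x0_hat = (xt - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)  # predicted clean sample
    dir_xt = np.sqrt(1 - ab_prev - sigma**2) * eps_hat           # direction pointing to x_t
    noise = sigma * rng.standard_normal(xt.shape) if sigma > 0 else 0.0
    return np.sqrt(ab_prev) * x0_hat + dir_xt + noise

rng = np.random.default_rng(0)
xt = rng.standard_normal(4)
eps_hat = rng.standard_normal(4)  # stand-in for the model's prediction
a = ddim_step(xt, eps_hat, ab_t=0.5, ab_prev=0.8, sigma=0.0)
b = ddim_step(xt, eps_hat, ab_t=0.5, ab_prev=0.8, sigma=0.0)
# Deterministic: identical inputs give identical outputs.
assert np.allclose(a, b)
```

Setting `sigma` to the DDPM value recovers stochastic sampling with the exact same trained model, which is the whole point: the choice is made at inference time.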
▶The Natural Euler perspective
Speaking of conceptually cleaner, we can massively simplify our training by getting rid of the alphas and betas. Any linear transformation of the noise schedule is equivalent, so we can train with the linear schedule (using t and 1−t) and retroactively decide what noise schedule to use at inference time (just as with the DDPM/DDIM stochasticity choice).
Straighter lines = lower integration errors when we sample our ODE = fewer steps needed for the same quality = faster inference with less compute.
The rectified flow blogs and sander.ai blog go into great detail and have fantastic explanations for this so I will link them here. Please check them out.
Comparison of different noise schedules. The blue line (standard deviation of the noise factor) shows the trajectory your samples take along a conditional vector field. The green line Var(xt) shows the trajectory samples follow when source and target distributions are both unit Gaussians. Noise schedules considered harmful ↗
One tip for visualizing the schedules is that the blue line (standard deviation of the noise factor) is the trajectory your samples take along a conditional vector field (if your target distribution is just a single point). Rectified flow gives you the straight line in this case (rightmost graph). The green line Var(xt) gives you the trajectory samples will follow if your source and target distributions are both unit Gaussians. The cosine schedule gives the straightest lines in this case.
Determining the schedule that yields the straightest path is not trivial for arbitrary distributions, and hence the need for hyperparameter search to yield the straightest green lines from your target distribution to your noise distribution. Your target distribution most likely doesn’t have the symmetries of a gaussian distribution, so the optimal noise schedule still will most likely not get you exactly straight lines.
Now that we understand how image diffusion works (minimizing KL divergences via L2 loss), let’s see how these same principles extend to video generation. The key question: when generating videos frame-by-frame, how should we condition on past frames? Should they be clean, noisy, or generated?
This leads us to different training methods (BiD, TF, DF, SF) paired with different losses (L2, GAN, DMD). At first glance, they seem very different. But here’s the surprising insight: they’re all optimizing the same thing—just weighted differently.
A Unified Perspective of Video Generation Methods#
In the diffusion world, it turns out over and over again that there are many different seeming methods that all turn out to be equivalent with some small difference in the details.
One example is that different parametrizations of diffusion models (x0 prediction, ϵ prediction, v prediction), flow matching, and score matching are equivalent. Each of these objectives just emphasizes different noise levels over others (see my explanation above for details). So from now on when we say predict x0 (or any of the other targets) and compute the loss, know that you can swap in a different parametrization and end up with the same thing up to a change in which noise levels the loss emphasizes.
Another example is that different noise schedules (as long as they are an affine transformation of one another) are equivalent. You can train your model using the rectified flow schedule (the t and (1-t) one) and later use a different schedule to sample no problem. So you might as well just train using the simple noise schedule and then search for the best schedule after training to minimize ODE solver errors.
The rectified Diffusion blog has a great explanation for this.
Yet another example is the equivalence between DDPM and DDIM (stochastic sampling vs deterministic sampling) explained above
So when confronted with different video generation training methods:
Teacher Forcing with L2 loss (TF+L2), Diffusion Forcing with L2 loss (DF+L2), Self Forcing with DMD loss (SF+DMD), Self Forcing with GAN loss (SF+GAN), Bidirectional training with L2 loss (BiD+L2), Approximated CausVid (DF+DMD) …
it is natural to wonder: are they equivalent somehow?
The answer is yes… (almost)
They are almost all optimizing a combination of KL divergences across noise levels:
The Unified Framework: Method + Loss → KL Divergence#
Legend:
🔵 Forward KL: KL(p_data ∥ p_model): Mode-covering, trains on ground truth data
🔴 Reverse KL: KL(p_model ∥ p_data): Mode-seeking, trains on generated data
🟣 JSD: JSD(p_data ∥ p_model): Symmetric mix of the two
DF (Diffusion Forcing): Train on ground truth frames with independent noise per frame
SF (Self Forcing): Train on model-generated frames (self-rollout)
TF (Teacher Forcing): Train on ground truth frames, sequentially denoise last frame only
BiD (Bidirectional): Train all frames synchronously at same noise level
Note: 🟡 means hybrid that doesn’t fit the forward/reverse KL framework cleanly.
Key Patterns to Notice:
All L2 methods (DF, TF, BiD) → 🔵 Forward KL (train on ground truth)
SF + DMD → 🔴 Reverse KL (train on generated data)
SF + GAN → 🟣 JSD (measuring divergence of data and model vs a mix of the two)
Training method = what data you train on, Loss = what objective you optimize
▶What are Eforward and Ebackward?
Eforward is the expectation over the forward process. This forward process is just how noisy latents are generated. So the forward process in DF would be adding an independent random noise level to every ground truth frame. In SF the forward process is adding an independent random noise level to every generated frame. In TF the forward process is choosing a random frame, making all frames after fully noised, and making the chosen frame some random amount of noise.
The definition of the forward process is inspired from the DF paper:
In fact, all I did was translate the concept from the DF paper to the other methods.
Ebackward is the expectation over the backward process. This backward process is when the model partially denoises frames. The model starts with noisy latents and generates samples by denoising them to various noise levels (or completely to clean frames).
Both are ways to generate noisy latents, they just differ in whether it’s from the model distribution or data distribution
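A toy sketch of these forward processes may help. It uses the simple linear noising $x_u = (1-u)\,x + u\,\epsilon$ purely for illustration; the frame shapes and noise levels are arbitrary:

```python
import numpy as np

def noise_frames(frames, noise_levels, rng):
    """Noise frame i to level u_i in [0, 1] using the simple linear schedule:
    x_u = (1 - u) * x + u * eps. u = 0 leaves the frame clean, u = 1 is pure noise."""
    eps = rng.standard_normal(frames.shape)
    u = np.asarray(noise_levels)[:, None]
    return (1 - u) * frames + u * eps

rng = np.random.default_rng(0)
video = rng.standard_normal((5, 16))  # 5 ground-truth frames (flattened)

# DF forward process: independent random noise level per ground-truth frame.
df_latents = noise_frames(video, rng.uniform(0, 1, size=5), rng)

# TF forward process: pick a frame k; frames after k are fully noised,
# frame k gets a random level, frames before k stay clean.
k = 2
tf_levels = np.zeros(5)
tf_levels[k] = rng.uniform(0, 1)
tf_levels[k + 1:] = 1.0
tf_latents = noise_frames(video, tf_levels, rng)

# SF would apply the DF-style noising to *model-generated* frames instead:
# here `video` would come from a self-rollout rather than the dataset.
```

The only thing that changes between DF, TF, and SF in this picture is which frames get which noise levels, and whether the frames come from the data or the model.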
▶What is KL and JS?
They are a way to measure distance between distributions.
I highly recommend watching this video to understand the KL divergence. This blog also seems to have good intuitions for what it is.
KL divergence is directional, so it is common to see:
KL(p_data ∥ p_model) ≠ KL(p_model ∥ p_data)
The first is the forward KL. The second is the reverse KL.
If two distributions are identical the KL divergence is 0. This is important because if every method is minimizing a combination of KLs, the most optimal model for each method creates the same distribution. In general, divergence measures are 0 if two distributions are identical.
The loss landscape would change depending on which exact divergences are being optimized, but the optimal distribution created by the model is identical across methods.
The loss landscape often matters a lot in practice, however: limited model capacity, compute, data, and early stopping often lead to suboptimal models.
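A tiny numeric example makes the directionality concrete, using discrete distributions (the probabilities are made up):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; note the arguments are NOT symmetric."""
    return float(np.sum(p * np.log(p / q)))

p_data = np.array([0.5, 0.4, 0.1])
p_model = np.array([0.8, 0.1, 0.1])

forward = kl(p_data, p_model)   # mode-covering direction
reverse = kl(p_model, p_data)   # mode-seeking direction
```

Running this gives two different positive numbers, while `kl(p_data, p_data)` is exactly zero: identical distributions have zero divergence in either direction.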
Before diving into the proofs, let’s build intuition for what each training method and loss function actually does. Understanding these building blocks will make the unified framework clear.
What are the training methods for video generators today (BiD, TF, DF, and SF)?#
TL;DR: The training method determines how you condition on past frames:
BiD: Denoise all frames simultaneously (like image diffusion, but for videos)
TF: Use clean ground truth context, denoise one future frame
DF: Use noisy ground truth context (noise = continuous mask)
SF: Use self-generated frames to eliminate exposure bias
These methods are just different ways to condition your frames. The frames with t subscript are noisy and the ones without are clean. Self Forcing ↗
▶What is BiD (synchronous fixed length bidirectional training)
When you think of video generation, some things that come to mind may be Sora2 or Veo3.
These are called bidirectional video generators. They denoise a preset number of frames into a fixed-length video by denoising every frame at the exact same time. This is the most straightforward way you could imagine porting over image diffusion training: the same procedure, just on multiple frames at once.
If industry is already creating coherent meme videos with the bidirectional method, why do we need another method?
Can’t we just press the scale compute button and fix all our problems in video generation?
Here are some problems:
You can only generate a fixed length video
You cannot stream the video as it is being generated: you have to wait until the whole thing has been made before starting to watch the clip
You can’t interact with the video as it is being generated (imagine being able to press different arrow keys to make the camera angle of the video move)
Causality should flow forward in time but bidirectional models let past frames depend on future frames (This is more of a theoretical/philosophical complaint)
Some things that need these capabilities are:
World models (simulating the world to train other AI policies like robots)
Low latency video generation (Just like youtube plays your video before it fully loads)
variable length video generation
Chain of thought using images (This is an emerging field but perhaps models can think better for some things using images)
▶What is TF (Teacher Forcing)
Autoregressive video generation (AR video gen) attempts to solve all of the above problems while still generating videos quickly and at high quality.
The first AR video gen method we may think of could be inspired by the LLM field.
In the LLM field, the transformer decoder architecture with a causal mask has been used to predict the next language token given the previous language token. If we steal this idea we can come up with the following:
predict the next image frame given the previous video frames.
If we implement this idea, we would give our models some context frames from our video dataset and then present it with a frame with gaussian noise and ask it to predict the denoised version.
But as we learned from diffusion and variational autoencoders, asking the model to predict the clean image given noise in one step leads to blurry images. So we can use the idea of decreasing noise levels from image generation to add some amount of noise to the last frame and predict the clean frame.
This is called teacher forcing. The definition of teacher forcing is predicting a continuation of the sequence given ground truth (teacher) frames as conditioning.
This is exactly how LLM pretraining is done.
The astute reader may notice the training with teacher forcing for video has trouble in parallelization compared to language modeling. Teacher forcing in language with transformers can just have an upper triangular binary causal mask to train O(N) examples at the same time. (e.g. “The cat in the hat” snippet with one forward pass would predict The -> cat, The cat -> in, The cat in -> the, The cat in the -> hat)
But for video generation we denoise with different levels of noise. So our context will be full of partially noised video frames if we tried to parallelize our training naively.
This can be overcome with a special attention mask where we have the ground truth frames first, then concatenate frames with varying amount of noise after and having a special teacher forcing attention mask:
The frames with t subscript are noisy and the ones without are clean. Self Forcing ↗
The goal here is to let every noise frame look at all previous clean frames only and not look at previous noisy frames.
With this attention fix we are able to train our teacher forcing video generation model by combining ideas from both LLMs and image diffusion.
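One plausible construction of such a mask is sketched below (the exact layout in the papers may differ; this just encodes the rule above, with clean frames first and their noisy counterparts concatenated after):

```python
import numpy as np

def tf_attention_mask(n):
    """Boolean mask (True = may attend) for a training sequence laid out as
    [clean_0..clean_{n-1}, noisy_0..noisy_{n-1}]. Noisy frame i attends to the
    clean frames before it and to itself, never to other noisy frames; clean
    frames attend causally among themselves."""
    m = np.zeros((2 * n, 2 * n), dtype=bool)
    m[:n, :n] = np.tril(np.ones((n, n), dtype=bool))  # causal over clean frames
    for i in range(n):
        m[n + i, :i] = True       # noisy frame i -> clean frames 0..i-1
        m[n + i, n + i] = True    # noisy frame i -> itself
    return m

m = tf_attention_mask(3)  # 6x6: ~4x the entries of the 3x3 causal mask
```

The 2n×2n shape is exactly why the text calls this mask roughly four times bigger than a plain causal mask.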
But we have two problems:
Our attention mask seems more complicated than it needs to be (our attention mask is roughly 4 times bigger than a simple causal mask)
As we roll out our video using this model, we find that the quality of frames gets bad very quickly the longer we generate for
Exposure bias is when a model is trained on ground-truth data, but at inference time it must rely on its own predictions as input. This is a problem because errors can accumulate over time and the past frames we condition on look nothing like any video the model was trained on so the model is out of distribution.
This is also true for language models, but the space of language is much more compressed than images, and so just brute force training on massive data seems to have brought almost all language context into distribution. To see exposure bias yourself, try out youaretheassistantnow.com to see how fast you can break a language model with out-of-distribution context. The website swaps the roles of the User and Assistant, so acting very differently than a "helpful assistant" to the model's queries reveals how exposure bias can break autoregressive generation by taking the context out of the training distribution.
▶What is DF (Diffusion Forcing)
A way we could kill two birds with one stone is to allow different noise levels in our history.
First, our attention mask can be the simple binary causal attention mask from language modeling or no mask at all.
This is because our context can consist of frames at independent random noise levels, so we no longer have to hide the noisy past frames we want to train in parallel from each other. Noise serves as a continuous mask with no noise being no mask and pure noise being a complete mask.
Autoregressive generation with diffusion forcing. Noise serves as a continuous mask, allowing parallel training without complex attention masks. Yin et al., 2024 (CausVid) ↗
Second, our model trains on random levels of noise injected into our history, which we can set at test time to have a bit of noise (maybe add a bit of noise to each frame the model generates before feeding it back to the model). This brings the distributions of the history when using self generated frames vs ground truth frames closer (helping exposure bias).
Adding noise to generated frames helps mitigate exposure bias by bringing the distribution of self-generated frames closer to the training distribution.
This means we can inject a little bit of noise to our past frames when generating videos to help keep our self generated video (self rollout) in distribution.
As a bonus we unlock different denoising strategies at inference time, which reveal that teacher forcing and bidirectional video generation are actually subsets of diffusion forcing.
Teacher forcing = setting all context frames to no noise and sequentially denoising the last frame
Bidirectional video generation = setting all frames to full noise and then denoising all frames in lockstep.
We also unlock a strategy to tradeoff denoising timesteps and latency:
We can have a monotonically decreasing noise level. Let's say we use a context length of 5 frames with noise levels of [0, 0.25, 0.5, 0.75, 1].
We can do an inference step and then slide our window forward by 1 frame. This allows us to produce 1 frame per function evaluation.
The longer the window over which we go from 0 noise to full noise ([0, 0.5, 1] vs [0, 0.2, 0.4, 0.6, 0.8, 1]), the more frames it takes for a user conditioning input (such as an arrow key press) to cause a change in a generated frame, but the more denoising steps the model can spend per frame to create higher quality videos.
So cascading noise inference allows us to keep throughput high, while trading off latency and denoising steps.
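A toy bookkeeping sketch of this sliding window may help. There is no actual model here; it only tracks which frame sits at which noise level, assuming a warm-started window:

```python
# Cascading-noise inference sketch: the window holds frames at monotonically
# decreasing noise levels. Each step, one (hypothetical) model call denoises
# every frame by one level; the oldest frame reaches level 0 and is emitted,
# and a fresh pure-noise frame enters at level 1.
levels = [0.0, 0.25, 0.5, 0.75, 1.0]   # oldest -> newest

window = list(range(5))                # frame ids; frame at slot i sits at levels[i]
emitted, next_id = [], 5
for _ in range(10):                    # 10 inference steps
    emitted.append(window[0])          # oldest frame is now clean: emit it
    window = window[1:] + [next_id]    # every remaining frame shifts one level cleaner
    next_id += 1                       # fresh pure-noise frame enters at level 1.0

assert emitted == list(range(10))      # throughput: one frame per function evaluation
```

Each frame still receives `len(levels) - 1` denoising steps over its lifetime in the window, which is exactly the latency/quality tradeoff described above.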
Naturally, this leads us to want to reduce the number of denoising steps to get latency down as much as possible! If we get to one-step denoising our cascading noise inference reduces to teacher forcing inference (Note even if the inference method is the same as teacher forcing here, our training method is different!)
It seems like we found a unifying representation to encompass many of the video generation training methods so far, but we still have some challenges:
We still haven’t solved the root cause of exposure bias, leading to our generated videos still degrading in quality over time (although it seems to slow down the degradation)
Our compute requirements grow as O(N2) with the length of our video, making long video generation a challenge and often leading to videos not consistent over long time periods.
We need to explore distillation for video generation so we can improve our latency (we want our world models to be as reactive as possible to our inputs)
The first problem, exposure bias, is tackled by a new training method called self forcing.
The second problem, long-term consistency, is tackled by rolling forcing.
The third problem, distillation for low latency, is tackled by CausVid.
All of these have issues of their own and do not fully solve the problem they set out to address yet. I write about each below so feel free to skip around depending on which problem you are interested in.
▶What is SF (Self Forcing)
To solve exposure bias you might think of the following:
If having self generated frames at inference time is causing exposure bias, why don’t we have self-generated frames during training to get rid of this?
To do this you would generate frames from the model, and then evaluate the generated video as a whole with a loss to tell the model how to nudge the video towards something that has a higher likelihood under the real data distribution.
But a big problem is what loss to use. Since the model generates a sample video from its distribution, we don’t have the corresponding ground truth video from the real data distribution we can just take an L2 loss with.
Consider this example to make the problem more clear:
You decide to become an animator, and want to create animations like those stick-figure fighting videos on YouTube.
You decide to improve by drawing some animations yourself. But how do you know exactly how to improve your animation? There is no ground truth version of the animation you created to compare it to. Maybe you can look through the channel for something similar to what you drew, but you will never find an exact ground truth corresponding to it. So a simple L2 loss is impossible, because you cannot find the ground truth version of what you made. You need a holistic loss: maybe one that compares the distribution of animations you create vs. the ones on the YouTube channel and gives you a gradient.
Self forcing solves this by using 3 different distribution level losses (DMD, SiD, GAN).
We go into detail about these losses below.
The main takeaway is that self forcing (SF) does self-rollout (generating frames from scratch) to eliminate exposure bias.
▶Aside: Self Forcing is not new
Self Forcing takes the idea from professor forcing and adapts it to videos. Professor Forcing didn't get much traction, most likely because brute forcing teacher forcing with lots of compute and data seems to generalize well enough.
“our training objective matches the holistic distribution of the entire video sequence to the data distribution. In contrast, TF/DF can be understood as performing frame-wise distribution matching” - Self Forcing paper
But we show in this blog post that the connection goes much deeper than that.
What are some of the different losses used for video generators today (L2, GAN, DMD, SiD)?#
TL;DR: The loss function determines which direction of KL divergence you optimize:
L2: Need corresponding ground truth data → Forward KL
DMD: Need ground truth score (teacher model) → Reverse KL
GAN: Need ground truth dataset → mix of both
SiD: Under construction
▶What is L2 loss?
L2 loss (mean squared error) is the workhorse of diffusion models. The idea is simple: you have a ground truth video x0 from your dataset, you add some noise to get xt, and then you ask your model to predict what the clean video looks like. The L2 loss measures how far off the model’s prediction x^0 is from the actual clean video:
$$L_{L2} = \|\hat{x}_0 - x_0\|^2$$
This works great when you have the ground truth x0 to compare against. This is why L2 pairs naturally with teacher forcing (TF), diffusion forcing (DF), and bidirectional training (BiD) - in all these methods, you start with real data from your dataset and corrupt it with noise. You always know what the “right answer” is.
The magic is that minimizing L2 loss in this diffusion setup is equivalent to minimizing the forward KL divergence KL(pdata∥pmodel) assuming gaussian diffusion, which we show in the sections below.
▶What is GAN loss?
GAN (Generative Adversarial Network) loss is a loss you can use when you don’t have ground truth to compare against. This is the situation in self forcing: your model generates a video from scratch, and there’s no corresponding “correct” video in your dataset to take an L2 loss with.
The GAN solution is clever: instead of comparing your generated video to a specific ground truth, you train a discriminator network to distinguish between “real” videos from your dataset and “fake” videos from your generator. The discriminator learns to spot the difference, while the generator learns to fool the discriminator.
The loss has two parts:
Discriminator loss: Correctly classify real vs fake videos
Generator loss: Fool the discriminator (make your generated videos indistinguishable from real)
The original GAN paper proves that with optimal training, this adversarial game minimizes the Jensen-Shannon divergence JSD(p_data ∥ p_model). The JSD is defined as:
$$\mathrm{JSD}(p_{\text{data}} \,\|\, p_{\text{model}}) = \tfrac{1}{2}\,KL(p_{\text{data}} \,\|\, M) + \tfrac{1}{2}\,KL(p_{\text{model}} \,\|\, M)$$
where $M = \tfrac{1}{2}(p_{\text{data}} + p_{\text{model}})$ is the mixture distribution (the average of the two distributions). Unlike simply combining forward and reverse KL, JSD measures how far each distribution is from their midpoint, making it symmetric and bounded.
In our video generation context, we use Diffusion-GAN’s trick of discriminating at random noise levels rather than just clean videos. This gives the generator learning signals across the entire denoising trajectory, resulting in the expectation of JSD over noise:
$$\mathbb{E}_{\text{noise}}\big[\mathrm{JSD}(p_{\text{data}} \,\|\, p_{\text{model}})\big]$$
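A small numeric sketch of the JSD definition on discrete distributions (the probabilities are made up):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """JSD(p || q) = 0.5 KL(p || m) + 0.5 KL(q || m), with m the midpoint.
    Symmetric in (p, q) and bounded by log(2) in nats."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.5, 0.4, 0.1])
p_model = np.array([0.8, 0.1, 0.1])
d = jsd(p_data, p_model)
```

Unlike either KL direction alone, swapping the arguments leaves the value unchanged, which is what makes JSD a symmetric divergence.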
▶What is DMD loss?
DMD (Distribution Matching Distillation) loss is like the “reverse” version of L2 loss. While L2 works when you have ground truth to compare against, DMD works when you’re doing self rollout and generating videos from scratch - exactly the situation in self forcing.
The key idea is simple: train two diffusion models to estimate score functions:
sreal: a diffusion model trained on real videos from your dataset
sfake: a diffusion model trained on fake videos generated by your model
Then use the difference between these scores as your learning signal:
$$\nabla_\theta L_{\text{DMD}} \propto s_{\text{fake}}(x_t, t) - s_{\text{real}}(x_t, t)$$
DMD training framework. Two diffusion models estimate scores for real and fake distributions, and their difference provides the learning signal. Yin et al., 2023 (DMD) ↗
▶Why does this work?
Section 3.2 of the DMD paper shows this difference approximates the gradient of KL(p_fake ∥ p_real), the reverse KL divergence. Intuitively, s_real tells you "how to make this video more realistic" and s_fake tells you "how to make this video more fake". The difference nudges your generator away from fake and toward real. If you train s_fake on your current generator outputs often, it will be a very good approximator of your generator's score. The reason we can't just use the generator's score directly is that our generator is a one (or few) step model, so we can't query its score at an intermediate noise level.
Visualization of how the score difference (fake minus real) provides gradients that move the generator distribution toward the real data distribution.Yin et al., 2023 (DMD) ↗
Just like with GANs, we evaluate this at random noise levels across the denoising trajectory, giving us the expected reverse KL over noise levels:
$$\mathbb{E}_{\text{backward}}\big[KL(p_{\text{fake}} \,\|\, p_{\text{real}})\big]$$
The reason we evaluate the loss at different noise levels is that, for example, if our generator is randomly initialized, p_model will have very little overlap with p_data, so the scores are not well defined almost everywhere. (An intuitive example: paintings by me vs. by Van Gogh would look much more similar if we add a bunch of noise to them :) )
This makes DMD the natural pair for self forcing: SF handles the sampling (generating from scratch), while DMD handles the loss (learning without ground truth). Together they optimize reverse KL, which we show in detail below.
▶Why not just use GANs for self forcing?
You absolutely can! Self Forcing + GAN is a valid combination that optimizes JSD. DMD offers some practical advantages though: diffusion models are often easier to train than discriminators, and the score difference gives a more stable learning signal. But both are solving the same fundamental problem of learning without ground truth.
The explicit gradient formula for DMD (from DMD page 5) is, up to a timestep-dependent weighting wt:
∇θLDMD = Ez,t[ wt ( sfake(xt, t) − sreal(xt, t) ) ∂Gθ(z)/∂θ ], where xt is the forward-noised generator output.
TL;DR: DF trains on noisy ground truth frames. With L2 loss, this is equivalent to minimizing KL(pdata∥pmodel) at each noise level (the same forward KL that DDPM optimizes for images).
DF + L2 is exactly what is implemented in the Diffusion Forcing paper. The algorithm is as follows:
Sample a video x0 from your data distribution
Add independent noise per frame via the forward process
Compute the L2 loss between the model's prediction of the clean video and the actual clean video
(Note: the loss can be evaluated on the clean video, the added noise, or the velocity; these are all equivalent up to a reweighting of noise levels.)
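As a sketch, one DF + L2 training step under a standard DDPM-style schedule might look like the following; the epsilon-prediction network is a stand-in lambda and all names are illustrative:

```python
import numpy as np

# Sketch of one Diffusion Forcing + L2 training step with a DDPM-style
# schedule. The epsilon-prediction network is a stand-in; in practice it
# would be the causal video model.

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear variance schedule
alpha_bar = np.cumprod(1.0 - betas)

frames, dim = 8, 16
x0 = rng.standard_normal((frames, dim))  # a "video": 8 frames of latents

# Diffusion Forcing: each frame gets its OWN independent timestep.
t = rng.integers(0, T, size=frames)
eps = rng.standard_normal(x0.shape)
a = alpha_bar[t][:, None]
xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps   # forward process, per frame

predict_eps = lambda x, t: np.zeros_like(x)     # stand-in for the model

# Plain L2 on the added noise (equivalently the clean video or velocity,
# up to a reweighting of noise levels).
loss = np.mean((predict_eps(xt, t) - eps) ** 2)
```

The only difference from image DDPM training is the vector of per-frame timesteps `t` instead of a single scalar.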
The paper derives in detail that this algorithm is equivalent to minimizing the expected forward KL over noise levels, assuming Gaussian diffusion, in Appendix A.1. The paper justifies minimizing the expected forward KL over noise levels as a surrogate loss because it is an evidence lower bound on the expected log-likelihood of the data under the forward process.
This is all well and good, but the quantity we ultimately care about is the expected log-likelihood of the clean data; we don't care about the log-likelihood of the noisy latents.
If we instead start from maximizing the expected log-likelihood of the clean data (ignoring the noisy latents), we can derive a somewhat more general training process that depends on which trajectories through noise we care about.
TF is a subset of DF, so this should not be surprising. But instead of taking an expectation over the whole DF forward process, our forward process allows only a subset of the states of DF, so we keep only a subset of the forward KL terms.
Since the only difference is the order in which frames are denoised this is still a forward KL loss.
Just like TF, BiD is a subset of DF, but it selects the conceptual opposite of TF: denoising all frames at the same time instead of one at a time sequentially.
Since the only difference is the order in which frames are denoised this is still a forward KL loss.
TL;DR: SF trains on self-generated frames (no ground truth). DMD provides learning signals by comparing fake vs real score functions. Together they optimize reverse KL: KL(pmodel∥pdata)—the opposite direction of L2 methods.
In the training algorithm presented in the Self Forcing paper, they sample a scalar timestep shared across frames, but in the actual code they sample independent timesteps per frame. This is an important distinction: if each frame didn't have its own timestep, as in the diffusion forcing concept, our autoregressive inference algorithm would always be out of distribution, since the context frames are always clean regardless of the denoising timestep of the very last frame we are denoising.
Suggested algorithm correction: moving the timestep sampling inside the first for loop to enable independent timesteps per frame.
This notational inconsistency could be easily fixed by moving the timestep sampling inside the first for loop.
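A minimal sketch of that corrected loop (illustrative names, not the paper's code), with the timestep drawn inside the per-frame loop:

```python
import numpy as np

# Sketch of the corrected Self Forcing rollout loop: the denoising timestep
# is drawn INSIDE the frame loop, so each frame gets its own noise level,
# matching the diffusion-forcing-style training setup.

rng = np.random.default_rng(0)
T, frames = 1000, 8

timesteps = []
for i in range(frames):                # autoregressive loop over frames
    t_i = int(rng.integers(0, T))      # sampled per frame, not once outside
    timesteps.append(t_i)
    # ... denoise frame i at level t_i, conditioned on previous frames ...
```

Sampling once outside the loop would instead give every frame the same `t_i`, which is the notational version in the paper.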
TL;DR: SF generates fake videos. A discriminator learns to distinguish fake from real at random noise levels. This classic adversarial setup optimizes the expectation of Jensen-Shannon Divergence over noise levels.
SF+GAN algorithm
Full proof: SF + GAN minimizes JSD over noise levels
The GAN loss for images is known to reduce to the JSD (see the proof in the original GAN paper).
So this result should be very intuitive: we train video generators with GANs in the straightforward way, with the small change of discriminating at random noise levels, which yields an expectation of JSD over noise levels.
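A minimal sketch of this setup, with a stand-in discriminator and both the real and generated batches noised to the same randomly drawn level before being scored (all names are illustrative):

```python
import numpy as np

# Minimal sketch of the SF + GAN discriminator objective: noise both real
# and generated batches to the SAME randomly drawn level t, then apply the
# classic logistic GAN loss. The discriminator is a stand-in; at its optimum
# this loss corresponds to the JSD at noise level t.

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

real = rng.standard_normal((64, 16))        # real latents
fake = rng.standard_normal((64, 16)) + 2.0  # generated latents, off-distribution

t = int(rng.integers(0, T))                 # one random noise level
a = alpha_bar[t]
noise = lambda x: np.sqrt(a) * x + np.sqrt(1.0 - a) * rng.standard_normal(x.shape)

disc = lambda x: x.mean(axis=1)             # stand-in discriminator logit
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

d_real = sigmoid(disc(noise(real)))
d_fake = sigmoid(disc(noise(fake)))

# -E[log D(real_t)] - E[log(1 - D(fake_t))], evaluated at noise level t
d_loss = -np.mean(np.log(d_real + 1e-8)) - np.mean(np.log(1.0 - d_fake + 1e-8))
```

Averaging this loss over many sampled `t` is what gives the expectation of JSD over noise levels.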
TL;DR: Causvid is a distillation method that pairs DF with DMD loss. Two versions exist:
Unapproximated (SF+DMD): Uses self-rollout to generate samples; algorithmically identical to Self Forcing + DMD, optimizing the reverse KL.
Approximated (DF+DMD): Uses ground truth frames as a computational shortcut. Creates a hybrid distribution that doesn't fit cleanly into the forward/reverse KL framework.
We saw that DF is to L2 loss as SF is to DMD loss. DMD is like the “reverse” version of L2, useful when you don’t have the ground truth available because your generator is doing self rollout to generate data during training.
When you combine DF and DMD, you get something that is a hybrid between reverse and forward KL.
Causvid explores this and presents it as a distillation method. However, in the code, the justification for using the DF framework (using the forward process to generate noisy samples from the ground-truth data distribution) is that it approximates generating samples via self-rollout, citing DMD2 Section 4.5. So unapproximated Causvid is self forcing: if you use self-rollout to generate samples and use DMD, we established in SF+DMD that this optimizes the reverse KL divergence. The option to use the approximation or not is in the Causvid code here, and the approximation is enabled for the causal distillation presented in the paper.
Ok, but is the approximated Causvid, i.e. DF+DMD, representable as a linear combination of forward and reverse KL? The answer is no.
DMD minimizes the reverse KL, KL(pmodel∥pdata). This is because we marginalize over the model distribution when calculating the difference in scores. But DF+DMD changes the distribution we marginalize over: samples are created by picking a random timestep, running the forward process to that step, and then running a backward process. Mathematically this is writable as a pushforward distribution, but it equals neither pdata nor pmodel, nor a linear combination of the two. The detailed proof is in the appendix.
Conclusion: Causvid without the self-rollout approximation is algorithmically identical to SF+DMD. Causvid with the self-rollout approximation is DF+DMD and is not representable as a simple linear combination of forward and reverse KL divergences. I have yet to prove whether it can be represented as an f-divergence.
@article{ogata2025arvideo,
  title   = "The Autoregressive Video Generation Landscape (everything is min KL?)",
  author  = "Ogata, Mark",
  journal = "markogata.com",
  year    = "2025",
  month   = "October",
  url     = "https://markogata.com/projects/2025/arvideolandscape/"
}
[14] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Networks. NeurIPS 2014.
[21] Gao, R., Hoogeboom, E., Heek, J., De Bortoli, V., Murphy, K. P., & Salimans, T. (2024). Flow Matching: A Unifying View. Project Page.
This appendix contains detailed mathematical proofs for the claims in the main text. These derivations are optional—the main insights are captured above.
The training procedure of unapproximated Causvid involves the following:
Self-rollout: samples are generated by denoising pure noise using the generator.
A guess of the clean image is sampled from the model at a random timestep during this generation.
This prediction is noised using the forward process to a randomly picked noise level t:
xt∼F(Gθ(z),t)
and DMD then subtracts the teacher score sreal from the fake score sfake under this distribution, and sets the result as the gradient for the generator.
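The steps above can be sketched with a toy one-step generator whose real and fake scores are known in closed form (a hand-rolled illustration, not Causvid's code):

```python
import numpy as np

# Toy sketch of one unapproximated-Causvid (= SF + DMD) generator update,
# following the symbols above: z ~ N(0, I), x0 = G_theta(z), xt ~ F(x0, t),
# gradient ∝ sfake(xt, t) − sreal(xt, t). Scores are exact in this toy.

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

theta = 3.0                              # toy generator parameter
G = lambda z, th: z + th                 # one-step generator (self-rollout)

# Real data ~ N(0, 1); after forward noising the real marginal stays N(0, 1),
# and the fake marginal is N(sqrt(a) * theta, 1), so both scores are closed form.
s_real = lambda x: -x
s_fake = lambda x, sqrt_a, th: -(x - sqrt_a * th)

lr = 0.5
for _ in range(40):
    z = rng.standard_normal(128)
    x0 = G(z, theta)                     # self-rollout sample
    t = int(rng.integers(0, T))          # random noise level
    sqrt_a = np.sqrt(alpha_bar[t])
    xt = sqrt_a * x0 + np.sqrt(1 - alpha_bar[t]) * rng.standard_normal(x0.shape)
    grad = np.mean(s_fake(xt, sqrt_a, theta) - s_real(xt))  # DMD grad, dG/dθ = 1
    theta -= lr * grad

# theta has been pulled from 3.0 toward the real mean 0
```

In the real algorithm the two scores are neural networks and the generator is a full video model, but the update has the same shape.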
This procedure is captured in the integral on the right hand side
Self-forcing KL divergence formulation showing the expected reverse KL over noise levels.Yin et al., 2024 (DMD2) ↗
The expectation over t comes from picking a random noise level t to noise the prediction to
z is a unit normal variable (a pure-noise image/video)
Gθ(z) is the clean image generated by our generator's self-rollout
F is the forward process (adding Gaussian noise to match noise level t)
and sfake, sreal are the fake and real scores, estimated by the fake-score network and the teacher network respectively.
This is the gradient of the surrogate objective in DMD, which takes an expectation over noise levels t so that the score distributions overlap, as detailed in DMD Section 3.2. Removing the gradient from both the left and middle of the above equation reveals the loss landscape LDMD being optimized:
Et[KL(pfake∥preal)]
Hence, the unapproximated Causvid objective (SF + DMD) optimizes the expected reverse KL over all noise levels.
This makes sense as a surrogate to optimize: removing the expectation over t would reduce the loss to the reverse KL at the zero-noise level, i.e. minimizing the distance between the clean image/video distributions generated by the real world vs. our model:
Reverse KL divergence at the clean noise level between real and model-generated distributions.Yin et al., 2023 (DMD) ↗
Thus, the marginal distribution over training inputs is neither pdata nor pθ, but a pushforward distribution.
We write the DMD gradient update under this new distribution and compare it to the formulas for forward and reverse KL, showing that our method cannot be optimizing a loss landscape composed of a linear combination of forward/reverse KL divergences.
Proof: Approximated Causvid ≠ linear combination of forward/reverse KL