Introduction to Diffusion Models: From Core Concepts to Cutting-Edge Applications

Author: Junichiro Iwasawa
Published: April 17, 2025
Categories: Machine Learning, Diffusion models

In recent years, diffusion models have taken the generative modeling world by storm, particularly in image synthesis, often producing stunning results. This post aims to provide a comprehensive introduction, starting from the fundamental concepts and moving towards the advanced techniques that power today’s state-of-the-art models.

What are Diffusion Models?

Diffusion models are a class of generative models. While other approaches like Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Flow-based models have achieved significant success, they each come with their own challenges—GANs can suffer from training instability, VAEs rely on surrogate loss functions, and Flow models require specialized architectures for reversible transformations.

Inspired by non-equilibrium thermodynamics, diffusion models offer a different paradigm. They work through two main processes:

  1. Forward Process (Diffusion Process): Gradually add small amounts of random noise (typically Gaussian) to the data over many steps, eventually transforming the data distribution into a simple, known distribution (like a standard Gaussian).
  2. Reverse Process (Denoising Process): Learn to reverse the diffusion process. Starting from pure noise, incrementally remove the noise step-by-step to generate a sample that belongs to the original data distribution.

The core of the generative capability lies in the neural network trained to perform this “denoising” at each step. Diffusion models feature a fixed training procedure and, unlike VAEs or Flow models, typically operate with latent variables that have the same dimensionality as the original data.

The Forward Process: Turning Data into Noise

Starting with an initial data point \(\mathbf{x}_0\) drawn from the true data distribution \(q(\mathbf{x})\), the forward process defines a Markov chain that adds Gaussian noise over \(T\) discrete time steps. The transition at each step \(t\) is defined as:

\[q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I})\]

Here, \(\{\beta_t \in (0, 1)\}_{t=1}^T\) is a variance schedule, a set of hyperparameters controlling the amount of noise added at each step. Typically, \(\beta_t\) increases with \(t\), meaning more noise is added in later steps (common schedules include linear and cosine [Nichol & Dhariwal, 2021]). \(\mathbf{I}\) is the identity matrix.
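To make the schedules concrete, here is a minimal sketch of the linear and cosine variants in PyTorch. The default values (\(\beta\) from \(10^{-4}\) to \(0.02\) for the linear schedule, offset \(s = 0.008\) for the cosine schedule) follow the cited papers; the function names are my own.

```python
import torch

def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    # Linear schedule used in DDPM [Ho et al., 2020].
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T: int, s: float = 0.008) -> torch.Tensor:
    # Cosine schedule [Nichol & Dhariwal, 2021]: define alpha_bar_t directly,
    # then recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}.
    t = torch.arange(T + 1) / T
    f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999)  # clip near t = T, as in the paper, to avoid singularities
```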

The joint distribution over all noisy samples given the initial data is:

\[q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})\]

A key property of this process is that we can sample the noisy version \(\mathbf{x}_t\) at any arbitrary timestep \(t\) directly from the original data \(\mathbf{x}_0\) in a closed form. Defining \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^t \alpha_i\), the distribution of \(\mathbf{x}_t\) given \(\mathbf{x}_0\) is:

\[q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})\]

This can also be written using the reparameterization trick: \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\), where \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) is standard Gaussian noise. Essentially, \(\mathbf{x}_t\) is a scaled version of the original data plus scaled noise. As \(T \to \infty\), \(\bar{\alpha}_T \approx 0\), and \(\mathbf{x}_T\) becomes almost pure Gaussian noise \(\mathcal{N}(\mathbf{0}, \mathbf{I})\), independent of the starting \(\mathbf{x}_0\).
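In code, sampling \(\mathbf{x}_t\) directly from \(\mathbf{x}_0\) is only a few lines. This sketch builds on the schedule functions above; the `q_sample` name and tensor shapes are illustrative assumptions.

```python
T = 1000
betas = linear_beta_schedule(T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over non-batch dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps, eps
```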

The Reverse Process: From Noise back to Data

The generative process reverses the forward diffusion. We start by sampling pure noise \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) and then iteratively sample \(\mathbf{x}_{t-1}\) given \(\mathbf{x}_t\) for \(t = T, T-1, \dots, 1\) to eventually obtain a sample \(\mathbf{x}_0\).

To do this, we need the reverse transition probability \(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\). However, this distribution is intractable because it depends on the entire dataset. Therefore, we approximate it using a parameterized neural network, typically denoted as \(p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\).

The entire reverse process is defined as:

\[p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\]

where the starting noise distribution is \(p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})\). Each reverse transition step \(p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\) is usually modeled as a Gaussian distribution:

\[p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))\]

The goal of the model is to learn the mean \(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\) and the covariance \(\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\) of this reverse transition. In practice, the covariance \(\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\) is often not learned directly but is set to a fixed diagonal matrix \(\sigma_t^2 \mathbf{I}\). Common choices for \(\sigma_t^2\) include \(\beta_t\) (from the forward process) or \(\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t\) (derived theoretically). While learning the variance was explored by [Nichol & Dhariwal, 2021] (e.g., as an interpolation between \(\beta_t\) and \(\tilde{\beta}_t\)), it can sometimes lead to instability.

The Learning Objective: Predicting the Noise

How do we train the network to learn \(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\)? While the full derivation involves maximizing the Variational Lower Bound (VLB) on the data log-likelihood, the DDPM paper [Ho et al., 2020] introduced a simpler, more intuitive objective that works remarkably well in practice.

The core idea is to reparameterize the model. Instead of directly predicting the mean \(\boldsymbol{\mu}_\theta\) of the reverse step, the model is trained to predict the noise component \(\boldsymbol{\epsilon}\) that was added to the original data \(\mathbf{x}_0\) to produce \(\mathbf{x}_t\) during the forward process. Let’s denote this noise-predicting model as \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\).

Using the relationship \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\), we can express the (true) mean of the reverse step \(\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)\) (which we could compute if we knew \(\mathbf{x}_0\)) in terms of this noise \(\boldsymbol{\epsilon}\). We train our model \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) to predict this true noise \(\boldsymbol{\epsilon}\), which in turn allows us to estimate the desired mean \(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\).

Specifically, the learned mean \(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\) is parameterized using the predicted noise \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) as follows:

\[\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)\]

This equation shows that learning the noise prediction model \(\boldsymbol{\epsilon}_\theta\) is sufficient to determine the mean of the reverse step.
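To make this concrete, a single reverse step can be sketched as follows. It assumes the `betas` and `alpha_bar` tensors from the forward-process snippet and a trained `model(x_t, t)` that predicts the noise, and uses the fixed variance \(\sigma_t^2 = \tilde{\beta}_t\) from the previous section.

```python
@torch.no_grad()
def p_sample_step(model, x_t: torch.Tensor, t_index: int) -> torch.Tensor:
    # One reverse step x_t -> x_{t-1} with sigma_t^2 = beta_tilde_t (fixed, not learned).
    beta_t = betas[t_index]
    ab_t = alpha_bar[t_index]
    ab_prev = alpha_bar[t_index - 1] if t_index > 0 else torch.tensor(1.0)
    t = torch.full((x_t.shape[0],), t_index, dtype=torch.long)
    eps_pred = model(x_t, t)
    # mu_theta = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_t)
    mean = (x_t - beta_t / (1.0 - ab_t).sqrt() * eps_pred) / (1.0 - beta_t).sqrt()
    if t_index == 0:
        return mean  # no noise is added at the final step
    sigma2 = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t  # beta_tilde_t
    return mean + sigma2.sqrt() * torch.randn_like(x_t)
```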

The simplified training objective proposed in DDPM is then simply to minimize the Mean Squared Error (MSE) between the predicted noise \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) and the actual noise \(\boldsymbol{\epsilon}\) that was added:

\[L_\text{simple} = \mathbb{E}_{t \sim \mathcal{U}(1, T), \mathbf{x}_0 \sim q(\mathbf{x}_0), \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}, t)\|^2 \right]\]

Training Loop:

  1. Sample a real data point \(\mathbf{x}_0 \sim q(\mathbf{x}_0)\).
  2. Sample a random timestep \(t\) uniformly from \(\{1, \dots, T\}\).
  3. Sample standard Gaussian noise \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).
  4. Compute the noisy version \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}\).
  5. Feed \(\mathbf{x}_t\) and \(t\) into the model \(\boldsymbol{\epsilon}_\theta\) to get the noise prediction \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\).
  6. Calculate the MSE loss between \(\boldsymbol{\epsilon}\) and \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\).
  7. Update the model parameters \(\theta\) by gradient descent on this loss.
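The loop translates almost line for line into code. A minimal sketch, assuming the `q_sample`, `alpha_bar`, and `T` defined earlier and any noise-prediction network `model`:

```python
import torch.nn.functional as F

def training_step(model, x0: torch.Tensor, optimizer) -> float:
    t = torch.randint(0, T, (x0.shape[0],))             # 2. uniform timestep per sample
    x_t, eps = q_sample(x0, t, alpha_bar)               # 3-4. add noise in closed form
    eps_pred = model(x_t, t)                            # 5. predict the noise
    loss = F.mse_loss(eps_pred, eps)                    # 6. L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                    # 7. gradient step on theta
    return loss.item()
```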

Connection to Score Matching: Interestingly, this noise prediction task is closely related to score matching. The score function of a distribution \(q(\mathbf{x})\) is defined as the gradient of its log-probability density, \(\nabla_{\mathbf{x}} \log q(\mathbf{x})\). The predicted noise \(\boldsymbol{\epsilon}_\theta\) is approximately proportional to the score of the noisy data distribution at time \(t\): \(\mathbf{s}_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) \approx - \frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}}\). This links diffusion models to score-based generative models like NCSN [Song & Ermon, 2019].
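For reference, the conversion from predicted noise to a score estimate is a one-liner (reusing `alpha_bar` from the forward-process snippet):

```python
def score_from_eps(eps_pred: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # s_theta(x_t, t) ~= -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t)
    ab = alpha_bar[t].view(-1, *([1] * (eps_pred.dim() - 1)))
    return -eps_pred / (1.0 - ab).sqrt()
```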

Evolution and Applications of Diffusion Models

Following the success of DDPM, research has focused on improving performance, expanding capabilities, and addressing limitations.

Conditional Generation

Generating samples based on specific information like class labels, text descriptions, or other images.

  • Classifier Guidance: Proposed by [Dhariwal & Nichol, 2021]. This method uses a separately trained classifier \(f_\phi(y \vert \mathbf{x}_t)\) that predicts the class label \(y\) given a noisy input \(\mathbf{x}_t\). During generation, the gradient of the classifier’s log-likelihood \(\nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t)\) is used to “guide” the noise prediction towards the desired class. The modified noise prediction is: \[\bar{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - w \sqrt{1 - \bar{\alpha}_t} \nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t)\] where \(w\) is a guidance scale factor. This was used in models like ADM (Ablated Diffusion Model) and ADM-G (ADM with Guidance).

  • Classifier-Free Guidance (CFG): Proposed by [Ho & Salimans, 2021]. This popular technique avoids the need for a separate classifier. The diffusion model \(\boldsymbol{\epsilon}_\theta\) itself is trained to handle both conditional inputs (\(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)\)) and unconditional inputs (\(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)\), where \(y=\varnothing\) represents the null condition). This is often achieved by randomly dropping the condition \(y\) during training. At inference time, guidance is achieved by extrapolating from the conditional and unconditional predictions: \[\bar{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, y) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing) + w (\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing))\] This can also be written as \((w+1) \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - w \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)\). CFG is widely used in modern high-performance models like Imagen, Stable Diffusion, and GLIDE. [Nichol et al., 2022] (GLIDE) found CFG to yield better results than guidance based on CLIP embeddings.
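The CFG combination at inference time fits in a few lines. The `model(x_t, t, cond=...)` interface, with `None` standing in for the null condition \(\varnothing\), is an assumption for illustration:

```python
def cfg_eps(model, x_t: torch.Tensor, t: torch.Tensor, y, w: float) -> torch.Tensor:
    eps_uncond = model(x_t, t, cond=None)  # null condition, seen during training via dropout of y
    eps_cond = model(x_t, t, cond=y)
    # w = 1 recovers the plain conditional prediction; w > 1 strengthens guidance
    return eps_uncond + w * (eps_cond - eps_uncond)
```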

Speeding Up Sampling

The primary drawback of early diffusion models was the slow sampling speed due to the large number of steps (\(T\)). Significant progress has been made here.

  • DDIM (Denoising Diffusion Implicit Models): Proposed by [Song et al., 2021]. While sharing the same forward process as DDPM, DDIM defines a non-Markovian generative process that allows for much larger step sizes. A parameter \(\eta\) controls the stochasticity of sampling: \(\eta=0\) gives a fully deterministic sampler, while \(\eta=1\) recovers DDPM-like stochastic sampling. Deterministic DDIM sampling is significantly faster (e.g., reducing 1000 steps to 20-50) while maintaining high sample quality; it also ensures “consistency” (the same noise yields the same image) and enables latent space interpolation. A minimal update step is sketched after this list.

  • Progressive Distillation: Proposed by [Salimans & Ho, 2022]. This technique distills a trained deterministic sampler (like DDIM) into a new “student” model that takes half the number of steps. The student learns to perform two steps of the “teacher” model in a single step. This process can be repeated, exponentially reducing sampling time.

  • Consistency Models: Proposed by [Song et al., 2023]. These models learn a function \(f(\mathbf{x}_t, t) \approx \mathbf{x}_0\) that directly maps any point \(\mathbf{x}_t\) on a diffusion trajectory back to its origin (or near-origin \(\mathbf{x}_\epsilon\)). They possess a “self-consistency” property. They can be trained either by distilling a pre-trained diffusion model (Consistency Distillation, CD) or from scratch (Consistency Training, CT). They hold the potential for high-quality generation in very few steps, even just one.

  • Latent Diffusion Models (LDM): Proposed by [Rombach et al., 2022]. Instead of operating directly in the high-dimensional pixel space, LDMs first use a powerful autoencoder (Encoder \(\mathcal{E}\), Decoder \(\mathcal{D}\)) to compress the image \(\mathbf{x}\) into a lower-dimensional latent representation \(\mathbf{z} = \mathcal{E}(\mathbf{x})\). The diffusion process (often using a U-Net) is then applied entirely within this latent space. To generate an image, noise is denoised in the latent space to produce \(\mathbf{z}\), which is then mapped back to the pixel space using the decoder \(\mathcal{D}(\mathbf{z})\). This drastically reduces computational cost and memory requirements, forming the basis for models like Stable Diffusion. Regularization techniques like KL penalty or Vector Quantization (VQ) are used in the autoencoder training. Conditioning is often implemented using cross-attention mechanisms within the latent U-Net. A schematic of the pipeline follows this list.
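Here is the minimal DDIM update promised above, assuming the `alpha_bar` tensor and the noise-prediction `model` from earlier snippets; `t_prev = -1` denotes the final jump to \(\mathbf{x}_0\).

```python
@torch.no_grad()
def ddim_step(model, x_t: torch.Tensor, t_cur: int, t_prev: int, eta: float = 0.0) -> torch.Tensor:
    ab_t = alpha_bar[t_cur]
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    t = torch.full((x_t.shape[0],), t_cur, dtype=torch.long)
    eps = model(x_t, t)
    x0_pred = (x_t - (1.0 - ab_t).sqrt() * eps) / ab_t.sqrt()   # predicted clean sample
    # eta = 0: deterministic DDIM; eta = 1: DDPM-like stochasticity
    sigma = eta * ((1.0 - ab_prev) / (1.0 - ab_t)).sqrt() * (1.0 - ab_t / ab_prev).sqrt()
    dir_xt = (1.0 - ab_prev - sigma ** 2).sqrt() * eps          # direction pointing to x_t
    noise = sigma * torch.randn_like(x_t) if eta > 0 else 0.0
    return ab_prev.sqrt() * x0_pred + dir_xt + noise
```

Iterating this step over a short, subsampled sequence of timesteps (e.g., 20-50 out of the original 1000) yields the advertised speedup.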
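And the LDM pipeline schematic, where `decoder` (standing in for \(\mathcal{D}\)) and `denoise_fn` (the latent U-Net) are hypothetical interfaces, not a real library API:

```python
@torch.no_grad()
def ldm_generate(decoder, denoise_fn, latent_shape, timesteps) -> torch.Tensor:
    z = torch.randn(latent_shape)                      # start from noise in latent space
    for t_cur, t_prev in zip(timesteps[:-1], timesteps[1:]):
        z = ddim_step(denoise_fn, z, t_cur, t_prev)    # all denoising happens on z
    return decoder(z)                                  # map the clean latent back to pixels
```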

Achieving Higher Resolution and Quality

  • Cascaded Models: Employed by [Ho et al., 2022] and others. This involves a pipeline approach: first generate a low-resolution image, then use one or more super-resolution diffusion models conditioned on the low-resolution output to generate progressively higher-resolution images. “Noise Conditioning Augmentation” (adding noise to the low-resolution conditioning input) was found crucial for improving quality by mitigating error accumulation.

  • unCLIP / DALL-E 2: Proposed by [Ramesh et al., 2022]. These models leverage the powerful CLIP model for high-quality text-to-image generation. They typically involve a two-stage process: (1) A “prior” model generates a CLIP image embedding \(\mathbf{c}^i\) conditioned on the input text \(y\) (\(P(\mathbf{c}^i \vert y)\)). (2) A “decoder” diffusion model generates the final image \(\mathbf{x}\) conditioned on the image embedding \(\mathbf{c}^i\) (and optionally the text \(y\)) (\(P(\mathbf{x} \vert \mathbf{c}^i, [y])\)).

  • Imagen: Proposed by [Saharia et al., 2022]. Instead of CLIP, Imagen uses large, pre-trained frozen language models (like T5-XXL) as text encoders, finding that the scale of the text encoder was more critical than the scale of the diffusion U-Net. It introduced “Dynamic Thresholding” to improve image fidelity at high CFG scales by adaptively clipping predicted pixel values (a sketch follows the architecture list below). It also proposed an “Efficient U-Net” architecture with modifications like shifting parameters to lower-resolution blocks and optimizing the order of up/downsampling operations.

  • Architectural Evolution (U-Net, DiT, ControlNet):

    • U-Net: The classic architecture with downsampling/upsampling paths and skip connections remains a standard backbone for many diffusion models, especially in image domains.
    • DiT (Diffusion Transformer): Proposed by [Peebles & Xie, 2023]. Adapts the Transformer architecture for diffusion, operating on latent patches (similar to LDM). It involves patchifying the latent representation, processing the sequence of patches through Transformer blocks, and incorporating conditioning (like timestep \(t\) and class \(c\)) via adaptive layer normalization (adaLN-Zero). DiT benefits from the known scalability of Transformers.
    • ControlNet: Proposed by [Zhang et al., 2023]. A technique for adding fine-grained spatial control (e.g., based on edge maps, human poses, depth maps) to large, pre-trained text-to-image diffusion models without expensive retraining. It works by creating a trainable copy of the model’s weights and connecting it to the original frozen weights via special “zero convolution” layers. These zero-initialized layers allow stable training of the control mechanism while preserving the original model’s capabilities. The output combines the original block’s output with the controlled copy’s output: \(\mathbf{y}_c = \mathcal{F}_\theta(\mathbf{x}) + \mathcal{Z}_{\theta_{z2}}(\mathcal{F}_{\theta_c}(\mathbf{x} + \mathcal{Z}_{\theta_{z1}}(\mathbf{c})))\). A sketch of the zero-convolution mechanism follows this list.
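First, the dynamic thresholding sketch promised above, applied to the predicted \(\mathbf{x}_0\) at each sampling step. The percentile `p = 0.995` follows the Imagen paper; the function name and tensor layout are illustrative assumptions.

```python
def dynamic_threshold(x0_pred: torch.Tensor, p: float = 0.995) -> torch.Tensor:
    # s = p-th percentile of |x0_pred| per sample; if s > 1, clip to [-s, s] and rescale.
    s = torch.quantile(x0_pred.abs().flatten(1), p, dim=1)
    s = s.clamp(min=1.0).view(-1, *([1] * (x0_pred.dim() - 1)))
    return torch.maximum(torch.minimum(x0_pred, s), -s) / s  # back into [-1, 1] without saturating
```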
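Second, a sketch of the zero-convolution mechanism in the ControlNet equation above, where `block` stands in for any module \(\mathcal{F}_\theta\) of the pre-trained network; all names here are illustrative, not the paper's code.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution with weights and bias initialized to zero, so the
    # control branch contributes nothing at the start of training.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    # y_c = F_theta(x) + Z2(F_theta_c(x + Z1(c)))
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.trainable = copy.deepcopy(block)       # trainable copy F_theta_c
        self.frozen = block.requires_grad_(False)   # original weights, locked
        self.zero_in = zero_conv(channels)          # Z1, applied to the condition c
        self.zero_out = zero_conv(channels)         # Z2, applied to the copy's output

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        return self.frozen(x) + self.zero_out(self.trainable(x + self.zero_in(c)))
```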

Summary

Diffusion models represent a powerful and flexible class of generative models based on systematically destroying data structure with noise and then learning to reverse the process.

  • Advantages: They achieve state-of-the-art results in generating high-quality, diverse samples, particularly for images. They benefit from theoretical tractability and relatively stable training compared to alternatives like GANs.
  • Disadvantages: Historically, their main drawback was slow sampling speed, requiring many sequential denoising steps. However, significant advancements (DDIM, LDM, distillation, consistency models) have drastically improved sampling efficiency, making them much more practical.

Fueled by innovations like Classifier-Free Guidance, Latent Diffusion, Transformer-based architectures, and control mechanisms like ControlNet, diffusion models are at the forefront of generative AI, enabling cutting-edge applications in text-to-image synthesis, image editing, video generation, and beyond.

References

  1. Weng, Lilian. (Jul 2021). What are diffusion models? Lil’Log. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
  2. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” NeurIPS 2020. (DDPM)
  3. Song, Jiaming, Chenlin Meng, and Stefano Ermon. “Denoising diffusion implicit models.” ICLR 2021. (DDIM)
  4. Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” CVPR 2022. (Latent Diffusion / Foundation for Stable Diffusion)
  5. Nichol, Alex, and Prafulla Dhariwal. “Improved denoising diffusion probabilistic models.” ICML 2021.
  6. Dhariwal, Prafulla, and Alex Nichol. “Diffusion models beat gans on image synthesis.” NeurIPS 2021.
  7. Ho, Jonathan, and Tim Salimans. “Classifier-free diffusion guidance.” NeurIPS 2021 Workshop.
  8. Salimans, Tim, and Jonathan Ho. “Progressive distillation for fast sampling of diffusion models.” ICLR 2022.
  9. Song, Yang, et al. “Consistency models.” ICML 2023.
  10. Ho, Jonathan, et al. “Cascaded diffusion models for high fidelity image generation.” JMLR 2022.
  11. Ramesh, Aditya, et al. “Hierarchical text-conditional image generation with clip latents.” arXiv 2022. (unCLIP / DALL-E 2)
  12. Saharia, Chitwan, et al. “Photorealistic text-to-image diffusion models with deep language understanding.” NeurIPS 2022. (Imagen)
  13. Peebles, William, and Saining Xie. “Scalable diffusion models with transformers.” ICCV 2023. (DiT)
  14. Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. “Adding conditional control to text-to-image diffusion models.” ICCV 2023. (ControlNet)
  15. Nichol, Alex, et al. “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models.” ICML 2022.
  16. Song, Yang, and Stefano Ermon. “Generative modeling by estimating gradients of the data distribution.” NeurIPS 2019. (NCSN)