DDPM connection to SDEs

In the past few years, diffusion models (Ho, Jain, and Abbeel 2020; Song et al. 2021; Rombach et al. 2022; Brooks et al. 2024) have taken the machine learning community by storm, setting state-of-the-art results in image generation and other creative synthesis tasks. From a statistical view, these models are used to sample from a complex distribution (e.g., images of cats) by iteratively updating a simple distribution (e.g., isotropic Gaussian). However, these models often appear in two different flavors, which can be confusing for newcomers. One line of work defines diffusion models as a forward–backward Markov chain, where data is gradually corrupted by noise in a sequence of discrete steps and then recovered through a learned reverse diffusion process. Another line of research interprets the same idea through continuous stochastic processes described by Stochastic Differential Equations (SDEs), where data flows forward in time under a noising SDE and is restored by solving the corresponding time-reversed SDE. Although these two approaches may seem distinct at first glance, they are in fact closely related: the Markov chain formulation can be seen as a discretization of the continuous SDE framework. In this post, we’ll explore this connection and see how the discrete-time diffusion process converges to a continuous SDE in the limit of infinitely many small steps. First, let’s start by introducing the two definitions of diffusion models.

Diffusion Models with Discrete Markov Chains (DDPMs)

Diffusion models were popularized by Denoising Diffusion Probabilistic Models (DDPMs) (Ho, Jain, and Abbeel 2020). The core idea is to start with a data distribution \(p_0(\boldsymbol{x}_0)\) and define a forward diffusion process that gradually adds noise to the data over a finite sequence of time steps \((\boldsymbol{x}_0, \boldsymbol{x}_1, \ldots, \boldsymbol{x}_T)\). At each step, Gaussian noise is added according to a predefined schedule \(\{\beta_t\}_{t=1}^N\). This process transforms an original data sample \(\boldsymbol{x}_0\) into a nearly pure noise sample \(\boldsymbol{x}_T\). Formally, the forward process is given by:

\[ \boldsymbol{x}_t = \sqrt{1 - \beta_t} \, \boldsymbol{x}_{t-1} + \sqrt{\beta_t} \, \boldsymbol{z}_{t-1}, \quad \boldsymbol{z}_{t-1} \sim \mathcal{N}(0, I), \quad t=1, \ldots, N \]

A neural network \(\hat{\epsilon}(\boldsymbol{x}_{t}, t)\) is then trained to approximate the noise added at a single timestep to iteratively reverse the process. That is, starting from noise \(\boldsymbol{x}_T \sim \mathcal{N}(0, I)\), we iteratively denoise it until we recover a sample \(\boldsymbol{x}_0 \sim p(\boldsymbol{x}_0)\). This reverse procedure is also a Markov chain, but its transition kernels are learned. Once trained, sampling from the model is done by reversing the diffusion process step-by-step:

where we have defined \(\alpha_t := 1 - \beta_t\) and \(\bar{\alpha}_t := \prod_{s=1}^t \alpha_s\) for \(t=1,...,T\). A common choice of hyperparameters in this setting is:

\[ \begin{align*} N &= 1000, \\ \beta_1 &= 10^{-4}, \\ \beta_T &= 0.02, \\ \beta_t &= \beta_{t-1} + \frac{\beta_N - \beta_1}{N}, \quad \text{for } t=2, \ldots, N-1 \end{align*} \]

Diffusion Using Stochastic Differential Equations (SDEs)

An alternative perspective on diffusion models comes from continuous-time formulations, where the noising process is described as a solution to a Stochastic Differential Equation (SDE) (Song et al. 2021). Instead of a finite number of steps, we imagine a continuous time variable \(t \in [0,1]\) that governs the diffusion of clean data into noise:

\[ d\boldsymbol{x} = \boldsymbol{f}(\boldsymbol{x}, t) \, dt + g(t) \, d\boldsymbol{w}, \]

where \(\boldsymbol{w}\) is the standard Wiener process (a.k.a., Brownian motion). If we know the gradient of the distribution of the data with respect to the data (at each \(t\)), i.e., \(\nabla_{\boldsymbol{x}} \log p_t(\boldsymbol{x})\), we can reverse this process:

\[ d\boldsymbol{x} = \left[ \boldsymbol{f}(\boldsymbol{x}, t) - g(t)^2 \nabla_{\boldsymbol{x}} \log p_t(\boldsymbol{x}) \right] dt + g(t) \, d\bar{\boldsymbol{w}}, \]

where \(\bar{\boldsymbol{w}}\) is a standard Wiener process when time flows backwards. Note that \(dt\) in this equation is negative, as we are propagating from \(t = 1\) to \(t = 0\). If we do not know \(\nabla_{\boldsymbol{x}} \log p_t(\boldsymbol{x})\), which is also known as the score function, we can approximate it using a Neural Network.

There are many possible choices for the functions \(\boldsymbol{f}(\boldsymbol{x}, t)\) and \(g(t)\), but we choose \(\boldsymbol{f}(\boldsymbol{x}, t) = -\frac{1}{2} \beta(t) \boldsymbol{x}(t)\) and \(g(t) = \sqrt{\beta(t)}\). We can show that a discretization of the SDE (this particular SDE is known as the VP-SDE) is equivalent to the Markov Chain formulation of the Diffusion Models. In this case, we have the following forward SDE:

\[ d\boldsymbol{x} = -\frac{1}{2} \beta(t) \boldsymbol{x}(t) \, dt + \sqrt{\beta(t)} \, d\boldsymbol{w}, \]

which can be reversed by:

\[ d\boldsymbol{x} = \left[ -\frac{1}{2} \beta(t) \boldsymbol{x}(t) - \beta(t) \nabla_{\boldsymbol{x}} \log p_t(\boldsymbol{x}) \right] dt + \sqrt{\beta(t)} \, d\bar{\boldsymbol{w}}. \]

For the two formulations to be “equivalent,” we need to set \(\beta(t) := \bar{\beta}_{t+\Delta t} := N \beta_{tN+1}\) at the points of discretization. If we use the same hyperparameters as above and discretize the SDE at the same points as the Markov Chain, then we get the following discretization of \(\bar{\beta}_t\):

\[ \begin{align*} \bar{\beta}_1 &= 0.1, \\ \bar{\beta}_T &= 20, \\ \bar{\beta}_t &= \bar{\beta}_{t-1} + \frac{\bar{\beta}_N - \bar{\beta}_1}{N}, \quad \text{for } t=2, \ldots, N-1 \end{align*} \]

Understanding SDEs

To build an intuitive understanding of stochastic differential equations, it helps to get our hands dirty with a straightforward numerical experiment. The idea is to think of an SDE as describing a continuous-time process that evolves with both a deterministic trend (the “drift”) and a random “noise” component that keeps shaking things up as time goes by. This might sound abstract, but we can make it more concrete by actually simulating a path of a simple SDE using the simplest and most intuitive method: the Euler–Maruyama method. To simulate the forward diffusion using the VP-SDE we have formulated above, we do the following:

Set the timestep \(\Delta t = 1/N = 0.001\).
Take a sample \(\boldsymbol{x}_0\) from our dataset.
For \(t = \Delta t, 2\Delta t, \ldots, 1\) do
- Set \(\boldsymbol{x}_{t} = \boldsymbol{x}_{t-\Delta t} - \frac{1}{2} \bar{\beta}_t \boldsymbol{x}_{t-\Delta t} \Delta t + \sqrt{\bar{\beta}_t \Delta t} \boldsymbol{z}_{t-\Delta t}\) where \(\boldsymbol{z}_{t-\Delta t} \sim \mathcal{N}(0, I)\).

To reverse this process (i.e., generate samples that could exist in our dataset):

Sample \(\boldsymbol{x}_1 \sim \mathcal{N}(0, I)\) from our dataset.
For \(t = 1, (1 - \Delta t), (1 - 2\Delta t), \ldots, \Delta t\) do
- Set \(\boldsymbol{x}_{t} = \boldsymbol{x}_{t+\Delta t} + \left[ \frac{1}{2} \bar{\beta}_{t} \boldsymbol{x}_{t+\Delta t} + \bar{\beta}_{t} \nabla_{\boldsymbol{x}} \log p_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) \right] \Delta t + \sqrt{\bar{\beta}_{t} \Delta t} \, \boldsymbol{z}_{t+\Delta t},\) where \(\boldsymbol{z}_{t+\Delta t} \sim \mathcal{N}(0, I)\).

Showing the equivalence

To show that a discretisation of the SDE is equivalent to the Markov Chain (which is also shown in the appendix of (Song et al. 2021)), we first show that the forward processes are equivalent and then that the reverse processes are equivalent as \(\Delta t \to 0\) (which is equivalent to \(N \to \infty\)).

Equivalence of Forward Processes

We need to show that:

\[ \boldsymbol{x}_t = \sqrt{1 - \beta_t} \, \boldsymbol{x}_{t-1} + \sqrt{\beta_t} \, \boldsymbol{z}_{t-1} \iff \boldsymbol{x}_{t} = \boldsymbol{x}_{t-1} - \frac{1}{2} \bar{\beta}_t \boldsymbol{x}_{t-1} \Delta t + \sqrt{\bar{\beta}_t \Delta t} \, \boldsymbol{z}_{t-1}, \tag{1} \label{eq:equiv_1} \]

as \(\Delta t \to 0\), where we use \(\boldsymbol{x}_{t-1}\) rather than \(\boldsymbol{x}_{t-\Delta t}\) for notational convenience. To show that the equivalence holds, we use the Taylor approximation of \(f(x) = \sqrt{1 - x}\) around \(x = 0\):

\[ \sqrt{1 - x} = 1 - \frac{x}{2} - \frac{x^2}{8} - \ldots \]

If we set \(x = \bar{\beta}_t \Delta t\), we have:

\[ \sqrt{1 - \bar{\beta}_t \Delta t} = 1 - \frac{\bar{\beta}_t \Delta t}{2} - \frac{\bar{\beta}_t^2 \Delta t^2}{8} - \ldots \]

and as \(\Delta t\) approaches 0, the approximation \(\sqrt{1 - \bar{\beta}_t \Delta t} \approx 1 - \frac{\bar{\beta}_t \Delta t}{2}\) improves. Then, starting from the left-hand side of Equivalence (\(\ref{eq:equiv_1}\)):

\[ \begin{align*} \boldsymbol{x}_t &= \sqrt{1 - \beta_t} \, \boldsymbol{x}_{t-1} + \sqrt{\beta_t} \, \boldsymbol{z}_{t-1} \\ \iff \boldsymbol{x}_t &= \sqrt{1 - \bar{\beta}_t \Delta t} \, \boldsymbol{x}_{t-1} + \sqrt{\bar{\beta}_t \Delta t} \, \boldsymbol{z}_{t-1}, \quad \text{by the definition of } \bar{\beta}_t \\ \iff \boldsymbol{x}_t &= \left(1 - \frac{\bar{\beta}_t \Delta t}{2}\right) \boldsymbol{x}_{t-1} + \sqrt{\bar{\beta}_t \Delta t} \, \boldsymbol{z}_{t-1}, \quad \text{as } \Delta t \to 0 \\ \iff \boldsymbol{x}_{t} &= \boldsymbol{x}_{t-1} - \frac{1}{2} \bar{\beta}_t \boldsymbol{x}_{t-1} \Delta t + \sqrt{\bar{\beta}_t \Delta t} \, \boldsymbol{z}_{t-1} \end{align*} \]

Equivalence of Reverse Processes

We need to show that:

\[ \boldsymbol{x}_t=\frac{1}{\sqrt{1-\beta_t}}\left(\boldsymbol{x}_{t+1}-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\epsilon}(\boldsymbol{x}_{t+1},t+1)\right) \iff \boldsymbol{x}_{t} = \boldsymbol{x}_{t+1} + \left[\frac{1}{2}\bar{\beta}_{t} \boldsymbol{x}_{t+1} + \bar{\beta}_{t}\nabla_{\boldsymbol{x}}\log p_{t+1}(\boldsymbol{x}_{t+1})\right]\Delta t, \tag{2} \label{eq:equiv_2} \]

as \(\Delta t \to 0\) where we use \(\boldsymbol{x}_{t+1}\) rather than \(\boldsymbol{x}_{t+\Delta t}\) for notational convenience and have omitted the noise terms for brevity (as we have already shown their equivalence in the previous section). To show that the equivalence holds we use the Taylor approximation of \(g(x)=\frac{1}{\sqrt{1-x}}\) around \(x=0\):

\[\frac{1}{\sqrt{1-x}}=1+\frac{x}{2}+\frac{3x^2}{8}+...\]

If we set \(x=\bar{\beta_t}\Delta t\) we have:

\[\frac{1}{\sqrt{1-\bar{\beta_t}\Delta t}}=1+\frac{\bar{\beta_t}\Delta t}{2}+\frac{3\bar{\beta_t}^2\Delta t^2}{8}+...\]

and as \(\Delta t\) approaches 0 the approximation \(\frac{1}{\sqrt{1-\bar{\beta_t\Delta t}}}\approx 1+\frac{\bar{\beta_t}\Delta t}{2}\) improves. Then, starting from the left hand side of equivalence (\(\ref{eq:equiv_2}\)):

\[ \begin{align*} \boldsymbol{x}_t&=\frac{1}{\sqrt{1-\beta_t}}\left(\boldsymbol{x}_{t+1}-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\epsilon}(\boldsymbol{x}_{t+1},t+1)\right) \\ \iff \boldsymbol{x}_t &=\frac{1}{\sqrt{1-\bar{\beta}_t\Delta t}}\left(\boldsymbol{x}_{t+1}-\frac{\bar{\beta}_t\Delta t}{\sqrt{1-\bar{\alpha}_t}}\hat{\epsilon}(\boldsymbol{x}_{t+1},t+1)\right), \text{ by the definition of } \bar{\beta}_t \\ \iff \boldsymbol{x}_t &= \left(1+\frac{\bar{\beta_t}\Delta t}{2}\right)\left(\boldsymbol{x}_{t+1}-\frac{\bar{\beta}_t\Delta t}{\sqrt{1-\bar{\alpha}_t}}\hat{\epsilon}(\boldsymbol{x}_{t+1},t+1)\right), \quad \text{ as } \Delta t \to 0 \\ \iff \boldsymbol{x}_t &= \boldsymbol{x}_{t+1} + \frac{\bar{\beta}_t \Delta t}{2} \boldsymbol{x}_{t+1} - \frac{\bar{\beta}_t \Delta t}{\sqrt{1 - \bar{\alpha}_t}} \hat{\epsilon}(\boldsymbol{x}_{t+1}, t+1) - \frac{(\bar{\beta}_t \Delta t)^2}{2 \sqrt{1 - \bar{\alpha}_t}} \hat{\epsilon}(\boldsymbol{x}_{t+1}, t+1) \\ \iff \boldsymbol{x}_t &= \boldsymbol{x}_{t+1} + \frac{\bar{\beta}_t \Delta t}{2} \boldsymbol{x}_{t+1} - \frac{\bar{\beta}_t \Delta t}{\sqrt{1 - \bar{\alpha}_t}} \hat{\epsilon}(\boldsymbol{x}_{t+1}, t+1), \quad \text{ as } \left(\Delta t\right)^2\approx 0 \\ \iff \boldsymbol{x}_t &= \boldsymbol{x}_{t+1} + \left(\frac{1}{2}\bar{\beta}_t\boldsymbol{x}_{t+1}-\bar{\beta}_t\frac{\hat{\epsilon}(\boldsymbol{x}_{t+1}, t+1)}{\sqrt{1 - \bar{\alpha}_t}} \right)\Delta t \end{align*} \]

Therefore, when the Neural Network that estimates the noise is:

\[\hat{\epsilon}(\boldsymbol{x}_{t+1}, t+1)=-\sqrt{1 - \bar{\alpha}_t}\nabla_{\boldsymbol{x}}\log p_{t+1}(\boldsymbol{x}_{t+1}),\]

the discretisation of the reverse diffusion of the SDE is equivalent to the reverse Markov Chain. In practise, we train neural networks for each of the two formulations and can divide the neural network of the DDPM model by \(-\sqrt{1 - \bar{\alpha}_t}\) to obtain an approximation of the score function.

References

Brooks, Tim, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, et al. 2024. “Video Generation Models as World Simulators.” https://openai.com/research/video-generation-models-as-world-simulators.

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” https://arxiv.org/abs/2006.11239.

Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. “High-Resolution Image Synthesis with Latent Diffusion Models.” https://arxiv.org/abs/2112.10752.

Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. “Score-Based Generative Modeling Through Stochastic Differential Equations.” https://arxiv.org/abs/2011.13456.