Stable Diffusion
Introduction
A diffusion process is a physical process in which particles of one substance gradually spread through another, for example, milk diffusing into coffee. This idea is applied in a class of deep learning models, of which Stable Diffusion is one, to generate new images. The model involves two processes: a diffusion (forward) process and a denoising (reverse) process. In the diffusion process, the model starts with original images and gradually adds Gaussian noise over a series of time steps. This slow corruption chips away the original image's structure and detail, turning it into a simpler distribution; at the end, the image is pure noise following a simple Gaussian distribution. The denoising process is the reverse, and it is used to generate new samples. It starts from a sample of the Gaussian distribution (a noisy image) and uses a neural network to trace out a reverse path, gradually removing the noise that was supposedly added during the diffusion process. If successful, this denoising process transforms the Gaussian blob back into structured data, and the same neural network can then be used to generate a new image from random noise. The generated image looks realistic, even though it is hallucinated, because the network has learned the underlying patterns of real images. Training the network involves comparing the denoised data to the original data at each step and adjusting the network's parameters to minimize the difference.
Diffusion model
Given a data distribution q(x_0), the forward process produces a sequence of increasingly noisy versions x_1, ..., x_T of a data point x_0 ~ q(x_0).
The first step is to generate a sequence of beta values, one for each time step t. This variance schedule β_1, ..., β_T is a function of time t and sets the variance of the Gaussian noise added to the data at each time step t. Given a large T and a good schedule of β_t, the final x_T is close to an isotropic Gaussian N(0, I).
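As an illustration, here is a simple linear schedule in NumPy. The endpoints 1e-4 and 0.02 below are one common choice (used in the original DDPM paper); other schedules such as cosine schedules also exist.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T."""
    return np.linspace(beta_start, beta_end, T)

T = 1000
betas = linear_beta_schedule(T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)   # cumulative products, used for closed-form sampling

# With a good schedule, almost all of the signal is destroyed by step T:
print(alpha_bars[-1])             # very close to 0
```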
After having the schedule, each forward step is defined as

q(x_t | x_{t-1}) = N(x_t; sqrt(1 - β_t) x_{t-1}, β_t I)

Each step from x_{t-1} to x_t therefore adds a small amount of Gaussian noise. A convenient property is that x_t can be sampled directly from x_0 in closed form. Writing α_t = 1 - β_t and ᾱ_t = ∏_{s=1}^{t} α_s,

q(x_t | x_0) = N(x_t; sqrt(ᾱ_t) x_0, (1 - ᾱ_t) I)
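Thanks to this closed form, a training example can be noised to any step t in one shot, without simulating the whole chain. A minimal sketch (assuming the linear schedule above and a Fashion-MNIST-sized 28x28 image):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) directly, using the closed form."""
    eps = rng.standard_normal(x0.shape)   # the noise the network learns to predict
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal((28, 28))        # stand-in for one training image
x_early = q_sample(x0, t=10, rng=rng)     # mostly signal
x_late = q_sample(x0, t=999, rng=rng)     # essentially pure noise
```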
If we knew the reverse distribution q(x_{t-1} | x_t), we could start from Gaussian noise and run the process backward to recover a data sample; unfortunately it is intractable, so we approximate it with a neural network.
To denoise, first we define the starting distribution p(x_T) = N(x_T; 0, I) and a learned Gaussian transition

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

Second, we can calculate the backward process as follows:

p_θ(x_{0:T}) = p(x_T) ∏_{t=1}^{T} p_θ(x_{t-1} | x_t)
So the beta sequence is traversed in reverse, from high beta values down to lower ones. The neural network is trained to maximize the log-likelihood of the training data under the model (in practice, a variational bound on it). Once trained, the network is used to generate new samples: starting from Gaussian noise, the denoising transformation is applied step by step, and the unstructured noise gradually turns back into structured data resembling the original data.
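The generation loop can be sketched as follows. The noise predictor eps_model below is a hypothetical placeholder (it just returns zeros) so the sketch runs end to end; in a real model it would be a trained UNet, and the update rule is the standard DDPM ancestral sampling step with σ_t² = β_t.

```python
import numpy as np

T = 50                                # only a few steps, for illustration
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x, t):
    """Placeholder noise predictor; a real one is a trained UNet."""
    return np.zeros_like(x)

def sample(shape, rng):
    x = rng.standard_normal(shape)    # start from pure Gaussian noise x_T
    for t in reversed(range(T)):
        eps = eps_model(x, t)
        # Posterior mean: subtract the predicted noise, then rescale.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                     # add fresh noise at every step except the last
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

img = sample((28, 28), np.random.default_rng(0))
```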
Latent diffusion model
Since diffusion models typically operate directly in pixel space, they are computationally expensive, with training usually taking hundreds of GPU days. The latent diffusion model was born to address this computational cost while retaining the quality and flexibility of the original model. In a latent diffusion model, a pretrained autoencoder maps the data into a latent space, and the diffusion process runs there. Since most of the computation happens in this reduced-dimensional space, a great amount of computation is saved.
There are two phases in a latent diffusion model. In the first, an autoencoder is trained to provide a lower-dimensional representational space. This space is more efficient since it has far fewer dimensions while preserving most of the perceptually relevant information. Then the diffusion model is trained in the learned latent space and can be reused for multiple tasks.
The adoption of this method offers many advantages over the original diffusion models. Firstly, it increases computational efficiency: by moving the operation from the high-dimensional image space to a lower-dimensional latent space, diffusion models gain significant computational efficiency, enabling larger datasets and more complex tasks. Secondly, it increases effectiveness on spatially structured data: using an architecture such as the UNet, the model can understand and exploit spatial relationships within the data, making it particularly well suited to images and other spatial data. Thirdly, the model is inherently a general-purpose compression scheme: beyond the standard use cases, the learned latent space can be reused to train multiple generative models for multiple purposes, enhancing its utility and versatility. Fourthly, more downstream tasks become available, such as single-image CLIP-guided synthesis; the output of this system can be fed into another system for further processing or analysis, extending the machine learning workflow.
The perceptual compression stage is an autoencoder trained with a perceptual loss and a patch-based adversarial objective, which enforces local realism and avoids blurriness. Given an image x in RGB space, the encoder E encodes x into a latent representation z = E(x), and the decoder D reconstructs the image from the latent representation, x̃ = D(z) = D(E(x)).
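To make the compression concrete, here is an illustrative shape-level sketch. The 8x downsampling factor and the 4 latent channels are assumptions for illustration, and the random linear maps below merely stand in for a trained encoder E and decoder D; the point is only the dimension reduction.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 64, 64, 3          # pixel space
h, w, c = 8, 8, 4            # latent space (assumed 8x downsampling, 4 channels)

# Random projections as stand-ins for a trained encoder E and decoder D.
E = rng.standard_normal((h * w * c, H * W * C)) * 0.01
D = rng.standard_normal((H * W * C, h * w * c)) * 0.01

x = rng.standard_normal(H * W * C)   # a flattened "image"
z = E @ x                            # z = E(x): latent representation
x_rec = D @ z                        # D(z): reconstruction (meaningless here, untrained)

print(x.size, "->", z.size)          # 12288 -> 256, a 48x reduction in dimensions
```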
The objective function of a latent diffusion model then becomes:

L_LDM = E_{E(x), ε ~ N(0, 1), t} [ ||ε - ε_θ(z_t, t)||² ]
We can also do conditional generation by turning the denoiser into a conditional noise predictor ε_θ(z_t, t, y), where the condition y can be, for example, a text prompt or another image,
with y mapped to an intermediate representation τ_θ(y) by a domain-specific encoder τ_θ.
The objective function for a conditional latent diffusion model (LDM) is:

L_LDM = E_{E(x), y, ε ~ N(0, 1), t} [ ||ε - ε_θ(z_t, t, τ_θ(y))||² ]
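One training step of this objective can be sketched numerically. Every learned quantity here (the latent z, the noise predictor eps_theta, the condition encoding tau_y) is a random or zero placeholder; the sketch only shows how the pieces combine into the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(z_t, t, tau_y):
    """Placeholder conditional noise predictor; a real one is a UNet
    attending to the condition encoding tau_y."""
    return np.zeros_like(z_t)

z = rng.standard_normal(256)       # z = E(x): latent code of an image (placeholder)
tau_y = rng.standard_normal(77)    # tau_theta(y): encoded condition, e.g. a prompt (placeholder)
t = rng.integers(T)                # uniformly sampled time step
eps = rng.standard_normal(z.shape) # target noise

# Noise the latent to step t using the closed form, then score the prediction.
z_t = np.sqrt(alpha_bars[t]) * z + np.sqrt(1.0 - alpha_bars[t]) * eps
loss = np.mean((eps - eps_theta(z_t, t, tau_y)) ** 2)   # ||eps - eps_theta(z_t, t, tau(y))||^2
```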
Example
Here are some images from the Fashion-MNIST dataset that can be used for training the diffusion model.
Here are some results from the generative module of the diffusion model.
Conclusion
In the realm of generative models, diffusion models represent a new frontier of research. They allow the synthesis of high-fidelity data samples such as images and other modalities, while maintaining a probabilistic nature that supports variety and creativity. The transition from a high-dimensional pixel space to a more manageable latent space in latent diffusion models is a promising advancement. The boundaries of generative models will continue to be pushed in the near future, since so many applications and downstream tasks benefit from them, and we can expect more and more innovative applications and advancements in this field.