week-1
Diffusion model notes
What is a diffusion model?
If you drop a single drop of ink into a glass of water, calculating how the ink diffuses at the very beginning is extremely difficult. But after some time, when the ink reaches a uniform distribution in the water, calculating the probability distribution of the ink in the water becomes very simple. The diffusion model is built on this kind of gradual move from a complex distribution to a simple one, and then "reverses" it, training a model to recover the original data step by step.
The earliest diffusion model is DDPM (Denoising Diffusion Probabilistic Model), which assumes that the diffusion process is a Markov process (the probability distribution at each time step is obtained from the state of the previous time step plus Gaussian noise added at the current time step), and the reverse process is also a Gaussian distribution.
TODO: math equations
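As a sketch, the standard DDPM formulation (following the original DDPM paper, with noise schedule β_t) is:

```latex
% Forward (noising) step: a Markov chain of Gaussian perturbations
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

% Closed form for jumping straight from x_0 to x_t,
% with \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)

% Reverse (denoising) step, also assumed Gaussian, with mean and
% covariance predicted by the learned model
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```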
How does the diffusion model differ from other models?
Here is what I have found so far:
- GANs (Generative Adversarial Networks): GANs consist of two parts: a generator and a discriminator. The generator produces "fake images", and the discriminator distinguishes real images from fake ones. Through this adversarial process, the generator gradually learns to produce more and more realistic images.
- Flow-based models: The core idea is to construct an invertible transformation that maps a known simple distribution (such as a Gaussian distribution) to a complex data distribution. Because the transformation is invertible, you can easily convert between the two distributions in either direction. One important feature of this model is that it can directly compute the exact probability of the generated data.
- VAEs (Variational Autoencoders): They use “variational inference” to learn a representation of an image and learn how to use that representation to generate images.
- Auto-Regressive Models: This is more of a category, like PixelRNN and PixelCNN, which generate images one pixel at a time and use the previous pixels to generate the next ones.
- Transformer-based Models: Mainly inspired by the transformer in NLP, also used to train pre-trained models that generate images (such as OpenAI's DALL·E series).
The main advantages of the diffusion model are "the ability to generate images of high enough quality" and "interpretability". Since it builds the process from Gaussian transitions in a Markov chain, it allows for tractable probability estimation and a clearer interpretation of what the model is doing at each step.
How do you use someone else’s model to generate images?
One option is to directly use a service someone else has built, and another is to use an open-source model someone has shared on HuggingFace to do the generation.
Here is a simple implementation in a Colab environment using Stability AI's model:
Install dependencies:
```shell
!pip install -qq -U diffusers datasets transformers accelerate ftfy pyarrow==9.0.0 gradio
```

Log in to HuggingFace:
```python
from huggingface_hub import notebook_login

notebook_login()
```

Import libraries and select a model:
```python
from PIL import Image
import numpy as np
import torch
from diffusers import StableDiffusionPipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Use the stabilityai model
model_id = "stabilityai/stable-diffusion-2"

# Load the pipeline
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(
    device
)
```

Set the prompt and generate an image:
```python
prompt = "an apple in the outer space"
# num_inference_steps controls the number of diffusion steps; more steps
# generally means better image quality at the cost of generation time.
# guidance_scale controls how strongly the prompt guides the image generation.
image = pipe(prompt, num_inference_steps=100, guidance_scale=7.5).images[0]
image
```
In the example above, the key part is `pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(device)`, which loads a Stable Diffusion pre-trained model.
How to train your own diffusion model
Simple process
- Building and training the model
  - Use `UNet2DModel` to build a U-Net model. This is a deep learning architecture used for image denoising and image segmentation.
  - Set up `DDPMScheduler` to manage the noise addition.
  - Train the model in a loop, letting the model try to recover the original image from a noisy image.
- Using the model
  - Once trained, the model can take a specific image, add noise to it, and then try to recover the image from the noise.
  - The model's "prediction result" is the image the model recovers from the noisy image.
- Saving and loading the model
  - Use `DDPMPipeline` and the `huggingface_hub` tools to save the trained model to the Hugging Face Hub.
  - Then you can load the model from the Hugging Face Hub and use it to generate new images.
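To make the noise-addition step concrete, here is a minimal numpy sketch of what the scheduler does during training, assuming the DDPM paper's default linear β schedule (1e-4 to 0.02 over 1000 steps); `DDPMScheduler.add_noise` implements the same idea properly:

```python
import numpy as np

# Linear beta schedule, as in the original DDPM paper (an assumption here).
num_train_timesteps = 1000
betas = np.linspace(1e-4, 0.02, num_train_timesteps)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product, written ᾱ_t

def add_noise(x0, noise, t):
    """q(x_t | x_0): mix the clean image with Gaussian noise at step t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))  # stand-in for a "clean image"
noise = rng.standard_normal(x0.shape)

x_small_t = add_noise(x0, noise, 10)    # barely noised, still close to x0
x_large_t = add_noise(x0, noise, 999)   # almost pure noise
```

During training, the model receives `x_small_t`/`x_large_t`-style inputs together with the timestep and learns to predict the `noise` that was added.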
In particular, "using a prompt to generate an image" works by giving the model a random-noise image as the starting point and iterating many times: at each step the model predicts and removes some of the noise, and the result is fed back in as the input for the next step.
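That iterative loop can be sketched as follows (a hypothetical numpy illustration assuming the DDPM paper's linear β schedule; `fake_noise_predictor` is a zero-returning stand-in for the trained U-Net):

```python
import numpy as np

# Same assumed linear beta schedule as in the DDPM paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def fake_noise_predictor(x_t, t):
    # A real trained model would predict the noise present in x_t;
    # here we just return zeros so the sketch runs on its own.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 32, 32))  # the "random starting point" image

for t in range(T - 1, -1, -1):
    eps = fake_noise_predictor(x, t)
    # Remove the predicted noise (the DDPM posterior mean)...
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    # ...then add a small amount of fresh noise, except at the final step.
    if t > 0:
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
```

With a real noise predictor, `x` gradually turns into a clean image; the prompt steers what the model predicts at each step.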
colab demo
TODO
ChangeLog
- 20231018-init
- 20260501-translated by claude code