[AIGC Note] Stable Diffusion reading group week 1 notes

Simple usage of diffusion models and image generation

Posted by Jamie on Wednesday, October 18, 2023

week-1

Diffusion model notes

What is a diffusion model?

If you drop a single drop of ink into a glass of water, calculating how the ink diffuses at the very beginning is extremely difficult. But after some time, when the ink has spread evenly through the water, its probability distribution becomes very simple to describe. A diffusion model is built on this kind of shift from a complex distribution to a simple, easy-to-describe one, and then “reverses” it: a model is trained to step backward and recover the original data.

The best-known early diffusion model is DDPM (Denoising Diffusion Probabilistic Models). It assumes the forward diffusion process is a Markov process (the distribution at each time step is obtained from the state of the previous time step plus Gaussian noise added at the current step), and that each step of the reverse process is also a Gaussian distribution.

TODO: math equations
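Until those are written up, the standard DDPM formulation can serve as a quick reference: the forward (noising) step, the learned reverse (denoising) step, and the simplified training objective from the original paper.

    q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)

    p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)

    L_\text{simple} = \mathbb{E}_{t,\, x_0,\, \epsilon}\big[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \,\big],
    \quad \text{where } x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon
    \ \text{ and }\ \bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)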

How does the diffusion model differ from other models?

Here is what I have found so far:

  • GANs (Generative Adversarial Networks): GANs consist of two parts: a generator and a discriminator. The generator produces “fake images”, and the discriminator distinguishes real images from fake ones. Through this adversarial process, the generator gradually learns to produce more and more realistic images.
  • Flow-based models: The core idea is to construct an invertible transformation that maps a known simple distribution (such as a Gaussian distribution) to a complex data distribution. Because the transformation is invertible, you can easily convert between the two distributions in either direction. One important feature of this model is that it can directly compute the exact probability of the generated data.
  • VAEs (Variational Autoencoders): They use “variational inference” to learn a representation of an image and learn how to use that representation to generate images.
  • Auto-Regressive Models: This is more of a category, like PixelRNN and PixelCNN, which generate images one pixel at a time and use the previous pixels to generate the next ones.
  • Transformer-based Models: Mainly inspired by the transformer in NLP, also used to train pre-trained models that generate images (such as OpenAI’s DALL·E series).

The main advantages of diffusion models are the ability to generate sufficiently high-quality images and their interpretability: because the process is a chain of Gaussian steps in a Markov chain, they allow for better probability estimation and a clearer interpretation of the model.

How do you use someone else’s model to generate images?

One option is to directly use a service someone else has built, and another is to use an open-source model someone has shared on HuggingFace to do the generation.

Here is a simple implementation in a Colab environment using Stability AI’s model:

  • Install dependencies:

    !pip install -qq -U diffusers datasets transformers accelerate ftfy pyarrow==9.0.0 gradio
    
  • Log in to HuggingFace

    from huggingface_hub import notebook_login
    
    notebook_login()
    
  • Import libraries and select a model

    from PIL import Image
    import numpy as np
    import torch
    from diffusers import StableDiffusionPipeline
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Use the stabilityai model
    model_id = "stabilityai/stable-diffusion-2"
    
    # Load the pipeline
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
    
  • Set the Prompt and generate an image

    prompt = "an apple in the outer space"
    image = pipe(prompt, num_inference_steps=100, guidance_scale=7.5).images[0]
    
    # num_inference_steps controls the number of diffusion steps. More steps generally means better image quality, but slower generation.
    # guidance_scale controls how much the prompt guides the image generation.
    image
    

In the example above, the key part is pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(device), which loads a StableDiffusion-type pre-trained model.

How to train your own diffusion model

Simple process

  • Building and training the model
    • Use UNet2DModel to build a U-Net model. This is a deep learning architecture used for image denoising and image segmentation.
    • Set up DDPMScheduler to manage the noise addition.
    • Train the model in a loop, letting the model try to recover the original image from a noisy image (a minimal sketch of this loop appears after this list).
  • Using the model
    • Once trained, the model can take a specific image, add noise to it, and then try to recover the image from the noise.
    • The model’s “prediction result” is the image the model recovers from the noisy image.
  • Saving and loading the model
    • Use DDPMPipeline and the huggingface_hub tools to save the trained model to the Hugging Face Hub.
    • Then you can load the model from the Hugging Face Hub and use it to generate new images.
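As a rough illustration of the steps above, here is a minimal training-loop sketch using diffusers’ UNet2DModel, DDPMScheduler, and DDPMPipeline. The model size, hyperparameters, the dataloader variable, and the output directory are placeholders rather than the reading group’s actual setup:

    import torch
    import torch.nn.functional as F
    from diffusers import UNet2DModel, DDPMScheduler, DDPMPipeline

    # A small U-Net for 32x32 RGB images (sizes here are illustrative)
    model = UNet2DModel(
        sample_size=32,
        in_channels=3,
        out_channels=3,
        layers_per_block=2,
        block_out_channels=(64, 128, 256),
        down_block_types=("DownBlock2D", "DownBlock2D", "AttnDownBlock2D"),
        up_block_types=("AttnUpBlock2D", "UpBlock2D", "UpBlock2D"),
    )

    # The scheduler manages how noise is added across the timesteps
    noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # `dataloader` is assumed to yield batches of image tensors scaled to [-1, 1]
    for clean_images in dataloader:
        noise = torch.randn_like(clean_images)
        timesteps = torch.randint(
            0, noise_scheduler.config.num_train_timesteps, (clean_images.shape[0],)
        )

        # Forward process: produce noisy images at the sampled timesteps
        noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)

        # The model learns to predict the noise that was added
        noise_pred = model(noisy_images, timesteps).sample
        loss = F.mse_loss(noise_pred, noise)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Wrap model + scheduler in a pipeline so it can be saved or pushed to the Hub
    pipeline = DDPMPipeline(unet=model, scheduler=noise_scheduler)
    pipeline.save_pretrained("my-ddpm-model")  # placeholder output directory

Note the design choice: the network does not predict the clean image directly; it predicts the noise that was added, and the scheduler turns that prediction back into a less noisy image.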

In particular, “using a prompt to generate an image” works by giving the model a random-noise “starting point” image and then iterating many times: at each step the model predicts the noise in the current image, part of that noise is removed, and the result is fed back in as the input to the next step (with the prompt steering each denoising step).
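Continuing the sketch above (this is the unconditional case, without a text prompt), the generation loop looks roughly like this:

    import torch

    # Start from pure Gaussian noise, then denoise step by step
    sample = torch.randn(1, 3, 32, 32)
    noise_scheduler.set_timesteps(50)

    for t in noise_scheduler.timesteps:
        with torch.no_grad():
            # Predict the noise in the current image
            noise_pred = model(sample, t).sample
        # The scheduler uses that prediction to compute a slightly less noisy image,
        # which becomes the input to the next step
        sample = noise_scheduler.step(noise_pred, t, sample).prev_sample

Text-conditional pipelines such as the StableDiffusionPipeline used earlier add a text encoder and cross-attention on top of this same denoising loop.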

Colab demo

TODO

ChangeLog

  • 20231018-init
  • 20260501-translated by Claude Code