week-3
Stable Diffusion features
Latent Diffusion
Since compute cost scales steeply with image resolution (especially in the self-attention layers), latent diffusion uses a separate model, a VAE (Variational Auto-Encoder), to compress images. The VAE's encoder maps an image into a lower-dimensional "latent representation" on which the UNet runs the diffusion process, and the VAE's decoder then recovers an image at the original resolution. In Stable Diffusion this shrinks a 512×512×3 image to a 64×64×4 latent.
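A minimal sketch of the compression idea, using a toy pure-PyTorch autoencoder (a hypothetical stand-in, not the real SD VAE architecture or weights): three stride-2 convolutions give the same 8× per-side spatial compression that Stable Diffusion's VAE applies.

```python
import torch
import torch.nn as nn

# Toy stand-in for the SD VAE (hypothetical architecture, not the real model):
# three stride-2 convolutions give 8x spatial compression, mirroring how
# Stable Diffusion's VAE turns a 512x512x3 image into a 64x64x4 latent.
class ToyVAE(nn.Module):
    def __init__(self, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_channels, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)          # compressed latent for the UNet
        return self.decoder(z), z    # reconstruction at the original resolution

vae = ToyVAE()
image = torch.randn(1, 3, 512, 512)
recon, latent = vae(image)
print(latent.shape)  # torch.Size([1, 4, 64, 64])
print(recon.shape)   # torch.Size([1, 3, 512, 512])
```

The UNet only ever sees the 64×64×4 latent, which is why the diffusion loop is so much cheaper than running it on raw pixels.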
Text-conditioned generation
Stable Diffusion uses the pre-trained text encoder (a Transformer) from CLIP. CLIP was trained to map text and images into a shared embedding space, so a caption's feature vector can be compared with image feature vectors for similarity. In Stable Diffusion, the text encoder's per-token outputs are fed to the UNet through cross-attention, steering generation toward the prompt.
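A tiny sketch of the similarity comparison CLIP was trained for, using made-up 4-d embeddings (real CLIP embeddings are 512- or 768-dimensional; the vectors here are hypothetical):

```python
import numpy as np

# Hypothetical low-dimensional embeddings standing in for CLIP's outputs.
text_emb = np.array([0.2, 0.9, 0.1, 0.4])    # "a photo of a cat"
cat_img  = np.array([0.25, 0.85, 0.05, 0.5]) # image of a cat
car_img  = np.array([0.9, 0.1, 0.8, 0.2])    # image of a car

def cosine_sim(a, b):
    # CLIP compares L2-normalized embeddings with a dot product,
    # i.e. cosine similarity.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim(text_emb, cat_img) > cosine_sim(text_emb, car_img))  # True
```

The cat image scores higher against the caption than the car image does, which is exactly the ranking behavior CLIP's contrastive training produces.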
Classifier-Free Guidance
During generation, the model sometimes leans more on the noisy image than on the text prompt when making its predictions. This happens mainly because the text embedding may be only weakly correlated with the image, so the model falls back on the noisy image to predict the output.
There is a small trick for this: Classifier-Free Guidance (CFG). During training, the text condition is randomly replaced with an empty prompt some fraction of the time, so the model also learns to make unguided predictions. At inference, the model runs both cases: "guided" (with the text prompt) and "unguided" (with the empty prompt). By extrapolating from the unguided prediction toward the guided one with a tunable guidance scale, the model produces images that match the description more closely.
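The combination step above is just a linear extrapolation between the two noise predictions. A sketch with random stand-in values (in practice these come from two UNet forward passes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the UNet's two noise predictions at one denoising step
# (hypothetical values; really these are two forward passes, one with the
# prompt embedding and one with the empty-prompt embedding).
eps_uncond = rng.normal(size=4)  # prediction with the empty prompt
eps_cond   = rng.normal(size=4)  # prediction with the text prompt

def cfg(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: start from the unconditional prediction and
    # push past the conditional one. scale=1 means no extra guidance;
    # ~7.5 is a common Stable Diffusion default.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

print(cfg(eps_uncond, eps_cond, 7.5))
```

Larger guidance scales follow the prompt more closely but tend to reduce image diversity and can oversaturate the result.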
Other types of conditional generation models
- Img2Img
- Used for image-to-image translation, such as style transfer or image super-resolution.
- Inpainting
- Image restoration. Basically, it predicts the masked region from the unmasked area of the image.
- Depth2Img
- Conditions generation on a depth map estimated from the input image, so the output keeps the input's spatial structure while its appearance changes.
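For inpainting, a common trick is to composite at every denoising step: keep the known pixels outside the mask and let the model's prediction fill only the masked region. A minimal sketch of that composite with hypothetical arrays:

```python
import numpy as np

# Hypothetical 4x4 single-channel "latents" for one denoising step.
original  = np.ones((4, 4))        # known image content
predicted = np.full((4, 4), 0.5)   # model's denoised estimate this step
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1                 # 1 = region to inpaint

# Core inpainting composite: keep the original outside the mask, use the
# model's prediction inside it. This is repeated at every denoising step.
composited = mask * predicted + (1 - mask) * original
print(composited)
```

Because the composite is re-applied each step, the filled-in region stays consistent with the untouched surroundings as denoising proceeds.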
Fine-tuning with DreamBooth
You can teach the model new concepts this way. That said, DreamBooth fine-tuning of Stable Diffusion is very sensitive to hyperparameters (learning rate, number of training steps) and prone to overfitting.
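One reason DreamBooth works at all is its prior-preservation loss: the usual diffusion loss on the few instance images, plus a weighted loss on generic class images so the model doesn't forget what the class looks like. A sketch with hypothetical noise predictions (in real training these come from the UNet on instance prompts like "a photo of sks dog" and class prompts like "a photo of a dog"):

```python
import numpy as np

rng = np.random.default_rng(1)

def mse(pred, target):
    # Standard diffusion training loss: MSE between predicted and true noise.
    return float(np.mean((pred - target) ** 2))

# Hypothetical noise predictions/targets for one batch.
instance_pred, instance_target = rng.normal(size=8), rng.normal(size=8)
class_pred, class_target = rng.normal(size=8), rng.normal(size=8)

prior_weight = 1.0  # weight of the prior-preservation term
loss = mse(instance_pred, instance_target) + prior_weight * mse(class_pred, class_target)
print(loss)
```

Dropping the prior term (weight 0) makes overfitting to the handful of instance images much more likely, which is part of why the method is so sensitive.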
Components of Stable Diffusion
- vae: the latent-diffusion autoencoder; compresses the image into a latent and decodes latents back into images
- text_encoder: vectorizes the prompt
- tokenizer: breaks text down into the smallest units the model can process
- unet: the denoising network; at each step it combines the latent-space vector with the text feature vector and predicts the noise to remove
- scheduler: drives the UNet through a series of denoising iterations, deciding how much noise to remove at each step and gradually refining the image from pure noise
- safety_checker: checks that the generated image does not contain inappropriate content
- feature_extractor: preprocesses the generated image (resize, normalize) into the input format the safety_checker expects
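How the components fit together can be sketched as a denoising loop. The functions below are hypothetical stubs (not the real diffusers API); only the data flow between them mirrors the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stubs standing in for the real components; only the data
# flow between them mirrors the Stable Diffusion pipeline.
def text_encoder(prompt):            # tokenizer + text_encoder
    return rng.normal(size=16)

def unet(latents, t, text_emb):      # predicts the noise in the latents
    return latents * 0.1 + text_emb.mean() * 0.01

def scheduler_step(noise_pred, latents):  # removes a bit of predicted noise
    return latents - noise_pred

def vae_decode(latents):             # latent -> image
    return np.clip(latents, -1, 1)

latents = rng.normal(size=16)        # start from pure noise
text_emb = text_encoder("a photo of a cat")
for t in range(50, 0, -1):           # the scheduler's timestep sequence
    noise_pred = unet(latents, t, text_emb)
    latents = scheduler_step(noise_pred, latents)
image = vae_decode(latents)
print(image.shape)  # (16,)
```

Note that the VAE is only used once at the end (and the safety_checker/feature_extractor run after that); everything inside the loop happens in latent space.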
ChangeLog
- 20231105–init
- 20260501–translate by claude code