week-3
Stable Diffusion features
Latent Diffusion
Since compute cost scales steeply with image resolution (especially in the self-attention layers), latent diffusion uses a separate model, a VAE (Variational Auto-Encoder), to compress images. The VAE's encoder maps an image into a lower-dimensional "latent representation" on which the UNet runs the diffusion process, and the VAE's decoder then recovers an image at the original resolution. In Stable Diffusion this shrinks a 512×512×3 image to a 64×64×4 latent.
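A minimal sketch of the compression idea, using a toy pure-PyTorch autoencoder (a hypothetical stand-in, not the real SD VAE architecture or weights): three stride-2 convolutions give the same 8× per-side spatial compression that Stable Diffusion's VAE applies.

```python
import torch
import torch.nn as nn

# Toy stand-in for the SD VAE (hypothetical architecture, not the real model):
# three stride-2 convolutions give 8x spatial compression, mirroring how
# Stable Diffusion's VAE turns a 512x512x3 image into a 64x64x4 latent.
class ToyVAE(nn.Module):
    def __init__(self, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_channels, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)          # compressed latent for the UNet
        return self.decoder(z), z    # reconstruction at the original resolution

vae = ToyVAE()
image = torch.randn(1, 3, 512, 512)
recon, latent = vae(image)
print(latent.shape)  # torch.Size([1, 4, 64, 64])
print(recon.shape)   # torch.Size([1, 3, 512, 512])
```

The UNet only ever sees the 64×64×4 latent, which is why the diffusion loop is so much cheaper than running it on raw pixels.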
Text-conditioned generation
Stable Diffusion uses the pre-trained text encoder (a Transformer) from CLIP. CLIP was trained to map text and images into a shared embedding space, so a caption's feature vector can be compared with image feature vectors for similarity. In Stable Diffusion, the text encoder's per-token outputs are fed to the UNet through cross-attention, steering generation toward the prompt.
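A tiny sketch of the similarity comparison CLIP was trained for, using made-up 4-d embeddings (real CLIP embeddings are 512- or 768-dimensional; the vectors here are hypothetical):

```python
import numpy as np

# Hypothetical low-dimensional embeddings standing in for CLIP's outputs.
text_emb = np.array([0.2, 0.9, 0.1, 0.4])    # "a photo of a cat"
cat_img  = np.array([0.25, 0.85, 0.05, 0.5]) # image of a cat
car_img  = np.array([0.9, 0.1, 0.8, 0.2])    # image of a car

def cosine_sim(a, b):
    # CLIP compares L2-normalized embeddings with a dot product,
    # i.e. cosine similarity.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim(text_emb, cat_img) > cosine_sim(text_emb, car_img))  # True
```

The cat image scores higher against the caption than the car image does, which is exactly the ranking behavior CLIP's contrastive training produces.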
Classifier-Free Guidance
During generation, the model sometimes leans more on the noisy image than on the text prompt when making its predictions. This happens mainly because the text embedding may be only weakly correlated with the image, so the model falls back on the noisy image to predict the output.
There is a small trick for this: Classifier-Free Guidance (CFG). During training, the text condition is randomly replaced with an empty prompt some fraction of the time, so the model also learns to make unguided predictions. At inference, the model runs both cases: "guided" (with the text prompt) and "unguided" (with the empty prompt). By extrapolating from the unguided prediction toward the guided one with a tunable guidance scale, the model produces images that match the description more closely.
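The combination step above is just a linear extrapolation between the two noise predictions. A sketch with random stand-in values (in practice these come from two UNet forward passes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the UNet's two noise predictions at one denoising step
# (hypothetical values; really these are two forward passes, one with the
# prompt embedding and one with the empty-prompt embedding).
eps_uncond = rng.normal(size=4)  # prediction with the empty prompt
eps_cond   = rng.normal(size=4)  # prediction with the text prompt

def cfg(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: start from the unconditional prediction and
    # push past the conditional one. scale=1 means no extra guidance;
    # ~7.5 is a common Stable Diffusion default.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

print(cfg(eps_uncond, eps_cond, 7.5))
```

Larger guidance scales follow the prompt more closely but tend to reduce image diversity and can oversaturate the result.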
Other types of conditional generation models
- Img2Img
- Used for image-to-image translation, such as style transfer or image super-resolution.
- Inpainting
- Image restoration. Basically, it predicts the masked region from the unmasked area of the image.
- Depth2Img
- Conditions generation on a depth map estimated from the input image, so the output keeps the input's spatial structure while its appearance changes.
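For inpainting, a common trick is to composite at every denoising step: keep the known pixels outside the mask and let the model's prediction fill only the masked region. A minimal sketch of that composite with hypothetical arrays:

```python
import numpy as np

# Hypothetical 4x4 single-channel "latents" for one denoising step.
original  = np.ones((4, 4))        # known image content
predicted = np.full((4, 4), 0.5)   # model's denoised estimate this step
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1                 # 1 = region to inpaint

# Core inpainting composite: keep the original outside the mask, use the
# model's prediction inside it. This is repeated at every denoising step.
composited = mask * predicted + (1 - mask) * original
print(composited)
```

Because the composite is re-applied each step, the filled-in region stays consistent with the untouched surroundings as denoising proceeds.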
Fine-tuning with DreamBooth
You can teach the model new concepts this way. That said, DreamBooth fine-tuning of Stable Diffusion is very sensitive to hyperparameters (learning rate, number of training steps) and prone to overfitting.
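One reason DreamBooth works at all is its prior-preservation loss: the usual diffusion loss on the few instance images, plus a weighted loss on generic class images so the model doesn't forget what the class looks like. A sketch with hypothetical noise predictions (in real training these come from the UNet on instance prompts like "a photo of sks dog" and class prompts like "a photo of a dog"):

```python
import numpy as np

rng = np.random.default_rng(1)

def mse(pred, target):
    # Standard diffusion training loss: MSE between predicted and true noise.
    return float(np.mean((pred - target) ** 2))

# Hypothetical noise predictions/targets for one batch.
instance_pred, instance_target = rng.normal(size=8), rng.normal(size=8)
class_pred, class_target = rng.normal(size=8), rng.normal(size=8)

prior_weight = 1.0  # weight of the prior-preservation term
loss = mse(instance_pred, instance_target) + prior_weight * mse(class_pred, class_target)
print(loss)
```

Dropping the prior term (weight 0) makes overfitting to the handful of instance images much more likely, which is part of why the method is so sensitive.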
Components of Stable Diffusion
- vae: the latent-diffusion autoencoder; compresses the image into a latent and decodes latents back into images
- text_encoder: vectorizes the prompt
- tokenizer: breaks text down into the smallest units the model can process
- unet: the denoising network; at each step it combines the latent-space vector with the text feature vector and predicts the noise to remove
- scheduler: drives the UNet through a series of denoising iterations, deciding how much noise to remove at each step and gradually refining the image from pure noise
- safety_checker: checks that the generated image does not contain inappropriate content
- feature_extractor: preprocesses the generated image (resize, normalize) into the input format the safety_checker expects
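How the components fit together can be sketched as a denoising loop. The functions below are hypothetical stubs (not the real diffusers API); only the data flow between them mirrors the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stubs standing in for the real components; only the data
# flow between them mirrors the Stable Diffusion pipeline.
def text_encoder(prompt):            # tokenizer + text_encoder
    return rng.normal(size=16)

def unet(latents, t, text_emb):      # predicts the noise in the latents
    return latents * 0.1 + text_emb.mean() * 0.01

def scheduler_step(noise_pred, latents):  # removes a bit of predicted noise
    return latents - noise_pred

def vae_decode(latents):             # latent -> image
    return np.clip(latents, -1, 1)

latents = rng.normal(size=16)        # start from pure noise
text_emb = text_encoder("a photo of a cat")
for t in range(50, 0, -1):           # the scheduler's timestep sequence
    noise_pred = unet(latents, t, text_emb)
    latents = scheduler_step(noise_pred, latents)
image = vae_decode(latents)
print(image.shape)  # (16,)
```

Note that the VAE is only used once at the end (and the safety_checker/feature_extractor run after that); everything inside the loop happens in latent space.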
ChangeLog
- 20231105–init
- 20260501–translate by claude code