Diffusion Models
Image/video generation, noise scheduling, guidance, Stable Diffusion, DALL-E
Diffusion models are a class of generative models that learn to create images by reversing a noise-addition process. During training, Gaussian noise is progressively added to images, and the model learns to undo it one step at a time by predicting the noise that was added. At inference, the model starts from pure noise and iteratively denoises it into a coherent image.
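The forward (noising) process and the training objective above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: it uses the linear β schedule from the original DDPM setup (β from 1e-4 to 0.02 over T = 1000 steps) and the standard closed form x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε, which lets training jump straight to any timestep. The helper names `q_sample` and `training_loss` are hypothetical, not part of any library.

```python
import numpy as np

# Assumed DDPM-style linear schedule; q_sample / training_loss are hypothetical names.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # per-step noise variance beta_t (t is 0-indexed)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = product of alpha_s for s <= t


def q_sample(x0: np.ndarray, t: int, rng: np.random.Generator):
    """Noise x0 directly to step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps


def training_loss(predict_noise, x0: np.ndarray, rng: np.random.Generator) -> float:
    """One training term: pick a random t, noise x0, score the noise prediction by MSE."""
    t = int(rng.integers(0, T))
    xt, eps = q_sample(x0, t, rng)
    eps_hat = predict_noise(xt, t)   # this prediction is the denoising network's job
    return float(np.mean((eps_hat - eps) ** 2))


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy "image" scaled to [-1, 1]
for t in (0, 250, 999):
    # Weight on the original image shrinks toward 0 as t grows.
    print(t, round(float(np.sqrt(alpha_bars[t])), 4))
```

With this schedule the image weight √ᾱ at the final step is well under 0.01, which is what "until it is pure noise" means in practice.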
Key Points
- Forward process: gradually add Gaussian noise to an image over T timesteps until it is pure noise
- Reverse process: learn to predict and remove noise at each step (the model's job)
- U-Net backbone: the denoising network; skip connections preserve fine-grained spatial detail
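- CLIP conditioning: condition image generation on text using embeddings from CLIP, a contrastive text-image model (e.g., Stable Diffusion 1.x feeds CLIP text-encoder embeddings into the U-Net via cross-attention)
- Latent Diffusion: denoise in a compressed latent space produced by an autoencoder (used by Stable Diffusion); in Stable Diffusion 1.x a 512×512×3 image maps to a 64×64×4 latent, about 48× fewer values, so each denoising step is far cheaper than in pixel space
- Classifier-Free Guidance (CFG): improves text-image alignment by combining conditional and unconditional noise predictions; a higher guidance scale means stronger prompt adherence, at the cost of diversity (very high values can oversaturate images)
- Sampling methods: DDPM (typically 1000 steps), DDIM (~50 steps), DPM++ / DPM-Solver++ (~15–20 steps); fewer steps means faster inference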
- Latent Diffusion: denoise in a compressed latent space (used by Stable Diffusion) — 4× faster
- Classifier-Free Guidance (CFG): improves text-image alignment; higher CFG = more adherence to prompt
- Sampling methods: DDPM (1000 steps), DDIM (50 steps), DPM++ (15–20 steps) — faster inference
- ControlNet: add spatial control — use depth maps, pose skeletons, edges as additional conditions
- Inpainting: regenerate masked regions of an image while preserving the rest
- Video generation: extend diffusion to temporal sequences (Sora uses diffusion + Transformer)
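The reverse process, classifier-free guidance, and a few-step sampler fit together roughly as below. This is a sketch assuming a deterministic DDIM update (η = 0), the guidance rule ε̂ = ε_uncond + w · (ε_cond − ε_uncond), and the same linear β schedule as DDPM. `fake_eps_model` is a hypothetical stand-in for a trained U-Net: it returns the noise that would be present if the clean image were a constant `cond` (or zeros for the unconditional pass), so the loop runs without any weights. `steps=50` and `guidance_scale=7.5` are illustrative values only.

```python
import numpy as np

T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # assumed linear schedule


def fake_eps_model(x, t, cond):
    """Hypothetical stand-in for a trained noise predictor eps_theta(x_t, t, c).

    cond=None plays the role of the empty (unconditional) prompt.
    """
    target = np.zeros_like(x) if cond is None else np.full_like(x, cond)
    ab = alpha_bars[t]
    # Noise that would explain x if the clean image were `target`.
    return (x - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)


def cfg_eps(x, t, cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    eps_uncond = fake_eps_model(x, t, None)
    eps_cond = fake_eps_model(x, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def ddim_sample(shape, cond, steps=50, guidance_scale=7.5, seed=0):
    """Deterministic DDIM (eta = 0) over `steps` evenly spaced timesteps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from pure noise
    timesteps = np.linspace(T - 1, 0, steps).round().astype(int)
    for i, t in enumerate(timesteps):
        eps = cfg_eps(x, t, cond, guidance_scale)
        ab_t = alpha_bars[t]
        # Predicted clean image implied by the current noise estimate.
        x0_pred = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        # Step to the next (less noisy) timestep; after the last step use alpha_bar = 1.
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < steps else 1.0
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
    return x
```

Because this toy predictor is linear, every step predicts x̂₀ = w · cond exactly, so `ddim_sample((4, 4), cond=0.5, guidance_scale=1.0)` returns all 0.5 while `guidance_scale=7.5` overshoots to 3.75. Real networks are not linear, but this extrapolation is one intuition for why very high guidance scales tend to oversaturate.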
| Model | Company | Key Feature |
|---|---|---|
| Stable Diffusion | Stability AI (open) | Open weights, local runs |
| DALL-E 3 | OpenAI | Strong prompt following; trained on descriptive synthetic captions, with GPT-4 expanding prompts |
| Midjourney | Midjourney | Artistic quality, Discord-based |
| Imagen 3 | Google DeepMind | Photorealism, text rendering |
| Flux | Black Forest Labs | Open-weight variants, high fidelity |
| Sora | OpenAI | Text-to-video, clips up to about a minute |
Real-World Example
Adobe Firefly brings diffusion models into Photoshop's Generative Fill, letting designers remove objects, extend backgrounds, and generate design elements from text prompts. Stability AI's Stable Diffusion 3.5 Medium is designed to run on consumer hardware, enabling fully local image generation.