Diffusion models are a class of generative models that learn to create images by reversing a noise-addition process. During training, Gaussian noise is progressively added to images, and the model learns to undo that noise step by step. At inference, the model starts from pure noise and iteratively denoises it to produce a coherent image.
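
The noise-addition side of this process has a convenient closed form: a noisy sample at any timestep t can be drawn directly from the clean image as x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the running product of (1 - beta) and eps is standard Gaussian noise. The sketch below illustrates this with plain Python lists standing in for image pixels. The function names are hypothetical, and the linear schedule from 1e-4 to 0.02 over 1000 steps is an assumption borrowed from the original DDPM setup, not something this text specifies.

```python
import math
import random


def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule values)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta_s) for s = 0..t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Draw x_t from x_0 in one shot; returns (x_t, noise)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


# Small demonstration: the signal fades as t grows.
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule())
x0 = [0.5, -0.25, 1.0]  # stand-in for image pixels
for t in (0, 499, 999):
    xt, _ = add_noise(x0, t, alpha_bars, rng)
    print(t, round(alpha_bars[t], 4), [round(v, 3) for v in xt])
```

During training, the returned noise is what a noise-prediction network would typically be asked to recover from x_t and t. By the last timestep abar_t is close to zero, so x_t is almost entirely noise.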

Key Points

  • Forward process: gradually add Gaussian noise to an image over T timesteps until it is pure noise
  • Reverse process: learn to predict and remove noise at each step (the model's job)
  • U-Net backbone: the denoising network; skip connections preserve fine-grained spatial detail
  • CLIP guidance: condition image generation on text by using a contrastive text-image model
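  • Latent Diffusion: denoise in a compressed latent space instead of pixel space (used by Stable Diffusion, whose autoencoder downsamples each side by 8×); cited as roughly 4× faster, though the gain depends on resolution and hardware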
  • Classifier-Free Guidance (CFG): improves text-image alignment; a higher guidance scale means closer adherence to the prompt, usually at some cost to diversity
  • Sampling methods: DDPM (about 1000 steps), DDIM (about 50 steps), DPM++ (about 15–20 steps); fewer steps means faster inference
  • ControlNet: adds spatial control by using depth maps, pose skeletons, or edge maps as additional conditions
  • Inpainting: regenerate masked portions of an image while preserving the rest
  • Video generation: extend diffusion to temporal sequences (Sora combines diffusion with a Transformer backbone)
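
Two of the points above, classifier-free guidance and reduced-step sampling, can be sketched in a few lines. The guidance rule shown here is the commonly used form eps = eps_uncond + s * (eps_cond - eps_uncond), where s is the guidance scale. The timestep helper picks an evenly strided subset of the training timesteps, which is one simple way a DDIM-style sampler can run in about 50 steps instead of 1000. Both function names are hypothetical, and plain lists stand in for the model's noise predictions.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and text-conditioned noise predictions.

    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the conditional one, and larger values
    push further in the direction the prompt suggests.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def strided_timesteps(train_steps, sample_steps):
    """Evenly spaced subset of training timesteps, from high noise to low."""
    if not 1 <= sample_steps <= train_steps:
        raise ValueError("sample_steps must be between 1 and train_steps")
    stride = train_steps // sample_steps
    return [i * stride for i in range(sample_steps)][::-1]


# Illustrative values only: two-element "predictions" and a scale of 7.5.
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], 7.5)
schedule = strided_timesteps(1000, 50)
print(guided)
print(schedule[:3], "...", schedule[-3:])
```

With these inputs the schedule runs from timestep 980 down to 0 in strides of 20. Real samplers differ in how they space the steps and in the update applied at each one, so this should be read as an illustration of the idea and not as any particular library's scheduler.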

Model             Company               Key Feature
Stable Diffusion  Stability AI (open)   Open weights, runs locally
DALL-E 3          OpenAI                Tight text coherence via GPT-4 captions
Midjourney        Midjourney            Artistic quality, Discord-based
Imagen 3          Google DeepMind       Photorealism, text rendering
Flux              Black Forest Labs     Open, high fidelity
Sora              OpenAI                Text-to-video, clips up to about a minute

Real-World Example

Adobe Firefly integrates diffusion models into Photoshop's Generative Fill, allowing designers to erase objects, extend backgrounds, and generate design elements from text. Stability AI's Stable Diffusion 3.5 is released with open weights, and its smaller variants can run on consumer hardware, enabling entirely local image generation.