A generative model that creates data by learning to reverse a gradual noising process — starting from pure random noise and iteratively refining it into coherent images, audio, or video through learned denoising steps.
In Depth
Diffusion Models generate data through a two-phase process inspired by thermodynamics. In the forward (diffusion) process, real data is gradually corrupted by adding Gaussian noise over many steps until it becomes pure random noise. In the reverse (denoising) process, a neural network learns to undo each step of corruption — predicting and removing the noise at each stage. Once trained, new data is generated by starting from random noise and running the learned denoising process, which progressively sculpts the noise into a coherent output.
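The forward process described above has a convenient closed form: the noisy sample at any step can be drawn directly from the clean data. Below is a minimal NumPy sketch of that closed form under a linear noise schedule; the schedule values and variable names (`T`, `beta`, `alpha_bar`) are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

# Illustrative linear variance schedule (values are assumptions).
T = 1000
beta = np.linspace(1e-4, 0.02, T)   # per-step noise variances
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)       # cumulative product, often written ᾱ_t

def forward_noise(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε,  with ε ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # training teaches a network to predict eps from (xt, t)

x0 = np.ones((4, 4))                # stand-in for real data
xt, eps = forward_noise(x0, t=T - 1)
# At t = T-1, ᾱ_t is nearly zero, so x_t is almost pure Gaussian noise.
```

The denoising network is then trained to predict `eps` given the corrupted sample and the step index, which is exactly the "predicting and removing the noise" objective described above.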
Diffusion Models rose to prominence in 2021-2022, surpassing GANs as the dominant paradigm for image generation. DALL-E 2 (OpenAI), Stable Diffusion (Stability AI), Midjourney, and Imagen (Google) all use diffusion-based architectures. Their key advantage over GANs is training stability — they do not suffer from mode collapse or the adversarial training dynamics that make GANs difficult to train. They also offer fine-grained control through techniques like classifier-free guidance, which lets users trade off between fidelity and diversity, and text conditioning via CLIP or T5 embeddings.
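The fidelity–diversity trade-off of classifier-free guidance comes down to a single extrapolation between two noise predictions. A minimal sketch, assuming a guidance scale `w` and toy noise-prediction vectors in place of a real model's outputs:

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance combines conditional and unconditional
    noise predictions: eps = eps_uncond + w * (eps_cond - eps_uncond).
    w = 1 recovers the conditional model; w > 1 pushes samples toward
    the condition (higher fidelity, lower diversity)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-ins for the two network outputs at one denoising step.
eps_c = np.array([1.0, 0.5])   # prediction with the text condition
eps_u = np.array([0.2, 0.1])   # prediction with the condition dropped
guided = cfg_noise(eps_c, eps_u, w=7.5)
```

In practice the two predictions come from the same network, run once with the text embedding and once with a null condition, so guidance roughly doubles the cost of each denoising step.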
The primary limitation of diffusion models is sampling speed — generating an image requires running the denoising network many times (often 20-50 steps), making generation slower than single-pass models like GANs. Research on distillation, consistency models, and adaptive step-size methods has dramatically accelerated sampling. Diffusion has also expanded beyond images to video generation (Sora, Runway), audio synthesis, 3D object generation, molecular design in drug discovery, and even robotic policy generation — establishing it as perhaps the most versatile generative framework in modern AI.
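The sampling cost discussed above is easiest to see in the generation loop itself: one network call per denoising step. Below is a hedged sketch of DDPM-style ancestral sampling with a placeholder in place of the trained denoiser; the step count and schedule are illustrative assumptions.

```python
import numpy as np

# Illustrative short schedule; real samplers may use anywhere from a few
# steps (distilled models) to ~1000 (original DDPM).
T = 50
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

def predict_noise(x, t):
    # Placeholder for the trained denoising network (e.g. a U-Net).
    return np.zeros_like(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))          # start from pure noise, x_T ~ N(0, I)
for t in reversed(range(T)):             # one network evaluation per step
    eps = predict_noise(x, t)
    mean = (x - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])
    z = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(beta[t]) * z      # add noise except at the final step
# x is the generated sample after T sequential denoising passes.
```

Because the loop is inherently sequential, acceleration research (distillation, consistency models, adaptive step sizes) focuses on shrinking `T` rather than parallelizing the steps.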
Diffusion Models generate data by learning to denoise — starting from random noise and iteratively refining it into high-quality outputs. They power the current generation of AI image, video, and audio generation.