
About this keynote: This keynote surveys diffusion models for image synthesis, tracing the timeline of notable works and highlighting key contributions from OpenAI.

Timeline of Notable Works on Diffusion Models at OpenAI

OpenAI (Tim Brooks, Yang Song)

iGPT (2020) → ADM (2021) → DALL-E (2021) → GLIDE (2022) → DALL-E 2 (2022) → DALL-E 3 (2023) → Consistency Models (2023, 2024, 2025)


Broad Applications of Diffusion Models

Diagram of diffusion model for drug discovery
Drug Discovery [Huang et al., 2024]
Diagram of Diffusion-LM for NLP
Natural Language Processing [Li et al., 2022]
Diagram of diffusion model for audio synthesis
Audio Synthesis [Kong et al., 2020]
Diagram of cascaded diffusion model for computer vision
Computer Vision [Ho et al., 2022]

Deep Generative Learning

Learning to generate data

Deep Generative Learning
Samples from a Data Distribution → Train → Neural Network → Sample
Source: CVPR 2023 Tutorial: Denoising Diffusion-Based Generative Modeling.

The Landscape of Deep Generative Learning

The Landscape of Deep Generative Learning
Source: CVPR 2023 Tutorial: Denoising Diffusion-Based Generative Modeling.

OpenAI - iGPT (Chen et al., 2020)

iGPT Framework
The iGPT framework, showing the process from image to sequence and the autoregressive and BERT-based training objectives.
iGPT class-unconditional samples
Class-unconditional samples from iGPT-L trained on input images of resolution 96×96.
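iGPT treats an image as a raster-order sequence of discrete pixel values and trains a transformer with a next-pixel (autoregressive) objective. A minimal sketch of that objective, with random arrays standing in for the image and for the transformer's output logits (both purely illustrative):

```python
import numpy as np

def image_to_sequence(img):
    """Flatten an (H, W) image of discrete pixel values into a raster-order sequence."""
    return img.reshape(-1)

def autoregressive_nll(logits, seq):
    """Mean negative log-likelihood of predicting each pixel from its predecessors.
    logits[t] are stand-in model scores over the pixel vocabulary at position t."""
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return -np.mean(np.log(p[np.arange(len(seq)), seq] + 1e-12))

rng = np.random.default_rng(0)
img = rng.integers(0, 16, size=(4, 4))      # toy 4x4 image with a 16-colour palette
seq = image_to_sequence(img)                # length-16 pixel sequence
logits = rng.normal(size=(len(seq), 16))    # stand-in for transformer outputs
print(seq.shape, autoregressive_nll(logits, seq) > 0)
```

In the real model the logits at position t depend only on pixels before t (causal masking); the sketch above only shows how the sequence and loss are formed.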

OpenAI - Improved DDPM (Nichol & Dhariwal, 2021)

DDPM Illustration
DDPM illustration and forward/reverse process.
Iters  T   Schedule  Objective  NLL   FID
200K   1K  linear    L_simple   3.99  32.5
200K   4K  linear    L_simple   3.77  31.3
200K   4K  linear    L_hybrid   3.66  32.2
200K   4K  cosine    L_simple   3.68  27.0
200K   4K  cosine    L_hybrid   3.62  28.0
200K   4K  cosine    L_vlb      3.57  56.7
1.5M   4K  cosine    L_hybrid   3.57  19.2
1.5M   4K  cosine    L_vlb      3.53  40.1
Ablating the noise schedule and training objective on ImageNet 64×64.
IDDPM Generated Samples
Class-conditional ImageNet 64×64 samples generated with 250 sampling steps from the L_hybrid model (FID 2.92), showing high diversity.

OpenAI - Ablated Diffusion Model (ADM) (Dhariwal & Nichol, 2021)

Ablated Diffusion Model (ADM)
The architecture of the Ablated Diffusion Model (ADM).
This slide is derived from https://www.crcv.ucf.edu/wp-content/uploads/2018/11/Group-6-Paper-2-Dalle-2.pdf.

Classifier Guidance (Dhariwal & Nichol, 2021)

Classifier Guidance
Using a classifier to guide the diffusion model to generate more class-consistent images.
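In epsilon space, classifier guidance amounts to shifting the model's noise prediction by the gradient of a noisy classifier, as in Dhariwal & Nichol (2021). A minimal sketch with stand-in arrays for the noise prediction and classifier gradient:

```python
import numpy as np

def guided_eps(eps, grad_log_py, alpha_bar_t, scale=1.0):
    """Epsilon-space classifier guidance (Dhariwal & Nichol, 2021):
    eps_hat = eps - scale * sqrt(1 - alpha_bar_t) * grad_x log p(y | x_t).
    Subtracting the scaled classifier gradient steers denoising toward class y."""
    return eps - scale * np.sqrt(1.0 - alpha_bar_t) * grad_log_py

eps = np.zeros(4)                       # stand-in for the model's noise prediction
grad = np.array([1.0, -1.0, 0.5, 0.0])  # stand-in for the classifier gradient
eps_hat = guided_eps(eps, grad, alpha_bar_t=0.75, scale=2.0)
print(eps_hat)
```

Larger `scale` trades sample diversity for class consistency, which is the fidelity/diversity knob the paper ablates.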

OpenAI – GLIDE (Nichol et al., 2022)

GLIDE Framework
The GLIDE framework, which uses a text encoder to guide the ADM model.
GLIDE Framework
Photorealistic Image Samples.
GLIDE Results
Comparison of Image Inpainting Quality on Real Images.
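For text conditioning, GLIDE found classifier-free guidance to outperform CLIP guidance: the model is queried with and without the text condition, and sampling extrapolates from the unconditional prediction toward the conditional one. A minimal sketch with toy stand-in arrays:

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Classifier-free guidance as used in GLIDE:
    eps_hat = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # stand-in unconditional noise prediction
eps_c = np.array([1.0, -1.0])  # stand-in text-conditional noise prediction
# scale > 1 over-emphasizes the text condition, trading diversity for fidelity
print(classifier_free_guidance(eps_c, eps_u, scale=3.0))
```

At `scale = 1` this reduces to ordinary conditional sampling; GLIDE uses scales above 1 for its photorealistic samples.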

DALL-E (Ramesh et al., 2021)

The two-stage training of DALL-E.
This slide is derived from https://www.youtube.com/watch?v=oENCNi4JxPY.
Image Credit: https://openai.com/index/dall-e/.

OpenAI – unCLIP/DALL-E 2 (Ramesh et al., 2022)

DALL-E 2 Capabilities: Inferring contextual details
DALL-E 2 Capabilities: Zero-shot visual reasoning
DALL-E 2 capabilities: inferring contextual details and zero-shot visual reasoning.
Image Credit: https://openai.com/index/dall-e/.
unCLIP/DALL-E 2 Framework
The unCLIP/DALL-E 2 framework, showing the prior and decoder architecture.
Diffusion Prior
The Diffusion Prior model in DALL-E 2.
This slide is derived from https://www.crcv.ucf.edu/wp-content/uploads/2018/11/Group-6-Paper-2-Dalle-2.pdf.
Importance of the Prior
Comparison showing why the prior with CLIP image embedding is crucial for high-quality image generation.
This slide is derived from https://www.crcv.ucf.edu/wp-content/uploads/2018/11/Group-6-Paper-2-Dalle-2.pdf.
Text Diffs and Variations - Landscape
Text diffs applied to images and variations of an input image, preserving semantic and stylistic elements.
Text Diffs and Variations - Source
Text Diffs and Variations - Variants
Variations of an input image by encoding with CLIP and then decoding with a diffusion model. The variations preserve both semantic information like the overlapping strokes in the logo, as well as stylistic elements like the color gradients in the logo, while varying the non-essential details.
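The unCLIP pipeline is two-stage: a prior maps the CLIP text embedding to a CLIP image embedding, and a diffusion decoder generates the image conditioned on that embedding. The structural sketch below uses toy random linear maps as stand-ins for the real networks (the function names and shapes are illustrative, not OpenAI's API):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension

# Purely illustrative stand-ins for the trained components:
W_text = rng.normal(size=(D, D))
W_prior = rng.normal(size=(D, D))
clip_text_encoder = lambda t: W_text @ t   # text features -> CLIP text embedding
prior = lambda z: W_prior @ z              # text embedding -> CLIP image embedding
decoder = lambda z, t: np.tanh(z)          # stands in for the diffusion decoder

def generate_unclip(text_features, prior, decoder, clip_text_encoder):
    """unCLIP's two-stage generation: prior produces a CLIP image embedding,
    then the diffusion decoder generates an image conditioned on it."""
    z_text = clip_text_encoder(text_features)
    z_image = prior(z_text)                 # this is the stage ablated in "Importance of the Prior"
    return decoder(z_image, text_features)

img = generate_unclip(rng.normal(size=D), prior, decoder, clip_text_encoder)
print(img.shape)
```

Because the decoder is conditioned on a CLIP image embedding, re-encoding a real image and decoding it yields the "variations" shown above: semantics and style are preserved while non-essential details vary.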

OpenAI – DALL-E 3 (Betker et al., 2023)

DALL-E 3 Results
Compelling results from DALL-E 3 with detailed text prompts.
DALL-E 3 Alt-text Examples
Examples of alt-text accompanying selected images scraped from the internet, short synthetic captions (SSC), and descriptive synthetic captions (DSC).

OpenAI – Consistency Training & Consistency Distillation (Song et al., 2023)

Consistency Model Samples
Samples generated by EDM (top), CT + single-step generation (middle), and CT + two-step generation (bottom). All corresponding images are generated from the same initial noise.
Consistency Model Applications
Applications of consistency models: (a) Colorization, (b) Super-resolution, and (c) Stroke-guided image editing.
Consistency Models
Tradeoff between FID and effective sampling compute; consistency models enable faster inference with competitive fidelity.
Source: Lu & Song, ICLR 2025.
Animated demonstration of consistency model (left) and diffusion model (right) generation processes.
Diffusion vs Consistency Model Sampling
Illustration on diffusion model sampling (red) and consistency model sampling (blue).
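A consistency model learns a function f that maps any noisy point on a trajectory directly back to its clean endpoint, so one network call generates a sample; optional extra steps re-noise and denoise to refine it (multistep sampling in Song et al., 2023). The sketch below uses an arbitrary shrinkage function as a stand-in for a trained consistency function:

```python
import numpy as np

def multistep_consistency_sample(f, x_T, sigmas, sigma_min, rng):
    """Multistep consistency sampling: one call of f maps noise straight to data;
    each extra step re-injects noise at a lower level and denoises again."""
    x = f(x_T, sigmas[0])                                   # single-step generation
    for sigma in sigmas[1:]:                                # optional refinement steps
        z = rng.standard_normal(x.shape)
        x_noisy = x + np.sqrt(sigma**2 - sigma_min**2) * z  # re-noise at level sigma
        x = f(x_noisy, sigma)                               # map back toward the data
    return x

# Toy stand-in for a trained consistency function (illustrative only).
f = lambda x, sigma: x / (1.0 + sigma)

rng = np.random.default_rng(0)
x_T = 80.0 * rng.standard_normal(16)  # start from pure noise at sigma_max = 80
sample = multistep_consistency_sample(f, x_T, sigmas=[80.0, 20.0, 5.0],
                                      sigma_min=0.002, rng=rng)
print(sample.shape)
```

This is what the FID-versus-compute tradeoff above measures: a diffusion sampler needs many network calls per sample, while the consistency sampler needs one call plus however many refinement steps the fidelity budget allows.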

References

  1. Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. (2020). Generative Pretraining From Pixels. Proceedings of the 37th International Conference on Machine Learning (ICML 2020) (pp. 1691--1703).
  2. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (pp. 6840--6851).
  3. Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
  4. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. Proceedings of the 9th International Conference on Learning Representations (ICLR 2021).
  5. Nichol, A., & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. Proceedings of the 38th International Conference on Machine Learning (ICML 2021) (pp. 8162--8171).
  6. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-Shot Text-to-Image Generation. Proceedings of the 38th International Conference on Machine Learning (ICML 2021) (pp. 8821--8831).
  7. Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., & Chen, M. (2021). GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv preprint arXiv:2112.10741.
  8. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125.
  9. Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the Design Space of Diffusion-Based Generative Models. Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
  10. Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency Models. Proceedings of the 40th International Conference on Machine Learning (ICML 2023).
  11. OpenAI. (2023). DALL·E 3. https://openai.com/dall-e-3.
  12. Song, Y., & Dhariwal, P. (2024). Improved Techniques for Training Consistency Models. Proceedings of the 12th International Conference on Learning Representations (ICLR 2024).
  13. Lu, C., & Song, Y. (2025). Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models. Proceedings of the 13th International Conference on Learning Representations (ICLR 2025).
  14. Huang et al. (2024). A Dual Diffusion Model Enables 3D Molecule Generation and Lead Optimization Based on Target Pockets. Nature Communications.
  15. Ho et al. (2022). Cascaded Diffusion Models for High Fidelity Image Generation. Journal of Machine Learning Research (JMLR).
  16. Li et al. (2022). Diffusion-LM Improves Controllable Text Generation. Advances in Neural Information Processing Systems 35 (NeurIPS 2022).
  17. Kong et al. (2020). DiffWave: A Versatile Diffusion Model for Audio Synthesis. Proceedings of the 9th International Conference on Learning Representations (ICLR 2021).

Contact: bili_sakura@zju.edu.cn

© 2024 Sakura