An Introduction to Visual Generation

About this keynote: This post summarizes a survey-style introduction to visual generation, tracing how deep generative models extend from unconditional image synthesis toward increasingly rich conditional controls and advanced applications.

Preliminary

Deep Generative Models

Early work approached visual generation from two directions: denoising diffusion probabilistic models (DDPM) [Ho et al., 2020] and score-based generative models (NCSN) [Song and Ermon, 2019]. Score-SDE [Song et al., 2021] later unified both under a single stochastic differential equation (SDE) framework. Newer frameworks such as Flow Matching [Lipman et al., 2023] have since emerged that fall outside the scope of Score-SDE.

Score-SDE illustration — An illustration of deep generative models [Song et al., 2021]

Stochastic interpolant paradigm — The stochastic interpolant paradigm [Albergo et al., 2025]

More recently, stochastic interpolants [Albergo et al., 2025] have offered a promising framework that unifies most of these generative modeling approaches. However, the community has not yet agreed on a single name for this family of models. For simplicity and clarity, we refer to them as deep generative models throughout this presentation.

Landscape of deep generative learning — The landscape of deep generative learning.

Outline

Unconditional/Class-Conditional Image Generation
Text-to-Image Generation
Controllable Image Generation
Image Editing
Image-to-Image Translation
Video Generation
Advanced Applications
- Inversion Task
- Representation Extractor
- Generalist Vision Learner
- 3D/4D Generation & World Modeling
- etc.

One Training Objective

“There is only one precise way of presenting the laws, and that is by means of differential equations. They have the advantage of being fundamental and, so far as we know, precise.”

— Richard P. Feynman

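Many deep generative models share one training recipe [Lai et al., 2025]: a network takes the noisy sample \( \mathbf{x}_t \) and the time \( t \) as input and regresses onto a linear combination of the clean data \( \mathbf{x}_0 \) and the noise \( \boldsymbol{\epsilon} \), with the squared error weighted and averaged over time.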

\[\mathcal{L}(\phi) := \mathbb{E}_{\mathbf{x}_0,\,\boldsymbol{\epsilon}} \left[ \underbrace{ \mathbb{E}_{p_{\mathrm{time}}(t)} }_{\text{time distribution}} \left[ \underbrace{ \omega(t) }_{\text{time weighting}} \underbrace{ \left\| \mathrm{NN}_{\phi}\!\bigl(\mathbf{x}_t, t\bigr) - \bigl(A_t \mathbf{x}_0 + B_t \boldsymbol{\epsilon}\bigr) \right\|_2^2 }_{\text{MSE part}} \right] \right]\]

The design space can be summarized along four axes:

(A) Noise schedule: the coefficients \( \alpha_t \) and \( \sigma_t \) of the forward process \( \mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon} \).
(B) Prediction type of \( \mathrm{NN}_{\phi} \) and the matching regression target \( A_t \mathbf{x}_0 + B_t \boldsymbol{\epsilon} \).
(C) Time-weighting function \( \omega(\cdot) : [0, T] \to \mathbb{R}_{\geq 0} \).
(D) Time distribution \( p_{\mathrm{time}} \).
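
The four axes map directly onto the loss above. The following is a minimal Python sketch of one Monte Carlo sample of that loss, not the implementation from [Lai et al., 2025]. It assumes a linear schedule \( \alpha_t = 1 - t \), \( \sigma_t = t \), a uniform time distribution on \( [0, 1) \), and a constant weighting by default; the function names and the toy `zero_model` are hypothetical.

```python
import random


def linear_schedule(t):
    """(A) Noise schedule. Assumed here: alpha_t = 1 - t, sigma_t = t."""
    return 1.0 - t, t


def target_coeffs(prediction_type):
    """(B) Coefficients (A_t, B_t) of the target A_t * x0 + B_t * eps.

    The "velocity" entry is d/dt (alpha_t * x0 + sigma_t * eps) and is only
    valid for the linear schedule assumed above.
    """
    table = {
        "noise": (0.0, 1.0),      # predict eps
        "data": (1.0, 0.0),       # predict x0
        "velocity": (-1.0, 1.0),  # predict eps - x0
    }
    return table[prediction_type]


def sample_loss(model, x0, prediction_type, weight=lambda t: 1.0, rng=random):
    """One Monte Carlo sample of the objective for a data point x0 (list of floats)."""
    t = rng.random()                                   # (D) t ~ Uniform[0, 1)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]            # Gaussian noise
    alpha_t, sigma_t = linear_schedule(t)
    x_t = [alpha_t * a + sigma_t * e for a, e in zip(x0, eps)]
    a_t, b_t = target_coeffs(prediction_type)
    target = [a_t * a + b_t * e for a, e in zip(x0, eps)]
    pred = model(x_t, t)
    squared_error = sum((p - y) ** 2 for p, y in zip(pred, target))
    return weight(t) * squared_error                   # (C) time weighting


def zero_model(x_t, t):
    """Hypothetical stand-in for NN_phi that always predicts zeros."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
print(sample_loss(zero_model, [0.5, -1.0], "velocity", rng=rng))
```

Swapping `prediction_type`, `weight`, or the schedule changes the model family while the rest of the loop stays the same.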

Unconditional/Class-Conditional Image Generation

This section covers unconditional generation \( p(\mathbf{x}) \) and class-conditional generation \( p(\mathbf{x} \mid c) \), where \( c \) is a discrete class label (e.g., an ImageNet category) rather than a text prompt or visual control signal.

ADM ImageNet 512 samples — Selected samples from the best ImageNet \( 512\times512 \) model (FID 3.85) from ADM [Dhariwal and Nichol, 2021].

BigGAN-deep [Brock et al., 2019] vs. ADM-G (ADM with classifier guidance) vs. ADM (the same class-conditional model without guidance). Lower is better for FID and sFID; higher is better for precision and recall.

Res.	Model	FID	sFID	Precision	Recall
128²	BigGAN-deep	6.02	7.18	0.86	0.35
128²	ADM-G	2.97	5.09	0.78	0.59
128²	ADM	5.91	5.09	0.70	0.65
256²	BigGAN-deep	6.95	7.36	0.87	0.28
256²	ADM-G	4.59	5.25	0.82	0.52
256²	ADM	10.94	6.02	0.69	0.63
512²	BigGAN-deep	8.43	8.13	0.88	0.29
512²	ADM-G	7.72	6.57	0.87	0.42
512²	ADM	23.24	10.19	0.73	0.60

Text-to-Image Generation

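Synthesize images from natural language descriptions. Latent Diffusion Models (LDM; Stable Diffusion) [Rombach et al., 2022] run the diffusion process in the compressed latent space of a pre-trained autoencoder, which opened the era of latent modeling in research and of AI-generated content (AIGC) in industry.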

Stable Diffusion U-Net architecture [Tian et al., 2024]

Controllable Image Generation

Steer generation with spatial, structural, or semantic controls. ControlNet [Zhang et al., 2023] injects auxiliary conditions (e.g., edges, pose) into a frozen diffusion U-Net.

ControlNet examples — Canny-edge and human-pose control.

ControlNet architecture [Zhang et al., 2023]
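
The condition can be injected without disturbing the frozen backbone because of the zero-initialized convolutions described in [Zhang et al., 2023]. The toy sketch below replaces convolutions and U-Net blocks with scalar functions, so `ZeroLinear`, `frozen_block`, and `trainable_copy` are hypothetical stand-ins rather than the actual architecture. It only illustrates that the controlled block reproduces the frozen output exactly at initialization.

```python
class ZeroLinear:
    """Scalar stand-in for a zero-initialized 1x1 convolution."""

    def __init__(self):
        self.weight = 0.0
        self.bias = 0.0

    def __call__(self, h):
        return self.weight * h + self.bias


def frozen_block(x):
    """Hypothetical frozen pre-trained block; its parameters are never updated."""
    return 2.0 * x + 1.0


def trainable_copy(x):
    """Trainable copy of the block, initialized from the frozen weights."""
    return 2.0 * x + 1.0


def controlled_block(x, condition, zero_in, zero_out):
    """Frozen output plus a residual computed from the conditioned copy."""
    residual = zero_out(trainable_copy(x + zero_in(condition)))
    return frozen_block(x) + residual


zero_in, zero_out = ZeroLinear(), ZeroLinear()
# At initialization both zero layers output 0, so the condition has no effect.
print(controlled_block(3.0, 5.0, zero_in, zero_out) == frozen_block(3.0))  # True
```

Once training moves the zero-layer weights away from zero, the residual starts to carry the control signal while the frozen block stays untouched.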

Image Editing

Move from text-to-image \( p(x\mid c_T) \) to instruction-guided editing \( p(x\mid c_T, c_I) \), where \( c_T \) is the text instruction and \( c_I \) is the input image. InstructPix2Pix [Brooks et al., 2023] concatenates the input image latent with the noisy latent along the channel dimension, which adds fewer than 0.1M new parameters (the extra input channels of the first convolution).

InstructPix2Pix examples — Instruction-based image editing [Brooks et al., 2023].
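
As a rough check on the parameter count, the sketch below counts the weights that channel concatenation adds to the first convolution. The layer shape (4 latent channels in, 320 channels out, 3×3 kernel) is assumed from the Stable Diffusion v1 U-Net and is not stated in this post; the helper names are hypothetical.

```python
def conv_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in a 2D convolution, ignoring the bias."""
    return in_channels * out_channels * kernel_size * kernel_size


def concat_channels(noisy_latent, image_latent):
    """Concatenate two latents stored as lists of channels."""
    return noisy_latent + image_latent


# Assumed shapes (Stable Diffusion v1 first convolution).
LATENT_CHANNELS = 4
OUT_CHANNELS = 320
KERNEL_SIZE = 3

original = conv_weight_count(LATENT_CHANNELS, OUT_CHANNELS, KERNEL_SIZE)
expanded = conv_weight_count(2 * LATENT_CHANNELS, OUT_CHANNELS, KERNEL_SIZE)
new_weights = expanded - original

# Each channel is a placeholder string here; real latents are 2D feature maps.
unet_input = concat_channels(["z0", "z1", "z2", "z3"], ["c0", "c1", "c2", "c3"])

print(len(unet_input))  # 8 input channels after concatenation
print(new_weights)      # 11520 new weights, well under 0.1M
```

In the paper, the weights acting on the new channels are initialized to zero, so fine-tuning starts from the behavior of the pre-trained model.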

Image-to-Image Translation

Translate images across domains by learning a diffusion bridge between source and target distributions. DDBM [Zhou et al., 2024] generalizes score-based diffusion to diffusion bridges, processes that interpolate between paired source and target images given as endpoints.

DDBM bridge illustration — Denoising diffusion bridge from source to target image [Zhou et al., 2024].

Video Generation

Extend image generation to coherent temporal synthesis with efficient diffusion transformers. SANA-Video [Chen et al., 2026] uses a deep compression VAE for spatiotemporal latent modeling, together with block linear DiT and autoregressive block training for efficient text-to-video generation.

SANA-Video architecture — Deep compression VAE plus block linear diffusion transformer for efficient video generation [Chen et al., 2026].

Advanced Applications

Beyond standard synthesis: inversion, representation learning, and world modeling.

Inversion Task

Solve inverse problems with a pre-trained diffusion generative model—no further training required. DDRM [Kawar et al., 2022] restores images from degraded observations via diffusion priors.

DDRM restoration results — Diffusion-based restoration for super-resolution, deblurring, inpainting, and colorization [Kawar et al., 2022].

Representation Extractor

Use pre-trained diffusion models as representation extractors. DiffSeg [Tian et al., 2024] performs zero-shot segmentation from Stable Diffusion attention maps.

DiffSeg details — (a) Attention aggregation: lower-resolution attention maps are upsampled and duplicated to match higher-resolution receptive fields. (b) NMS: maximum activation across the \( L_p \) proposals for each pixel.
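
Step (b) amounts to a per-pixel argmax over the proposal maps. The following is a minimal sketch with nested lists; the function name `nms_labels` and the tiny 2×2 proposals are made up for illustration and do not come from [Tian et al., 2024].

```python
def nms_labels(proposals):
    """Assign each pixel the index of the proposal with the highest activation.

    proposals: list of L_p maps, each a list of rows with equal shape.
    Ties go to the proposal with the lowest index.
    """
    height, width = len(proposals[0]), len(proposals[0][0])
    labels = []
    for i in range(height):
        row = []
        for j in range(width):
            activations = [p[i][j] for p in proposals]
            row.append(activations.index(max(activations)))
        labels.append(row)
    return labels


proposals = [
    [[0.9, 0.2], [0.1, 0.3]],
    [[0.05, 0.7], [0.2, 0.3]],
    [[0.05, 0.1], [0.7, 0.4]],
]
print(nms_labels(proposals))  # [[0, 1], [2, 2]]
```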

Generalist Vision Learner

Build unified vision systems from large-scale generative pre-training. Vision Banana [Gabeur et al., 2026] shows image generators can be generalist vision learners for both generation and understanding.

Vision Banana generalist vision — A generalist vision model for generation and perception [Gabeur et al., 2026].

3D/4D Generation & World Modeling

Generate navigable spatiotemporal worlds and learn interactive environment models from video. YUME 1.5 [Mao et al., 2025] extends YUME [Mao et al., 2025] with text-controlled exploration, trained on Sekai [Li et al., 2025].

YUME interactive world generation — Text-controlled interactive world generation [Mao et al., 2025].

Takeaways

We start from an unconditional image generator \( p(x) \) and progressively add more conditions to it. Technically, this is representation alignment: new condition signals (text, segmentation maps, source images, etc.) are injected into a pre-trained denoiser.
Besides their powerful generative capability, deep generative models are also representation learners [Yang and Wang, 2023], often matching or even surpassing SSL methods such as MoCo on downstream recognition.
Joint generative modeling \( p(x,y) \) can natively unify generation and understanding via Bayes’ rule: marginalize to obtain \( p(x)=\int p(x,y)\,dy \) for synthesis, and derive \( p(y\mid x)=p(x,y)/p(x) \) for discriminative inference, without training a separate head. A more robust foundation model built on this principle remains under-explored.
We are still on the way toward a generalist, task-agnostic, any-to-any foundation model [Gabeur et al., 2026], [Zuo et al., 2025].
Building strong generative models is promising and already feasible, but open challenges remain in efficient training, sampling, alignment, strong generalization, and more.
Today’s visual generation can exceed human perceptual limits—how do we properly evaluate it?
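
The Bayes' rule takeaway can be made concrete with a toy discrete joint distribution, where the integral becomes a sum. This is a sketch only: the two "images" and two labels are hypothetical, and a real model would represent \( p(x,y) \) implicitly rather than as a table.

```python
# Toy joint distribution p(x, y) over two images and two labels (made-up numbers).
joint = {
    ("img_a", "cat"): 0.25,
    ("img_a", "dog"): 0.25,
    ("img_b", "cat"): 0.125,
    ("img_b", "dog"): 0.375,
}


def marginal_x(joint, x):
    """p(x) = sum over y of p(x, y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def posterior_y_given_x(joint, y, x):
    """p(y | x) = p(x, y) / p(x)."""
    return joint[(x, y)] / marginal_x(joint, x)


print(marginal_x(joint, "img_b"))                  # 0.5
print(posterior_y_given_x(joint, "dog", "img_b"))  # 0.75
```

Both quantities are read off the same joint table, which is the sense in which no separate discriminative head is needed.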

References

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, 33, 6840–6851.

Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. In Advances in Neural Information Processing Systems.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations.

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. In International Conference on Learning Representations.

Albergo, M., Boffi, N. M., & Vanden-Eijnden, E. (2025). Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. Journal of Machine Learning Research, 26(209), 1–80.

Lai, C.-H., Song, Y., Kim, D., Mitsufuji, Y., & Ermon, S. (2025). The Principles of Diffusion Models. arXiv:2510.21890.

Dhariwal, P., & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems, 34, 8780–8794.

Brock, A., Donahue, J., & Simonyan, K. (2019). Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695.

Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3813–3824.

Brooks, T., Holynski, A., & Efros, A. A. (2023). InstructPix2Pix: Learning To Follow Image Editing Instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402.

Zhou, L., Lou, A., Khanna, S., & Ermon, S. (2024). Denoising Diffusion Bridge Models. In International Conference on Learning Representations.

Chen, J., Zhao, Y., Yu, J., Chu, R., Chen, J., Yang, S., Wang, X., Pan, Y., Zhou, D., Ling, H., Liu, H., Yi, H., Zhang, H., Li, M., Chen, Y., Cai, H., Fidler, S., Luo, P., Han, S., & Xie, E. (2026). SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer. In International Conference on Learning Representations.

Kawar, B., Elad, M., Ermon, S., & Song, J. (2022). Denoising Diffusion Restoration Models. In Advances in Neural Information Processing Systems.

Tian, J., Aggarwal, L., Colaco, A., Kira, Z., & Gonzalez-Franco, M. (2024). Diffuse Attend and Segment: Unsupervised Zero-Shot Segmentation Using Stable Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3554–3563.

Gabeur, V., Long, S., Peng, S., Voigtlaender, P., Sun, S., Bao, Y., Truong, K., Wang, Z., Zhou, W., Barron, J. T., Genova, K., Kannen, N., Ben, S., Li, Y., Guo, M., Yogin, S., Gu, Y., Chen, H., Wang, O., Xie, S., Zhou, H., He, K., Funkhouser, T., Alayrac, J.-B., & Soricut, R. (2026). Image Generators Are Generalist Vision Learners. arXiv:2604.20329.

Mao, X., Li, Z., Li, C., Xu, X., Ying, K., He, T., Pang, J., Qiao, Y., & Zhang, K. (2025). Yume-1.5: A Text-Controlled Interactive World Generation Model. arXiv:2512.22096.

Mao, X., Lin, S., Li, Z., Li, C., Peng, W., He, T., Pang, J., Chi, M., Qiao, Y., & Zhang, K. (2025). Yume: An Interactive World Generation Model. arXiv:2507.17744.

Li, Z., Li, C., Mao, X., Lin, S., Li, M., Zhao, S., Pan, X. Z., Li, X., Feng, Y., Sun, J., Li, Z., Zhang, F., Ai, J., Wang, Z., Wu, Y., He, T., Jia, Y., & Zhang, K. (2025). Sekai: A Video Dataset towards World Exploration. In NeurIPS Datasets and Benchmarks Track.

Yang, X., & Wang, X. (2023). Diffusion Model as Representation Learner. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18938–18949.

Zuo, J., Deng, H., Zhou, H., Zhu, J., Zhang, Y., Zhang, Y., Yan, Y., Huang, K., Chen, W., Deng, Y., Jin, R., Sang, N., & Gao, C. (2025). Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets. arXiv:2512.15110.

Contact: bili_sakura@zju.edu.cn