The Evolution of Visual Generative Foundation Models

About this keynote: This post summarizes a slide deck on the evolution of visual generative foundation models, with emphasis on diffusion-family methods validated on ImageNet class-conditional generation.

TL;DR

Visual generation moved from convolution-heavy U-Net pipelines to transformer-first denoisers, and sample quality now tracks scale (data, compute, and model capacity) more predictably than it tracks handcrafted tricks. Latent diffusion made high-resolution generation practical; DiT and its descendants made scaling cleaner; and recent methods such as NiT show that native-resolution training and inference can outperform fixed-resolution pipelines under similar compute budgets.

Timeline of diffusion model perspectives.

Generative Model Family

The broad landscape of deep generative modeling includes autoregressive models, variational methods, GANs, and the diffusion, score-based, and flow families. This keynote is intentionally narrower in scope: diffusion-style generative foundation models for visual synthesis, primarily benchmarked on ImageNet.

Overview of representative generative model types.

Overview of deep generative modeling in the diffusion era.

Two empirical signals motivate this focus:

The research community shifted from GAN-dominant generation to diffusion-dominant generation.
A practical “universal pipeline” emerged around diffusion-style noising/denoising and conditioning.
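As a rough illustration of the noising half of that pipeline, the sketch below implements the standard DDPM-style closed-form forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The linear beta schedule, the 1000-step horizon, and the function names are assumptions chosen for illustration; they are not taken from the slides.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule (a common default, not from the keynote)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in closed form and return the noise used."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention per step

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 32, 32))    # stand-in for an image or latent
x_t, eps = forward_noise(x0, 500, alpha_bar, rng)
```

A denoiser is then trained to recover eps (or an equivalent target) from x_t, the timestep, and any conditioning signal.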

Research trend: GANs vs diffusion models.

A universal diffusion-based visual content generation pipeline.

Latent Generative Foundation Models

Latent Diffusion Models (LDM)

LDM introduced a crucial systems idea: run diffusion in compressed latent space instead of pixel space, then decode back to image space. This reduces compute while preserving visual quality, enabling practical high-resolution synthesis.
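To make the compute saving concrete, here is a small arithmetic sketch. It assumes an autoencoder with spatial downsampling factor 8 and 4 latent channels, a common configuration for latent diffusion; the helper names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, downsample=8, latent_channels=4, image_channels=3):
    """How many times fewer values the denoiser sees in latent space."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)

shape_256 = latent_shape(256, 256)       # (4, 32, 32)
ratio_256 = compression_ratio(256, 256)  # 196608 / 4096 = 48.0
```

Under these assumptions the denoiser processes 48 times fewer values per sample than a pixel-space model at 256x256, which is where most of the practical saving comes from.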

Latent Diffusion Model architecture — LDM architecture: diffusion in latent space with a learned autoencoder.

LDM ImageNet 256 benchmark results — ImageNet-256 results from the LDM paper.

Diffusion Transformer (DiT)

DiT replaced the U-Net denoiser with a transformer backbone that operates on patches of the latent, and it injects conditioning (timestep and class label) through adaptive layer normalization (adaLN). This change unlocked smoother scaling with model size and training compute.
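A minimal NumPy sketch of adaptive-normalization conditioning follows, in the spirit of the adaLN-Zero variant described in the DiT paper. The conditioning vector is projected to a shift, a scale, and a gate; the weight shapes, the single-sublayer block, and the function names are simplifying assumptions rather than the reference implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the feature axis, without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(x, cond, w, b):
    """Regress shift, scale, and gate from the conditioning vector."""
    shift, scale, gate = np.split(cond @ w + b, 3)
    return layer_norm(x) * (1.0 + scale) + shift, gate

def adaln_block(x, cond, w, b, sublayer):
    """Residual block whose sublayer output is gated by the conditioning."""
    h, gate = adaln_modulate(x, cond, w, b)
    return x + gate * sublayer(h)

dim, cond_dim, tokens = 16, 8, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((tokens, dim))
cond = rng.standard_normal(cond_dim)     # e.g. timestep + class embedding

# Zero-initialized projection: the gate is 0, so the block starts as identity.
w = np.zeros((cond_dim, 3 * dim))
b = np.zeros(3 * dim)
out = adaln_block(x, cond, w, b, sublayer=np.tanh)
```

Zero-initializing the projection makes each block an identity map at the start of training, which is the motivation usually given for the "Zero" in adaLN-Zero.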

Diffusion Transformer architecture — DiT architecture and conditioning design.

DiT's key insight is not only the architecture swap but also its scaling behavior: sample quality correlates strongly with forward-pass FLOPs.
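One way to see why forward-pass compute is such a convenient knob: for a fixed latent size, the patch size sets the token count, and the token count drives per-layer cost. The sketch below uses a rough multiply-accumulate estimate for a standard transformer layer; the formula is a back-of-the-envelope approximation, and the width of 1152 is an assumed DiT-XL-like setting, not a figure from the slides.

```python
def num_tokens(latent_size, patch_size):
    """Number of tokens for a square latent split into square patches."""
    if latent_size % patch_size:
        raise ValueError("latent size must be divisible by patch size")
    return (latent_size // patch_size) ** 2

def approx_layer_macs(tokens, dim, mlp_ratio=4):
    """Rough multiply-accumulate count for one transformer layer."""
    proj = 4 * tokens * dim * dim              # Q, K, V, and output projections
    attn = 2 * tokens * tokens * dim           # QK^T and attention-weighted V
    mlp = 2 * mlp_ratio * tokens * dim * dim   # two MLP matmuls
    return proj + attn + mlp

costs = {}
for patch in (8, 4, 2):
    t = num_tokens(32, patch)                  # 32x32 latent, as for 256x256 images
    costs[patch] = (t, approx_layer_macs(t, dim=1152))
    print(f"patch={patch} tokens={t} approx_macs_per_layer={costs[patch][1]:.3e}")
```

Halving the patch size quadruples the token count, so compute can be raised without adding parameters, which is one axis along which the DiT scaling plots vary.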

DiT scaling trend between GFLOPs and FID — DiT scaling: GFLOPs and sample quality trends.

DiT sample quality scaling trend — Higher transformer compute generally improves sample quality.

DiT scaling bubble chart — Scaling visualization across model size, compute, and quality.

SiT: Interpolant-Based Scaling

Scalable Interpolant Transformers (SiT) keep a DiT-like backbone but reframe training under stochastic interpolants, unifying diffusion and flow perspectives. At matched scales, SiT variants improve FID over corresponding DiT baselines.
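The interpolant view can be sketched in a few lines. The example assumes the linear interpolant x_t = (1 - t) * x + t * eps, with t = 0 at data and t = 1 at noise, and the corresponding velocity target eps - x; this is one of several interpolant choices, and the function name is hypothetical.

```python
import numpy as np

def linear_interpolant(x, eps, t):
    """Return x_t and the velocity target d(x_t)/dt for a linear interpolant."""
    alpha, sigma = 1.0 - t, t          # signal and noise coefficients
    d_alpha, d_sigma = -1.0, 1.0       # their time derivatives
    x_t = alpha * x + sigma * eps
    velocity = d_alpha * x + d_sigma * eps
    return x_t, velocity

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32, 32))   # stand-in for a clean latent
eps = rng.standard_normal(x.shape)     # Gaussian noise sample
x_t, v_target = linear_interpolant(x, eps, t=0.3)
```

A velocity-prediction model is trained to regress v_target from x_t and t; sampling then integrates the learned velocity field from noise back to data, either as an ODE or with added stochasticity.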

SiT XL generated samples — Samples from SiT-XL models at different resolutions.

FiT and FiTv2: Flexible Resolution Transformers

FiT extends transformer diffusion to flexible aspect ratios and resolutions without retraining a separate model for each fixed size. The recipe treats an image as a variable-length token sequence and combines flexible tokenization, padding with masked attention, and positional encodings that extrapolate across resolutions (FiT uses 2D rotary embeddings).
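The padded, masked-attention strategy can be sketched as follows. This is an illustrative single-head NumPy version with hypothetical helper names, not the FiT implementation: variable-length token sequences are padded to a shared maximum length, and padded keys are excluded from the softmax.

```python
import numpy as np

def pad_and_mask(seqs, max_len):
    """Pad (n_i, dim) token arrays to (batch, max_len, dim) plus a validity mask."""
    dim = seqs[0].shape[1]
    batch = np.zeros((len(seqs), max_len, dim))
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        n = s.shape[0]
        if n > max_len:
            raise ValueError("sequence longer than max_len")
        batch[i, :n] = s
        mask[i, :n] = True
    return batch, mask

def masked_attention(q, k, v, mask):
    """Single-head attention in which padded keys receive zero weight."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = np.where(mask[:, None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seqs = [rng.standard_normal((n, 8)) for n in (3, 5)]   # two "images", different sizes
batch, mask = pad_and_mask(seqs, max_len=5)
out = masked_attention(batch, batch, batch, mask)
```

Outputs at padded query positions are meaningless and would also be excluded from the loss; only rows where the mask is true correspond to real tokens.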

Flexible Vision Transformer architecture overview — FiT/FiTv2 flexible training and inference design.

NiT: Native-Resolution Image Synthesis

NiT pushes beyond padded flexible-resolution setups: it processes native, heterogeneous token lengths directly. This reduces waste from padding and improves quality at high, non-square, or long-tail resolutions.
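To illustrate the padding waste that native-length processing avoids, the sketch below packs sequences of different lengths into one flat token array with cumulative offsets, in the style used by variable-length attention kernels, and builds the equivalent block-diagonal mask. The token counts and helper names are made up for illustration, and this is not taken from the NiT code.

```python
import numpy as np

def pack_sequences(seqs):
    """Concatenate (n_i, dim) arrays and return cumulative sequence offsets."""
    lengths = np.array([s.shape[0] for s in seqs])
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)])
    return np.concatenate(seqs, axis=0), cu_seqlens

def block_diagonal_mask(cu_seqlens):
    """Boolean mask that lets tokens attend only within their own image."""
    segment = np.repeat(np.arange(len(cu_seqlens) - 1), np.diff(cu_seqlens))
    return segment[:, None] == segment[None, :]

def padding_waste(lengths):
    """Fraction of tokens that would be padding if padded to the longest sequence."""
    lengths = np.asarray(lengths)
    return 1.0 - lengths.sum() / (len(lengths) * lengths.max())

rng = np.random.default_rng(0)
lengths = [256, 512, 96]                         # hypothetical per-image token counts
seqs = [rng.standard_normal((n, 8)) for n in lengths]
packed, cu_seqlens = pack_sequences(seqs)
attn_mask = block_diagonal_mask(cu_seqlens)
waste = padding_waste(lengths)
```

With these example lengths, padding every sequence to 512 tokens would spend a large share of the batch on padding, while the packed layout carries only real tokens. In practice a fused variable-length attention kernel would consume the offsets directly rather than materializing the dense mask.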

Native-resolution Transformer architecture — NiT architecture for native-resolution diffusion transformers.

ImageNet resolution distribution motivates native-resolution processing.

NiT native-resolution synthesis samples — Native-resolution synthesis examples from NiT.

NiT benchmark and synthesis summary slide — NiT benchmark summary: stronger high-resolution extrapolation under constrained token budgets.

Beyond Standard Attention: DiM, DiG, and DiT-MoE

Current work also explores alternative backbone primitives and scaling strategies:

DiM and DiG study state-space and gated-linear-attention style backbones for efficiency.
DiT-MoE uses sparse mixture-of-experts blocks to scale parameter count aggressively while managing active compute.
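The sparse-expert idea can be sketched with generic top-k routing: each token is sent to only k of the available experts, so parameter count grows with the number of experts while per-token compute stays roughly fixed. The single-matrix "experts", the router shape, and the function names below are simplifying assumptions, not the DiT-MoE design.

```python
import numpy as np

def top_k_route(x, w_router, k=2):
    """Pick the k highest-scoring experts per token and softmax their scores."""
    logits = x @ w_router                                   # (tokens, experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return top_idx, weights

def moe_forward(x, w_router, experts, k=2):
    """Weighted sum of the selected experts' outputs for every token."""
    top_idx, weights = top_k_route(x, w_router, k)
    out = np.zeros_like(x)
    for e, w_e in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)          # tokens routed to expert e
        if token_ids.size == 0:
            continue
        out[token_ids] += weights[token_ids, slot][:, None] * (x[token_ids] @ w_e)
    return out

rng = np.random.default_rng(0)
tokens, dim, num_experts = 12, 16, 4
x = rng.standard_normal((tokens, dim))
w_router = rng.standard_normal((dim, num_experts))
experts = [rng.standard_normal((dim, dim)) for _ in range(num_experts)]
y = moe_forward(x, w_router, experts, k=2)
```

Real systems typically add load-balancing terms and capacity limits so that tokens spread evenly across experts; those are omitted here.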

DiM architecture overview — DiM: diffusion with a Mamba-style backbone.

DiG architecture overview — DiG: diffusion with gated linear attention.

DiT-MoE architecture overview — DiT-MoE: sparse expert scaling for large diffusion transformers.

Pixel Generative Foundation Models

Pixel-space generative foundation models are undergoing a renewed wave of progress, but this keynote version focuses on latent-first and latent-tokenized transformer lines. A complete pixel-space chapter (e.g., end-to-end pixel DiT variants) is planned as a follow-up update.

Takeaways

Latent modeling was the practical bridge from academic diffusion to large-scale generative foundation models.
Transformer denoisers are now the dominant scaling path because they map cleanly onto modern accelerators and handle long-range dependencies well.
Resolution flexibility is a first-class requirement: FiT/FiTv2 and especially NiT show gains from handling native aspect ratios and token lengths.
The design space is expanding beyond dense self-attention, including interpolant formulations, linear/gated attention, SSM-style blocks, and MoE scaling.
Future competition is likely about efficiency-quality trade-offs at scale, not only absolute benchmark wins.

References

Weng, L. (2021). What Are Diffusion Models? lilianweng.github.io.

Lai, C.-H., Song, Y., Kim, D., Mitsufuji, Y., & Ermon, S. (2025). The Principles of Diffusion Models. arXiv:2510.21890.

Ma, Z., Zhang, Y., Jia, G., et al. (2025). Efficient Diffusion Models: A Comprehensive Survey From Principles to Practices. IEEE TPAMI.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis With Latent Diffusion Models. CVPR.

Esser, P., Rombach, R., & Ommer, B. (2021). Taming Transformers for High-Resolution Image Synthesis. CVPR.

Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV.

Ma, N., Goldstein, M., Albergo, M. S., et al. (2024). SiT: Exploring Flow and Diffusion-Based Generative Models with Scalable Interpolant Transformers. ECCV.

Lu, Z., Wang, Z., Huang, D., et al. (2024). FiT: Flexible Vision Transformer for Diffusion Model. ICML.

Wang, Z., Lu, Z., Huang, D., et al. (2024). FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model. arXiv:2410.13925.

Wang, Z., Bai, L., Yue, X., Ouyang, W., & Zhang, Y. (2025). Native-Resolution Image Synthesis. NeurIPS.

Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691.

Teng, Y., Wu, Y., Shi, H., et al. (2024). DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis. arXiv:2405.14224.

Zhu, L., Huang, Z., Liao, B., et al. (2025). DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention. CVPR.

Fei, Z., Fan, M., Yu, C., Li, D., & Huang, J. (2024). Scaling Diffusion Transformers to 16 Billion Parameters. arXiv:2407.11633.