TL;DR
As diffusion systems scale, the biggest wins tend to come from leveraging compute with broad, general methods rather than from hand-crafting ever more specific tricks.
Preliminaries: Diffusion Models for Image Generation
Diffusion models have emerged as a powerful paradigm for generative modeling by learning to reverse a gradual noise corruption process. The fundamental approach involves two key stages: a forward diffusion process that systematically adds noise to data until it becomes pure Gaussian noise, and a reverse denoising process where a neural network gradually removes this noise to generate new samples.
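For concreteness, here is a minimal sketch of both stages in PyTorch-style Python: the closed-form forward noising step and a single DDPM-style reverse update. The `denoiser` network and the linear beta schedule are illustrative assumptions rather than any specific released model.

```python
import torch

# Illustrative linear beta schedule (assumed values, not tied to a specific model).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def forward_noise(x0, t, noise):
    """Forward process in closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

@torch.no_grad()
def reverse_step(denoiser, xt, t):
    """One DDPM reverse update, assuming `denoiser(x_t, t)` predicts the added noise."""
    eps_hat = denoiser(xt, t)
    ab, a, b = alpha_bars[t], alphas[t], betas[t]
    mean = (xt - b / (1.0 - ab).sqrt() * eps_hat) / a.sqrt()
    if t == 0:
        return mean                                  # no noise added at the final step
    return mean + b.sqrt() * torch.randn_like(xt)    # sigma_t^2 = beta_t variance choice
```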
This framework has demonstrated remarkable success across diverse domains including image generation, audio synthesis, video generation, and even applications in natural language processing and molecular design. The generality of the diffusion framework makes it particularly attractive for complex generative tasks.
For readers seeking a comprehensive introduction to diffusion model fundamentals, we recommend Yang Song’s excellent exposition on score-based generative modeling.
The U-Net Era
The early pioneering works in diffusion-based image generation predominantly adopted U-Net architectures.
The foundational models in this era established the core principles of diffusion-based generation. NCSN (Noise Conditional Score Network) learns the score of data perturbed at multiple noise scales and samples with annealed Langevin dynamics, establishing the score-based view that later diffusion models build on.
The breakthrough for text-to-image generation came with LDM (Latent Diffusion Models, also known as Stable Diffusion), which runs the diffusion process in a compressed VAE latent space and injects text conditioning through cross-attention, making high-resolution text-to-image synthesis practical at scale.
| Model | Gen. (#Param) | Txt. (#Param) | Total (#Param) | Release Date |
|---|---|---|---|---|
| SD v2.1 | 0.87B | 0.34B | 1.29B | 2022-12-07 |
| Kandinsky | 1.23B | 0.56B | 1.86B | 2023-01-01 |
| UniDiffuser | 0.95B | 0.12B | 1.25B | 2023-05-12 |
| SDXL | 2.57B | 0.82B | 3.47B | 2023-06-25 |
| Kandinsky 3 | 3.06B | 8.72B | 12.05B | 2023-12-11 |
| Stable Cascade (Würstchen) | 1.56B | 0.69B | 2.28B | 2024-02-07 |
The standard U-Net architecture for diffusion models typically consists of an encoder that progressively downsamples the noisy input, a bottleneck middle block that processes compressed representations, and a decoder that upsamples back to the original resolution. Crucially, skip connections preserve fine-grained spatial information across corresponding encoder and decoder stages.
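The layout can be made concrete with a toy PyTorch module; channel widths are arbitrary and the timestep embedding is omitted for brevity, whereas real diffusion U-Nets add residual blocks, attention, and conditioning at every stage.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-bottleneck-decoder U-Net with a skip connection."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Conv2d(3, ch, 3, padding=1)                        # full-resolution features
        self.enc2 = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)         # downsample (encoder)
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)                # bottleneck
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)  # upsample (decoder)
        self.dec = nn.Conv2d(ch * 2, 3, 3, padding=1)                     # consumes skip + upsampled
        self.act = nn.SiLU()

    def forward(self, x):
        h1 = self.act(self.enc1(x))                    # kept for the skip connection
        h2 = self.act(self.mid(self.act(self.enc2(h1))))
        u = self.act(self.up(h2))
        return self.dec(torch.cat([u, h1], dim=1))     # skip preserves fine spatial detail

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])
```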
Non-text conditioning: two-stage cascades and U‑ViT
Beyond text conditioning, U-Net backbones evolved substantially in non-text settings (unconditional or class-conditional), focusing on sample quality, stability, and compute efficiency:
- Two-stage/cascaded U-Nets: Decompose generation into a low-resolution base diffusion model and one or more super-resolution diffusion upsamplers. The base model captures global structure; specialized upsamplers (e.g., SR3) iteratively refine detail at higher resolutions. This cascade improves fidelity and stability on ImageNet and face datasets while keeping training tractable (a minimal sketch of how the two stages compose appears after the key takeaways below).
- U‑ViT (ViT backbone in a U‑shaped design): Replace CNN residual blocks with Vision Transformer blocks while retaining long-range skip connections. U‑ViT tokenizes noisy image patches, timesteps, and (optionally) class tokens, enabling stronger global context modeling than CNN U‑Nets and achieving competitive ImageNet FID with comparable compute.
Key takeaways (non-text U-Net family)
- Cascades separate global structure (base) from high-frequency detail (super‑res), scaling quality to high resolutions efficiently.
- ViT backbones in U‑shaped layouts preserve inductive benefits of skip connections while capturing long-range dependencies.
- These ideas later influenced text-to-image systems (e.g., two‑stage SDXL) even as the field transitioned toward DiT backbones.
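As referenced above, here is a minimal sketch of how a two-stage cascade composes at sampling time, assuming hypothetical `base_sample` and `sr_denoise_loop` callables for the base model and the conditional super-resolution chain; names and shapes are illustrative, not a specific codebase.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cascaded_sample(base_sample, sr_denoise_loop, batch=1, low=64, high=256):
    """Two-stage cascade: the base model fixes global structure, the SR model adds detail.

    base_sample(shape) -> low-resolution images in [-1, 1]
    sr_denoise_loop(x_t, cond) -> runs the SR diffusion chain conditioned on `cond`
    Both callables are assumed to exist; only their composition is shown here.
    """
    x_low = base_sample((batch, 3, low, low))                    # stage 1: global structure
    cond = F.interpolate(x_low, size=(high, high),
                         mode="bilinear", align_corners=False)   # upsampled conditioning signal
    x_t = torch.randn(batch, 3, high, high)                      # stage 2 starts from noise
    return sr_denoise_loop(x_t, cond)                            # iteratively refine detail
```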
The DiTs Era
As U-Net–based models began to hit a scaling ceiling (e.g., SDXL with ~2.6B generator parameters), the field shifted toward Diffusion Transformer (DiT) backbones, which have proven to scale more gracefully with parameters, data, and compute.
| Model | Gen. (#Param) | Txt. (#Param) | Total (#Param) | Release Date |
|---|---|---|---|---|
| PixArt-$\alpha$ | 0.61B | 4.76B | 5.46B | 2023/10/06 |
| Lumina-T2I | ~4.7B | ~7B | ~15B | 2024/04/01 |
| PixArt-$\Sigma$ | 0.61B | 4.76B | 5.46B | 2024/04/11 |
| Lumina-Next-T2I | 1.75B | 2.51B | 4.34B | 2024/05/12 |
| Stable Diffusion 3 | 2.03B | 5.58B | 7.69B | 2024/06/12 |
| Flux.1-Dev | 11.90B | 4.88B | 16.87B | 2024/08/02 |
| CogView3-Plus | 2.85B | 4.76B | 8.02B | 2024/10/13 |
| Hunyuan-DiT | 1.50B | 2.02B | 3.61B | 2024/12/01 |
| SANA | 0.59B | 2.61B | 3.52B | 2025/01/11 |
| Lumina-Image 2.0 | 2.61B | 2.61B | 5.31B | 2025/01/22 |
| SANA 1.5 | 1.60B | 2.61B | 4.53B | 2025/03/21 |
| HiDream-I1-Dev | 17.11B | 5.58B | 22.77B | 2025/04/06 |
| CogView4-6B | 3.50B | 2.00B | 6.00B | 2025/05/03 |
| Qwen-Image | 20.43B | 8.29B | 28.85B | 2025/08/04 |
Latest Advancement in U-Net and DiT Architecture Design
While the transition from U-Net to DiT architectures represents a major paradigm shift, both architectural families have continued to evolve with innovative refinements. In the U-Net domain, two-stage cascaded approaches such as Stable Cascade (Würstchen) continue to refine the split between base generation and super-resolution refinement.
The DiT family has seen rapid advances across multiple dimensions. Architecture variants include SiT (Scalable Interpolant Transformer), which replaces diffusion with interpolant-based transport for improved stability, and LiT (Linear Diffusion Transformer), which swaps quadratic self-attention for linear attention to reduce cost at high resolution.
Pre-trained Text-to-Image Checkpoints
The landscape of pre-trained text-to-image models has evolved dramatically since the introduction of Stable Diffusion. These models serve as powerful foundation models that can be adapted for specialized downstream tasks without architectural modifications, simply by fine-tuning on domain-specific datasets.
Interactive Architecture Explorer
U-Net Family
- Stable Diffusion
- Stable Diffusion XL (SDXL)
- Kandinsky
- Stable Cascade (based on Würstchen architecture)
- UniDiffuser

DiT Family
PixArt-$\alpha$ (2023/10/06)
PixArt-$\alpha$ is motivated by the rising compute and environmental costs of text-to-image systems, seeking near-commercial quality with a much smaller training budget.
Architecturally, PixArt-$\alpha$ is a latent Diffusion Transformer (DiT): VAE latents are patchified into a token sequence processed by stacked Transformer blocks; each block applies cross-attention to text tokens, and timestep conditioning is injected via a shared adaLN-single, simplifying parameters and conditioning pathways.
- Transformer sequence-of-patches backbone (no encoder–decoder or skip connections)
- Shared adaLN for time and unified per-block cross-attention (vs U-Net residual blocks with per-block time MLP/spatial injections)
- T5 text encoder (LLM) rather than CLIP/OpenCLIP
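A rough sketch of one such block, assuming pre-computed T5 text tokens and a single shared timestep MLP whose output `t_params` is reused by every block; the adaLN-single parameterization below is a simplified reading of the paper, not the reference implementation.

```python
import torch
import torch.nn as nn

class PixArtStyleBlock(nn.Module):
    """Simplified DiT block: self-attention, cross-attention to text, MLP, adaLN-single."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        # adaLN-single: the timestep MLP is shared across all blocks; each block only
        # learns a small per-block offset to the shared (scale, shift, gate) parameters.
        self.scale_shift_table = nn.Parameter(torch.zeros(6, dim))

    def forward(self, x, text_tokens, t_params):
        # x: (B, N, dim) image tokens; text_tokens: (B, M, dim); t_params: (B, 6, dim).
        s1, b1, g1, s2, b2, g2 = (self.scale_shift_table[None] + t_params).chunk(6, dim=1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]
        x = x + self.cross_attn(self.norm2(x), text_tokens, text_tokens, need_weights=False)[0]
        h = self.norm3(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)
```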
Lumina-T2I (2024/04/01)
Lumina-T2I is the first entry in the Lumina series from Shanghai AI Lab, aiming for a simple, scalable framework that supports flexible resolutions while maintaining photorealism. Building on the Sora insight that scaling Diffusion Transformers enables generation across arbitrary aspect ratios and durations yet lacks concrete implementation details, Lumina-T2I adopts flow matching to stabilize and accelerate training.
Architecturally, Lumina-T2I uses a Flow-based Large Diffusion Transformer (Flag-DiT) with zero-initialized attention and RoPE for relative positional encoding.
- Robust resolution generalization across 512²–1792²
- Uses one-dimensional RoPE, [nextline] token, and layerwise relative position injection
- PixArt-α uses absolute positional embeddings limited to the initial layer, degrading at out-of-distribution scales
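A minimal sketch of one-dimensional RoPE applied to a query or key tensor; the interleaved channel layout and base frequency follow the common convention and are not tied to the Lumina codebase.

```python
import torch

def rope_1d(x, base=10000.0):
    """Apply 1-D rotary position embedding to x of shape (batch, seq, dim), dim even.

    Pairs of channels are rotated by an angle proportional to the token position,
    so attention logits depend on relative rather than absolute positions.
    """
    b, n, d = x.shape
    pos = torch.arange(n, dtype=torch.float32)                          # token positions
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # per-pair frequencies
    angles = pos[:, None] * freqs[None, :]                              # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                 # interleaved channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                                # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(2, 16, 64)
print(rope_1d(q).shape)  # torch.Size([2, 16, 64])
```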
PixArt-$\Sigma$ (2024/04/11)
PixArt-$\Sigma$ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-$\Sigma$’s capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
Architecturally, PixArt-$\Sigma$ maintains the same DiT backbone as PixArt-$\alpha$ but introduces efficient token compression through a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. The model incorporates superior-quality image data paired with more precise and detailed image captions, along with data curriculum strategies for improved training effectiveness.
- Efficient token compression via novel attention module compressing keys and values
- Superior-quality training data with more precise and detailed captions
- Data curriculum strategies for improved training effectiveness
- 4K image generation capability for high-resolution content creation
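To illustrate why compressing keys and values pays off at high resolution, here is a sketch that uses simple average pooling as the compressor; the actual PixArt-$\Sigma$ module uses a learned convolutional compressor inside the attention block, so treat this only as the shape of the idea.

```python
import torch
import torch.nn.functional as F

def kv_compressed_attention(q, k, v, ratio=4):
    """Self-attention where keys/values are spatially pooled before attending.

    q, k, v: (batch, seq, dim) tokens from a square latent grid. Pooling K and V by
    `ratio` per side shrinks the attention matrix from N x N to N x (N / ratio^2),
    which is what makes attention affordable at ultra-high resolutions.
    """
    b, n, d = k.shape
    side = int(n ** 0.5)
    k_c = F.avg_pool2d(k.transpose(1, 2).reshape(b, d, side, side), ratio)
    v_c = F.avg_pool2d(v.transpose(1, 2).reshape(b, d, side, side), ratio)
    k_c = k_c.flatten(2).transpose(1, 2)   # (b, n / ratio^2, d)
    v_c = v_c.flatten(2).transpose(1, 2)
    return F.scaled_dot_product_attention(q, k_c, v_c)

q = k = v = torch.randn(1, 64 * 64, 32)
print(kv_compressed_attention(q, k, v).shape)  # torch.Size([1, 4096, 32])
```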
Lumina-Next-T2I (2024/05/12)
Lumina-Next-T2I is the second-generation text-to-image model in the Lumina series, targeting faster and more stable training and inference than Lumina-T2I.
Architecturally, Lumina-Next introduces the Next-DiT backbone with 3D RoPE and Frequency- and Time-Aware Scaled RoPE for robust resolution extrapolation.
- Next-DiT with 3D RoPE + frequency/time-aware scaling for stronger resolution extrapolation
- Sandwich normalizations improve stability; sigmoid time schedule reduces sampling steps
- Context Drop merges redundant tokens for faster inference throughput
- Decoder-only LLM text encoders (Gemma-2B by default; Qwen-1.8B/InternLM-7B optional) boost zero-shot multilingual alignment vs CLIP/T5
Stable Diffusion 3 (2024/06/12)
Stable Diffusion 3 aims to improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales, demonstrating superior performance compared to established diffusion formulations for high-resolution text-to-image synthesis.
Architecturally, SD3 transitions from DiT’s cross-attention blocks to MMDiT (Multimodal Diffusion Transformer) with double-stream blocks that use separate weights for the two modalities, enabling bidirectional flow of information between image and text tokens for improved text comprehension and typography. Unlike SDXL which relies primarily on CLIP encoders, SD3 incorporates both CLIP (L/14 and OpenCLIP bigG/14) and T5-XXL encoders.
- MMDiT double-stream architecture with separate weights per modality and bidirectional information flow (vs single-stream cross-attention)
- Integrated rectified flow training with perceptually-biased noise sampling (vs standard diffusion formulation)
- Combined CLIP + T5-XXL text encoding for enhanced text comprehension and typography
- First comprehensive scaling study demonstrating predictable trends for text-to-image DiTs
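A compact sketch of a rectified-flow training step on the straight data-to-noise path, with logit-normal timestep sampling standing in for the perceptually biased noise sampling the paper advocates; `model` is a placeholder assumed to predict velocity.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0):
    """One rectified-flow training step along the straight path from data to noise.

    model(x_t, t) is assumed to predict the velocity (eps - x0). Timesteps are drawn
    logit-normally, biasing training toward mid-range t in the spirit of SD3's
    perceptually weighted noise sampling.
    """
    b = x0.shape[0]
    t = torch.sigmoid(torch.randn(b, device=x0.device))   # logit-normal samples in (0, 1)
    eps = torch.randn_like(x0)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * eps                       # linear interpolation path
    v_target = eps - x0                                    # constant velocity along the path
    return F.mse_loss(model(x_t, t), v_target)
```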
Flux.1-Dev (2024/08/02)
Flux.1-Dev, developed by former Stability AI core members, aims to scale beyond previous models and achieve superior image quality with more accurate text-to-image synthesis.
Architecturally, Flux.1-Dev advances beyond SD3’s MMDiT by implementing a hybrid architecture that combines both single-stream and double-stream Multi-Modal Diffusion Transformers, enhancing the model’s ability to process complex visual-textual relationships. Like SD3, it incorporates T5 text encoding alongside a CLIP encoder.
- Hybrid single-stream + double-stream MMDiT architecture (vs purely double-stream MMDiT)
- Massive scaling to 12B generator + 4.7B text encoder parameters (vs smaller SD3 variants)
- Enhanced rectified flow implementation optimized for larger scale training
- Comprehensive scaling study specifically designed for multi-billion parameter DiTs
CogView3 & CogView3-Plus (2024/10/13)
CogView3 introduces a relay diffusion framework in which a base stage first generates a lower-resolution image and a relay super-resolution stage then refines it, improving both quality and generation speed over single-stage models.
CogView3-Plus upgrades to DiT architecture with Zero-SNR scheduling and joint text-image attention for further efficiency gains. This architectural evolution represents a significant step in the CogView series, transitioning from traditional approaches to transformer-based diffusion models while maintaining the efficiency advantages of the relay diffusion framework.
Hunyuan-DiT (2024/12/01)
Hunyuan-DiT, developed by Tencent’s Hunyuan team, aims to create a powerful multi-resolution diffusion transformer capable of fine-grained understanding of both English and Chinese languages, addressing the need for state-of-the-art Chinese-to-image generation with culturally relevant and multilingual capabilities.
Architecturally, Hunyuan-DiT builds upon PixArt-$\alpha$ by incorporating both single-stream and double-stream Multi-Modal Diffusion Transformer (MM-DiT) blocks similar to SD3, enabling efficient handling of complex image generation tasks across multiple resolutions. The model integrates dual text encoders—CLIP for understanding overall semantic content and T5 for fine-grained, nuanced language understanding.
- Single-stream + double-stream MM-DiT blocks for enhanced multi-modal processing (vs single-stream cross-attention)
- Dual text encoders (CLIP + T5) for semantic and nuanced language understanding (vs T5 only)
- Multi-resolution diffusion transformer with enhanced positional encoding for robust resolution handling
- Multimodal LLM-refined captions with fine-grained bilingual (English + Chinese) understanding
SANA (2025/01/11)
SANA, developed by NVIDIA, aims to enable efficient high-resolution image synthesis up to 4096×4096 pixels while maintaining deployment feasibility on consumer hardware, generating 1024×1024 images in under a second on a 16GB laptop GPU.
Architecturally, SANA advances beyond PixArt-$\Sigma$ by replacing traditional self-attention mechanisms with Linear Diffusion Transformer (Linear DiT) blocks, enhancing computational efficiency at high resolutions without compromising quality. The model adopts a decoder-only small language model as the text encoder, employing complex human instructions with in-context learning to improve text-image alignment compared to conventional CLIP or T5 encoders. The compact 0.6B parameter model achieves competitive performance with substantially larger models like Flux-12B while being 20 times smaller and over 100 times faster in throughput.
- Linear DiT replacing traditional self-attention for O(n) complexity vs O(n²) at high resolutions
- DC-AE with 32× compression reducing latent tokens and memory requirements dramatically
- Decoder-only language model as text encoder with in-context learning (vs T5)
- 0.6B parameters achieving competitive quality with 12B models while 100× faster throughput
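A minimal sketch of the kernel-based linear attention that Linear DiT builds on, using a ReLU feature map; SANA's actual block differs in details such as its added depthwise convolution, so treat this as the O(N) idea rather than the implementation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention: apply a feature map to Q and K, then reorder the matmuls.

    q, k, v: (batch, heads, seq, dim). Computing (K^T V) first costs O(N * dim^2)
    instead of the O(N^2 * dim) of materializing the full attention matrix.
    """
    q, k = F.relu(q), F.relu(k)                          # nonnegative feature map
    kv = torch.einsum("bhnd,bhne->bhde", k, v)           # (dim x dim) key/value summary
    z = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2))   # per-query normalizer
    out = torch.einsum("bhnd,bhde->bhne", q, kv)
    return out / (z.unsqueeze(-1) + eps)

q = k = v = torch.randn(1, 4, 4096, 32)
print(linear_attention(q, k, v).shape)  # torch.Size([1, 4, 4096, 32])
```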
Lumina-Image 2.0 (2025/01/22)
Lumina-Image 2.0 aims to provide a unified and efficient image generative framework that excels in generating high-quality images with strong text-image alignment across diverse generation and editing tasks.
Architecturally, Lumina-Image 2.0 advances beyond Lumina-Next-T2I by introducing a unified Next-DiT architecture that seamlessly integrates text-to-image generation and image editing capabilities within a shared framework. The model maintains the Lumina series’ architectural strengths, including 3D RoPE and sandwich normalizations, while extending them to multi-task generation and editing.
- Unified Next-DiT framework seamlessly integrating generation and editing (vs generation-only focus)
- Enhanced multi-task architecture supporting diverse image generation applications within single model
- Optimized training paradigm leveraging shared representations across generation modalities
- Competitive performance across FID and CLIP benchmarks with improved efficiency
SANA 1.5 (2025/03/21)
SANA 1.5 aims to push the boundaries of efficient high-resolution image synthesis established by SANA, offering improved performance and scalability through larger model sizes and advanced inference scaling techniques.
Architecturally, SANA 1.5 builds upon the original SANA by incorporating an enhanced DC-AE (deep compression autoencoder) to handle higher resolutions and more complex generation tasks, along with advanced Linear DiT blocks featuring more sophisticated linear attention mechanisms to boost efficiency and quality in high-resolution synthesis. The model scales to 4.8B parameters compared to SANA’s 0.6B, providing a robust solution for generating high-quality images with strong text-image alignment suitable for diverse professional applications requiring both quality and computational efficiency.
- Inference scaling with VISA model for candidate selection dramatically improving GenEval scores (81→96)
- Enhanced DC-AE handling higher resolutions and more complex generation tasks
- Advanced Linear DiT with more sophisticated linear attention mechanisms
- Scaled to 4.8B parameters providing improved quality while maintaining efficiency advantages
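The inference-scaling idea reduces to a best-of-N selection loop; the sketch below assumes hypothetical `generate` and `score` callables standing in for the diffusion sampler and a verifier such as a VLM judge, and is not the VISA pipeline itself.

```python
import torch

@torch.no_grad()
def best_of_n(generate, score, prompt, n=8):
    """Generic inference-time scaling: sample n candidates and keep the best-scored one.

    `generate(prompt)` and `score(prompt, image)` are assumed callables; only the
    candidate-selection loop itself is demonstrated here.
    """
    candidates = [generate(prompt) for _ in range(n)]       # spend extra compute at inference
    scores = torch.tensor([score(prompt, img) for img in candidates])
    return candidates[int(scores.argmax())]                 # keep the highest-scoring sample
```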
HiDream-I1-Dev (2025/04/06)
HiDream-I1, developed by HiDream.ai, addresses the critical trade-off between quality improvements and computational complexity in image generative foundation models, aiming to achieve state-of-the-art image generation quality within seconds while maintaining high efficiency.
Architecturally, HiDream-I1 advances beyond Flux.1-Dev and Qwen-Image by implementing a novel sparse DiT structure where only subsets of transformer blocks are activated for each forward pass, dramatically reducing computational costs while maintaining generation quality. The sparse architecture enables the massive 17B parameter model to achieve practical inference speeds comparable to smaller dense models, with efficient diffusion mechanisms supporting multimodal input and providing fine-grained control over generation. This sparse approach represents a paradigm shift in scaling DiT models, demonstrating that architectural efficiency through sparsity can rival quality of substantially denser models.
- Sparse DiT structure activating only subsets of blocks per forward pass for efficient 17B parameter model
- 4K ultra-high-definition generation support with optimized inference speed despite massive scale
- Advanced sparse attention mechanisms maintaining quality while dramatically reducing computational costs
- Multimodal input support and fine-grained control optimized for professional-grade design applications
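One generic way to realize this kind of sparsity is learned top-k routing, where each token activates only a few expert sub-blocks per forward pass; the sketch below shows that routing pattern in isolation and is not HiDream's actual architecture.

```python
import torch
import torch.nn as nn

class TopKRoutedBlocks(nn.Module):
    """Generic sparse routing: each token is processed by only its top-k expert blocks."""
    def __init__(self, dim=256, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                                        # x: (batch, seq, dim)
        logits = self.router(x)                                  # (batch, seq, num_experts)
        weights, idx = logits.softmax(-1).topk(self.k, dim=-1)   # keep k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):                               # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e                       # tokens routed to expert e
                if mask.any():
                    w = weights[..., slot][mask].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

print(TopKRoutedBlocks()(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```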
CogView4-6B (2025/05/03)
CogView4-6B is the latest open release in the CogView series, a 6B-parameter text-to-image model that emphasizes bilingual (Chinese and English) prompt following and high-quality text rendering.
CogView4-6B leverages GLM-based text encoding and advanced transformer blocks with RoPE (Rotary Position Embedding) for enhanced spatial understanding and text-image alignment. This architectural sophistication enables the model to achieve superior text rendering capabilities, particularly for complex Chinese characters and multilingual content, setting new standards for text-to-image generation in non-Latin scripts. Available on Hugging Face under Apache 2.0 license.
Qwen-Image (2025/08/04)
Qwen-Image represents a monumental scaling achievement in text-to-image synthesis, establishing a new state-of-the-art with its massive 28.85 billion parameter architecture.
Architecturally, Qwen-Image employs a massively scaled Multi-Modal Diffusion Transformer (MMDiT) that builds upon the hybrid single- and double-stream designs seen in models like Flux.1-Dev. The generator model alone comprises over 20 billion parameters, combined with a powerful 8.29 billion parameter text encoder for unparalleled language comprehension. This dual-stream approach allows for sophisticated interaction between text and image modalities, enabling precise control over generated content. The model integrates advanced training techniques, including rectified flow and large-scale data curation, to ensure stable and efficient convergence despite its enormous size.
- Massive dense scaling to 28.85B parameters (vs HiDream's 17B sparse architecture)
- Focus on state-of-the-art quality through sheer scale (vs HiDream's focus on efficiency via sparsity)
- Extremely large 8.29B text encoder for superior text-image alignment
- Represents the pinnacle of the dense DiT scaling paradigm before potential shifts to new architectures
Experiments and Case Studies
To comprehensively evaluate the capabilities of different text-to-image diffusion models, we propose a systematic evaluation framework spanning tasks of varying complexity. This section will present case studies of text-to-image generation visualizations using existing checkpoints, assessing their performance across a spectrum of increasingly challenging tasks.
Implementation Details
| Parameter | Value |
|---|---|
| Precision | bfloat16 |
| Scheduler | default |
| Steps | 50 |
| Guidance Scale | 7.5 |
| Resolution | 512×512 |
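For reference, these settings map onto a Hugging Face diffusers call roughly as follows; the checkpoint id is a placeholder to be swapped for whichever model is under evaluation, and some pipelines override the default scheduler, guidance, or resolution behavior.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint id; substitute the model under evaluation.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,          # Precision: bfloat16
).to("cuda")                             # Scheduler: pipeline default

image = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=50,              # Steps: 50
    guidance_scale=7.5,                  # Guidance Scale: 7.5
    height=512, width=512,               # Resolution: 512x512
).images[0]
image.save("sample.png")
```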
- There is no strong correlation between image model size and image aesthetics (See case study 4).
- There is no strong correlation between text model size and prompt following (See case study 5).
- Larger models generally work better, but this is not always the case.
- U-Net-based models perform comparatively worse than DiTs at similar model sizes, for instance SDXL vs. SANA and Kandinsky 3 vs. CogView4.
- Stable Diffusion 3.x, which was continually trained at higher resolutions (e.g., 1024px), tends to generate cropped results at 512×512.
- Not all models are capable of dealing with multilingual prompts (see case study 2).
- Commercial models such as GPT-Image work extremely well in aesthetics, prompt following, counting, text rendering, and spatial reasoning.
Why Scaling Favors Attention
As diffusion models scaled in data and compute, the active bottleneck shifted from local fidelity to global semantic alignment, and the community moved accordingly: from U-Nets that hard-wire translation equivariance via convolution to Diffusion Transformers that learn equivariances through self-attention. Let $\mathcal{C}^{\mathrm{conv}}_{G}$ be the class of translation-equivariant, finite-support Toeplitz operators (U-Net convolutional kernels) and $\mathcal{A}^{\mathrm{attn}}$ the class of self-attention kernels with relative positional structure (DiTs). Write $\sqsubseteq^{\mathrm{bias}}$ as "is a constrained instance of (via inductive-bias constraints)"; then $\mathcal{C}^{\mathrm{conv}}_{G} \sqsubseteq^{\mathrm{bias}} \mathcal{A}^{\mathrm{attn}}$: every such convolution can be recovered from attention by freezing the attention pattern to a local, translation-invariant kernel, while attention can additionally realize content-dependent, long-range interactions.
In plain terms, convolution is a simplified, efficient expression of attention obtained by enforcing fixed translation symmetry, parameter tying, and locality; relaxing those constraints lets a model spend additional data and compute on learning which interactions matter, which is exactly what scaling rewards.
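Schematically, freezing the attention weights to a local, translation-invariant pattern collapses attention to a convolution, while the converse does not hold; this is the sense of the containment above (heads, normalization, and softmax temperature are elided):

$$
y_i=\sum_j \alpha_{ij}\,W_V x_j,\qquad \alpha_{ij}=\operatorname{softmax}_j\!\left(q_i^{\top}k_j + r_{j-i}\right)
$$

$$
\alpha_{ij}\equiv a_{j-i},\quad a_{j-i}=0\ \text{for}\ |j-i|>r \;\;\Longrightarrow\;\; y_i=\sum_{|j-i|\le r} a_{j-i}\,W_V\,x_j,
$$

which is exactly a convolution with the tied, local kernel $a_{j-i}W_V$; self-attention recovers it as a special case and otherwise lets the weights depend on content.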
Further Discussion
From Text-to-Image Generation to Real-World Applications
Text-to-image is now genuinely strong; the next wave is about conditioning existing pixels rather than generating from scratch—turning models into reliable editors that honor what must stay and change only what’s asked. This means prioritizing downstream tasks like image editing, inpainting/outpainting, image-to-image restyling, and structure- or reference-guided synthesis (edges, depth, layout, style, identity). The practical focus shifts from unconstrained novelty to controllable, faithful rewrites with tight mask adherence, robust subject/style preservation, and interactive latencies, so these systems plug cleanly into real creative, design, and industrial workflows.
Diffusion Models vs. Auto-regressive Models
Diffusion models and autoregressive (AR) models represent two fundamentally different approaches to image generation, with the key distinction being that autoregressive models operate on discrete image tokens while diffusion models work with continuous representations. Autoregressive models like DALL-E first map images to discrete codebook tokens and then predict those tokens one at a time with a Transformer, whereas diffusion models iteratively denoise a continuous latent or pixel representation.