A hero image summarizing the evolution of diffusion model architectures from U-Nets to Transformers.
Diffusion Image Model Architecture Evolution.

TL;DR

As diffusion systems scale, the biggest wins tend to come from leveraging compute with broad, general methods rather than hand-crafting ever more specific tricks . At the same time, we should keep sight of the “hardware lottery”: what succeeds can reflect today’s accelerators and tooling as much as inherent merit .

Preliminaries: Diffusion Models for Image Generation

Diffusion models have emerged as a powerful paradigm for generative modeling by learning to reverse a gradual noise corruption process. The fundamental approach involves two key stages: a forward diffusion process that systematically adds noise to data until it becomes pure Gaussian noise, and a reverse denoising process where a neural network gradually removes this noise to generate new samples.
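
Concretely, in the common DDPM-style parameterization the forward process and the simplified training objective can be written as follows (a standard textbook formulation, using the usual $\beta_t$ noise schedule and $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$):

\[
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),
\]
\[
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}
\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\rVert^2\right],
\]

i.e., the network $\epsilon_\theta$ is trained to predict the noise that was added, and sampling runs the learned reverse process from pure Gaussian noise back to data.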

This framework has demonstrated remarkable success across diverse domains including image generation, audio synthesis, video generation, and even applications in natural language processing and molecular design. The generality of the diffusion framework makes it particularly attractive for complex generative tasks.

Diagram showing the forward noising process and the reverse denoising process in diffusion models.
The Markov chain for the forward and reverse diffusion processes, which generate a sample by slowly adding (and removing) noise. Image Credit:

For readers seeking a comprehensive introduction to diffusion model fundamentals, we recommend Yang Song’s excellent exposition on score-based generative modeling and Lilian Weng’s detailed overview of diffusion models .

The U-Net Era

The early pioneering works in diffusion-based image generation predominantly adopted U-Net architectures as their neural network backbone. This choice was largely influenced by U-Net’s proven success in various computer vision tasks .

The foundational models in this era established the core principles of diffusion-based generation. NCSN (Noise Conditional Score Network) pioneered score-based generative modeling using a RefineNet backbone , while DDPM (Denoising Diffusion Probabilistic Models) established the probabilistic framework using a U-Net backbone adapted from PixelCNN++ . Subsequent refinements including NCSNv2 , IDDPM , ADM (Ablated Diffusion Model) , and SDE (Score-Based Generative Modeling through Stochastic Differential Equations) built upon these foundations with architectural variations similar to DDPM or NCSN. However, these early models focused primarily on unconditional or class-conditional image generation and lacked text-to-image capabilities.

The breakthrough for text-to-image generation came with LDM (Latent Diffusion Models, also known as Stable Diffusion) , which introduced a latent U-Net architecture combined with a KL-regularized VAE (autoencoder) to enable efficient text-conditioned generation. Following this success, several notable U-Net-based text-to-image models emerged, each exploring different architectural innovations within the U-Net paradigm:

| Model | Gen. (#Param) | Txt. (#Param) | Total (#Param) | Release Date |
|---|---|---|---|---|
| SD v2.1 | 0.87B | 0.34B | 1.29B | 2022-12-07 |
| Kandinsky | 1.23B | 0.56B | 1.86B | 2023-01-01 |
| UniDiffuser | 0.95B | 0.12B | 1.25B | 2023-05-12 |
| SDXL | 2.57B | 0.82B | 3.47B | 2023-06-25 |
| Kandinsky 3 | 3.06B | 8.72B | 12.05B | 2023-12-11 |
| Stable Cascade (Würstchen) | 1.56B | 0.69B | 2.28B | 2024-02-07 |

The standard U-Net architecture for diffusion models typically consists of an encoder that progressively downsamples the noisy input, a bottleneck middle block that processes compressed representations, and a decoder that upsamples back to the original resolution. Crucially, skip connections preserve fine-grained spatial information across corresponding encoder and decoder stages.

U-Net backbone used in diffusion models with time conditioning injected into residual blocks and skip connections between encoder and decoder.
A typical U-Net backbone used in diffusion models with time conditioning. Time representation uses sinusoidal positional embeddings or random Fourier features; these time features are injected into residual blocks via simple spatial addition or adaptive group normalization layers. Image Credit: .
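
To make the injection pathway concrete, here is a minimal PyTorch-style sketch of a time-conditioned residual block (illustrative only: layer widths, GroupNorm groups, and the additive-injection variant are assumptions, not any specific model's code):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_time_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps (B,) to sinusoidal features (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeConditionedResBlock(nn.Module):
    """Residual block that injects the projected time embedding by broadcast addition."""
    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)  # map time features to channel dim
        self.norm2 = nn.GroupNorm(32, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        h = self.conv1(F.silu(self.norm1(x)))
        h = h + self.time_proj(t_emb)[:, :, None, None]  # spatial broadcast of the time signal
        h = self.conv2(F.silu(self.norm2(h)))
        return x + h

# usage: x is (B, C, H, W), t is (B,) integer timesteps
x, t = torch.randn(2, 64, 32, 32), torch.randint(0, 1000, (2,))
out = TimeConditionedResBlock(64, time_dim=128)(x, sinusoidal_time_embedding(t, 128))
```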

Non-text conditioning: two-stage cascades and U‑ViT

Beyond text conditioning, U-Net backbones evolved substantially in non-text settings (unconditional or class-conditional), focusing on sample quality, stability, and compute efficiency:

  • Two-stage/cascaded U-Nets: Decompose generation into a low-resolution base diffusion model and one or more super-resolution diffusion upsamplers. The base model captures global structure; specialized upsamplers (e.g., SR3) iteratively refine detail at higher resolutions. This cascade improves fidelity and stability on ImageNet and face datasets while keeping training tractable .
  • U‑ViT (ViT backbone in a U‑shaped design): Replace CNN residual blocks with Vision Transformer blocks while retaining long-range skip connections. U‑ViT tokenizes noisy image patches, timesteps, and (optionally) class tokens, enabling stronger global context modeling than CNN U‑Nets and achieving competitive ImageNet FID with comparable compute .
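
A minimal sketch of the U-ViT idea described above: transformer blocks over a token sequence with U-Net-style long skip connections, fused here by concatenation plus a linear projection (block counts and the fusion choice are illustrative assumptions, not the official implementation):

```python
import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    """Stack of transformer blocks with long skips between the i-th shallow
    block and the mirrored deep block, as in U-ViT."""
    def __init__(self, dim: int = 256, depth: int = 6, heads: int = 4):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.in_blocks = nn.ModuleList(make() for _ in range(depth // 2))
        self.mid_block = make()
        self.out_blocks = nn.ModuleList(make() for _ in range(depth // 2))
        # fuse each skip by concatenation followed by a linear projection
        self.skip_proj = nn.ModuleList(nn.Linear(dim * 2, dim) for _ in range(depth // 2))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        skips = []
        for blk in self.in_blocks:
            tokens = blk(tokens)
            skips.append(tokens)
        tokens = self.mid_block(tokens)
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            tokens = proj(torch.cat([tokens, skips.pop()], dim=-1))  # long skip connection
            tokens = blk(tokens)
        return tokens

# tokens would be patchified noisy latents plus time/class tokens
out = UViTSketch()(torch.randn(2, 256 + 2, 256))
```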

Key takeaways (non-text U-Net family)

  • Cascades separate global structure (base) from high-frequency detail (super‑res), scaling quality to high resolutions efficiently.
  • ViT backbones in U‑shaped layouts preserve inductive benefits of skip connections while capturing long-range dependencies.
  • These ideas later influenced text-to-image systems (e.g., two‑stage SDXL) even as the field transitioned toward DiT backbones.

The DiTs Era

As U-Net–based models began to hit a scaling ceiling (e.g., SDXL with ~2.6B parameters ), naive scaling proved ineffective, motivating a shift towards alternative backbones. The introduction of Diffusion Transformers (DiTs) marks a significant paradigm shift by recasting image generation as a patch-sequence modeling problem solved with transformer blocks. This approach offers several key advantages over U-Nets, including superior scalability via stacked DiT blocks, the ability to capture global context via self-attention for long-range dependencies, and a unified architecture that leverages advances in multimodal integration.

DiT Architecture.
The Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed into patches and processed by several DiT blocks. Right: Details of our DiT blocks. We experiment with variants of standard transformer blocks that incorporate conditioning via adaptive layer norm, cross-attention and extra input tokens. Adaptive layer norm works best. Image Credit: .
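
To make the adaptive-layer-norm conditioning concrete, here is a hedged PyTorch sketch of a single DiT-style block with adaLN-Zero modulation (dimensions, head counts, and the exact gating layout are illustrative; the DiT paper and its reference code define the canonical version):

```python
import torch
import torch.nn as nn

def modulate(x, shift, scale):
    # scale and shift every token's features using the conditioning vector
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class DiTBlockSketch(nn.Module):
    """Transformer block whose LayerNorms are modulated by a conditioning
    vector (timestep plus pooled class/text embedding), adaLN-Zero style."""
    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        # 6 modulation vectors: shift/scale/gate for attention and for the MLP
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[-1].weight)  # "Zero": each block starts as the identity map
        nn.init.zeros_(self.ada[-1].bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        s1, sc1, g1, s2, sc2, g2 = self.ada(cond).chunk(6, dim=-1)
        h = modulate(self.norm1(x), s1, sc1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = modulate(self.norm2(x), s2, sc2)
        return x + g2.unsqueeze(1) * self.mlp(h)

tokens, cond = torch.randn(2, 64, 384), torch.randn(2, 384)
out = DiTBlockSketch()(tokens, cond)
```
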
| Model | Gen. (#Param) | Txt. (#Param) | Total (#Param) | Release Date |
|---|---|---|---|---|
| PixArt-$\alpha$ | 0.61B | 4.76B | 5.46B | 2023/10/06 |
| Lumina-T2I | ~4.7B | ~7B | ~15B | 2024/04/01 |
| PixArt-$\Sigma$ | 0.61B | 4.76B | 5.46B | 2024/04/11 |
| Lumina-Next-T2I | 1.75B | 2.51B | 4.34B | 2024/05/12 |
| Stable Diffusion 3 | 2.03B | 5.58B | 7.69B | 2024/06/12 |
| Flux.1-Dev | 11.90B | 4.88B | 16.87B | 2024/08/02 |
| CogView3-Plus | 2.85B | 4.76B | 8.02B | 2024/10/13 |
| Hunyuan-DiT | 1.50B | 2.02B | 3.61B | 2024/12/01 |
| SANA | 0.59B | 2.61B | 3.52B | 2025/01/11 |
| Lumina-Image 2.0 | 2.61B | 2.61B | 5.31B | 2025/01/22 |
| SANA 1.5 | 1.60B | 2.61B | 4.53B | 2025/03/21 |
| HiDream-I1-Dev | 17.11B | 5.58B | 22.77B | 2025/04/06 |
| CogView4-6B | 3.50B | 2.00B | 6.00B | 2025/05/03 |
| Qwen-Image | 20.43B | 8.29B | 28.85B | 2025/08/04 |

Latest Advancement in U-Net and DiT Architecture Design

While the transition from U-Net to DiT architectures represents a major paradigm shift, both architectural families have continued to evolve with innovative refinements. In the U-Net domain, two-stage cascaded approaches decompose generation into a low-resolution base model and specialized super-resolution upsamplers, improving fidelity while maintaining training tractability. U-ViT bridges U-Net and transformer architectures by replacing CNN residual blocks with Vision Transformer blocks while retaining the characteristic U-shaped structure and skip connections, enabling stronger global context modeling with competitive ImageNet performance.

The DiT family has seen rapid advances across multiple dimensions:

  • Architecture variants: SiT (Scalable Interpolant Transformer) replaces diffusion with interpolant-based transport for improved stability, and LiT (Linear Diffusion Transformer) achieves O(n) complexity through linear attention mechanisms, enabling higher-resolution generation.
  • Training efficiency: MDT/MDTv2 and MaskDiT leverage masked latent modeling to achieve 10× faster learning and competitive performance with only 30% of standard training time, while representation-based approaches REPA (REPresentation Alignment) and REG (Representation Entanglement for Generation) incorporate external pretrained visual representations to dramatically accelerate training: REPA achieves a 17.5× speedup and an FID of 1.42, while REG achieves 63× faster training than baseline SiT by entangling image latents with class tokens during denoising, with negligible inference overhead.
  • Architecture refinements: DDT decouples semantic encoding from high-frequency decoding for 4× faster convergence. In parallel, U-DiTs downsample (and later upsample) tokens in a U-shaped DiT, shortening the effective sequence length to reduce attention cost while preserving fine detail for high-resolution synthesis .
  • Tokenizer innovations leverage pretrained foundation models: RAE replaces standard VAEs with pretrained representation encoders (DINO, SigLIP, MAE), achieving an FID of 1.13, and Aligned Visual Foundation Encoders employ a three-stage alignment strategy to transform foundation encoders into semantically rich tokenizers, accelerating convergence (gFID 1.90 in 64 epochs) and outperforming standard VAEs in text-to-image generation.
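
As one example of how a representation-alignment objective can be wired into DiT training, the sketch below adds a cosine-similarity term between projected intermediate denoiser features and features from a frozen pretrained encoder. This paraphrases the general REPA-style idea under assumed shapes; `dit_hidden`, `teacher_feats`, and the projection head are placeholders rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepresentationAlignmentLoss(nn.Module):
    """Cosine-align projected DiT hidden states with frozen teacher features."""
    def __init__(self, dit_dim: int, teacher_dim: int):
        super().__init__()
        # small MLP head mapping DiT features into the teacher's feature space
        self.proj = nn.Sequential(nn.Linear(dit_dim, teacher_dim), nn.SiLU(),
                                  nn.Linear(teacher_dim, teacher_dim))

    def forward(self, dit_hidden: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # dit_hidden: (B, N, dit_dim) from an intermediate block of the denoiser
        # teacher_feats: (B, N, teacher_dim) from a frozen encoder (e.g., DINOv2) on the clean image
        pred = F.normalize(self.proj(dit_hidden), dim=-1)
        target = F.normalize(teacher_feats.detach(), dim=-1)
        return 1.0 - (pred * target).sum(dim=-1).mean()  # 1 - mean cosine similarity

# total loss would combine: denoising_loss + lambda_align * align_loss
align_loss = RepresentationAlignmentLoss(384, 768)(torch.randn(2, 64, 384), torch.randn(2, 64, 768))
```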

Pre-trained Text-to-Image Checkpoints

The landscape of pre-trained text-to-image models has evolved dramatically since the introduction of Stable Diffusion. These models serve as powerful foundation models that can be adapted for specialized downstream tasks without architectural modifications, simply by fine-tuning on domain-specific datasets.

U-Net Family

Stable Diffusion represents the pioneering work in latent diffusion models, adopting a U-Net architecture that operates in a compressed latent space rather than pixel space. This design choice dramatically reduces computational costs while maintaining high-quality generation capabilities. The model combines two key components: a pre-trained variational autoencoder (VAE) for efficient image compression and decompression, and a diffusion model that performs the denoising process in this latent space. In the original LDM work , the autoencoder followed a VQ-GAN-style design ; with CompVis Stable Diffusion v1.1-v1.4 and Stability AI's Stable Diffusion v1.5 and v2.x, the autoencoder was switched to a KL-regularized AutoencoderKL rather than a VQ variant.
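
For readers who want to try these checkpoints directly, the Hugging Face diffusers library wraps the VAE, U-Net, and text encoder behind a single pipeline. A minimal usage sketch is shown below; the model ID, prompt, and settings are examples, and a CUDA GPU is assumed:

```python
import torch
from diffusers import StableDiffusionPipeline

# Loading a Stable Diffusion 1.x/2.x checkpoint wires up the AutoencoderKL (VAE),
# the U-Net denoiser, and the CLIP text encoder behind one interface.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example model ID; any SD 1.x/2.x checkpoint works
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("lighthouse.png")
```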

Stable Diffusion 1.x - 2.x architecture.
Stable Diffusion 1.x - 2.x architecture. Image Credit: .

Stable Diffusion XL (SDXL) marked a significant scaling advancement, adopting a two-stage U-Net architecture and increasing the model size from 0.8 billion to 2.6 billion parameters. SDXL remains one of the largest U-Net-based models for image generation and demonstrates improved efficiency and compatibility across diverse domains and tasks. Despite reaching scaling limits, SDXL continues to serve as a foundation for numerous specialized applications.

SDXL Architecture
SDXL Architecture. Image Credit: .

Kandinsky represents a significant advancement in the U-Net era, introducing a novel exploration of latent diffusion architecture that combines image prior models with latent diffusion techniques. The model features a modified MoVQ implementation as the image autoencoder component and achieves a FID score of 8.03 on the COCO-30K dataset, marking it as the top open-source performer in terms of measurable image generation quality. Kandinsky 3 continues this series with improved text understanding and domain-specific performance, presenting a multifunctional generative framework supporting text-guided inpainting/outpainting, image fusion, and image-to-video generation.

Stable Cascade (based on Würstchen architecture) introduces an efficient architecture for large-scale text-to-image diffusion models, achieving competitive performance with unprecedented cost-effectiveness. The key innovation is a latent diffusion technique that learns extremely compact semantic image representations, reducing computational requirements significantly—training requires only 24,602 A100-GPU hours compared to Stable Diffusion 2.1’s 200,000 GPU hours while maintaining state-of-the-art results.

UniDiffuser explores transformer-based diffusion models with a unified framework that fits all distributions relevant to multi-modal data in one model. While primarily focused on transformer architectures, this work demonstrates the potential for unified multi-modal generation within the diffusion framework.

PixArt-$\alpha$ (2023/10/06)

PixArt-$\alpha$ is motivated by the rising compute and environmental costs of text-to-image systems, seeking near-commercial quality with a much smaller training budget . In contrast to SD 1.5/2.1, it adopts a large-language-model text encoder (T5) , making it among the first open-source diffusion T2I models to use an LLM-based text encoder while keeping the overall design streamlined.

Architecturally, PixArt-$\alpha$ is a latent Diffusion Transformer (DiT): VAE latents are patchified into a token sequence processed by stacked Transformer blocks; each block applies cross-attention to text tokens, and timestep conditioning is injected via a shared adaLN-single, simplifying parameters and conditioning pathways .
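
A hedged sketch of the adaLN-single idea: one shared MLP maps the time embedding to the modulation parameters, and each block stores only a small learnable adjustment instead of its own full adaLN MLP (shapes and the exact parameterization are assumptions based on the paper's description, not its code):

```python
import torch
import torch.nn as nn

class SharedTimeModulation(nn.Module):
    """One global MLP maps the time embedding to 6*dim modulation parameters."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(t_emb)                      # (B, 6*dim), shared by every block

class BlockSpecificShift(nn.Module):
    """Each block keeps only a learnable offset added to the shared output,
    instead of its own full adaLN MLP, which saves parameters."""
    def __init__(self, dim: int):
        super().__init__()
        self.table = nn.Parameter(torch.zeros(6 * dim))

    def forward(self, shared: torch.Tensor) -> torch.Tensor:
        return shared + self.table                  # cheap per-block adjustment

dim = 256
shared = SharedTimeModulation(dim)(torch.randn(2, dim))
mods = [BlockSpecificShift(dim)(shared) for _ in range(4)]  # 4 hypothetical blocks
```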

Key differences vs SD 1.5/2.1
  • Transformer sequence-of-patches backbone (no encoder–decoder or skip connections)
  • Shared adaLN for time and unified per-block cross-attention (vs U-Net residual blocks with per-block time MLP/spatial injections)
  • T5 text encoder (LLM) rather than CLIP/OpenCLIP
Cost Comparison
Comparisons of CO2 emissions and training cost among T2I generators. PIXART-α achieves an exceptionally low training cost of $28,400. Compared to RAPHAEL , our CO2 emissions and training costs are merely 1.2% and 0.91%, respectively. Image Credit: .
Pixart-α Architecture
Model architecture of PIXART-α. A cross-attention module is integrated into each block to inject textual conditions. To optimize efficiency, all blocks share the same adaLN-single parameters for time conditions. Image Credit: .

Lumina-T2I (2024/04/01)

Lumina-T2I is the first entry in the Lumina series from Shanghai AI Lab, aiming for a simple, scalable framework that supports flexible resolutions while maintaining photorealism. Motivated by the observation from Sora that scaling Diffusion Transformers enables generation across arbitrary resolutions, aspect ratios, and durations, but that no concrete implementation details were released, Lumina-T2I adopts flow matching to stabilize and accelerate training .

Architecturally, Lumina-T2I uses a Flow-based Large Diffusion Transformer (Flag-DiT) with zero-initialized attention, RoPE , and KQ-Norm . Latent features are tokenized and processed by Transformer blocks; learnable placeholders such as the [nextline] token and layerwise relative position injection enable robust resolution extrapolation without retraining for each size.
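
The flow-matching objective used across the Lumina family can be summarized as follows (written here in one common rectified-flow convention; the papers' exact schedules may differ):

\[
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad t \sim \mathcal{U}(0,1), \qquad
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{x_0,\ \epsilon,\ t}\!\left[\big\lVert v_\theta(x_t, t, c) - (\epsilon - x_0)\big\rVert^2\right],
\]

so the network regresses the constant velocity of the straight path between data $x_0$ and noise $\epsilon$, conditioned on the text prompt $c$.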

Key differences vs PixArt-α
  • Robust resolution generalization across 512²–1792²
  • Uses one-dimensional RoPE, [nextline] token, and layerwise relative position injection
  • PixArt-α uses absolute positional embeddings limited to the initial layer, degrading at out-of-distribution scales
Lumina-T2I Architecture
Lumina-T2I architecture featuring Flag-DiT backbone. Image Credit: .

PixArt-$\Sigma$ (2024/04/11)

PixArt-$\Sigma$ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-$\Sigma$’s capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.

Architecturally, PixArt-$\Sigma$ maintains the same DiT backbone as PixArt-$\alpha$ but introduces efficient token compression through a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. The model incorporates superior-quality image data paired with more precise and detailed image captions, along with data curriculum strategies for improved training effectiveness.
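
The key/value compression can be pictured as a learned spatial downsampling applied only to keys and values before attention, leaving queries at full resolution; below is a hedged sketch under assumed shapes (the actual PixArt-$\Sigma$ operator may differ in details such as the compression ratio and where it is applied):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVCompressedAttention(nn.Module):
    """Self-attention whose keys/values are spatially downsampled (e.g., 2x2),
    shrinking the attention matrix from N x N to N x (N/4)."""
    def __init__(self, dim: int = 256, heads: int = 4, ratio: int = 2):
        super().__init__()
        self.heads, self.ratio = heads, ratio
        self.qkv = nn.Linear(dim, dim * 3)
        self.kv_pool = nn.Conv2d(dim, dim, kernel_size=ratio, stride=ratio, groups=dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, d = x.shape                      # n == h * w image tokens
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def compress(t):                       # (B, N, D) -> (B, N/ratio^2, D)
            t = t.transpose(1, 2).reshape(b, d, h, w)
            t = self.kv_pool(t)
            return t.flatten(2).transpose(1, 2)

        k, v = compress(k), compress(v)

        def split(t):                          # (B, M, D) -> (B, heads, M, D/heads)
            return t.reshape(b, -1, self.heads, d // self.heads).transpose(1, 2)

        attn = F.scaled_dot_product_attention(split(q), split(k), split(v))
        return self.out(attn.transpose(1, 2).reshape(b, n, d))

x = torch.randn(1, 32 * 32, 256)
out = KVCompressedAttention()(x, h=32, w=32)
```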

Key differences vs PixArt-α
  • Efficient token compression via novel attention module compressing keys and values
  • Superior-quality training data with more precise and detailed captions
  • Data curriculum strategies for improved training effectiveness
  • 4K image generation capability for high-resolution content creation

Lumina-Next-T2I (2024/05/12)

Lumina-Next-T2I Architecture
Lumina-Next-T2I Next-DiT architecture. Image Credit: .

Lumina-Next-T2I targets the core limitations observed in Lumina-T2X—training instability, slow inference, and resolution extrapolation artifacts—by delivering stronger quality and faster sampling while improving zero-shot multilingual understanding. Unlike prior T2I works that rely on CLIP or T5 encoders , the Lumina series adopts decoder-only LLMs as text encoders: Lumina-T2X uses LLaMA-2 7B , whereas Lumina-Next employs the lighter Gemma-2B to reduce memory and increase throughput. In practice, Lumina-Next shows clear gains on multilingual prompts (vs. CLIP/T5 setups) and further improves text-image alignment with alternative LLMs like Qwen-1.8B and InternLM-7B.

Architecturally, Lumina-Next introduces the Next-DiT backbone with 3D RoPE and Frequency- and Time-Aware Scaled RoPE for robust resolution extrapolation . It adds sandwich normalizations to stabilize training (cf. normalization strategies such as KQ-Norm ), a sigmoid time discretization schedule to reduce Flow-ODE sampling steps, and a Context Drop mechanism that merges redundant visual tokens to accelerate inference—all while retaining the flow-based DiT formulation of the Lumina family.

Key differences vs Lumina-T2I
  • Next-DiT with 3D RoPE + frequency/time-aware scaling for stronger resolution extrapolation
  • Sandwich normalizations improve stability; sigmoid time schedule reduces sampling steps
  • Context Drop merges redundant tokens for faster inference throughput
  • Decoder-only LLM text encoders (Gemma-2B by default; Qwen-1.8B/InternLM-7B optional) boost zero-shot multilingual alignment vs CLIP/T5

Stable Diffusion 3 (2024/06/12)

Stable Diffusion 3 aims to improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales, demonstrating superior performance compared to established diffusion formulations for high-resolution text-to-image synthesis . This work presents the first comprehensive scaling study for text-to-image DiTs, establishing predictable scaling trends and correlating lower validation loss to improved synthesis quality across various metrics and human evaluations.

Architecturally, SD3 transitions from DiT’s cross-attention blocks to MMDiT (Multimodal Diffusion Transformer) with double-stream blocks that use separate weights for the two modalities, enabling bidirectional flow of information between image and text tokens for improved text comprehension and typography. Unlike SDXL which relies primarily on CLIP encoders, SD3 incorporates both CLIP (L/14 and OpenCLIP bigG/14) and T5-XXL encoders , concatenating pooled outputs and hidden representations to create comprehensive text conditioning with enhanced understanding capabilities.
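
A hedged sketch of the double-stream idea: text and image tokens keep separate projection weights but attend jointly over the concatenated sequence (a simplification of the MMDiT block that omits normalization, modulation, and the MLPs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionSketch(nn.Module):
    """Separate QKV weights per modality, one joint attention over all tokens."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.qkv_img = nn.Linear(dim, dim * 3)   # image-stream weights
        self.qkv_txt = nn.Linear(dim, dim * 3)   # text-stream weights
        self.out_img = nn.Linear(dim, dim)
        self.out_txt = nn.Linear(dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        b, n_img, d = img.shape
        n_txt = txt.shape[1]

        def split(t):  # (B, N, 3D) -> three (B, heads, N, D/heads) tensors
            q, k, v = t.chunk(3, dim=-1)
            shape = lambda u: u.reshape(b, -1, self.heads, d // self.heads).transpose(1, 2)
            return shape(q), shape(k), shape(v)

        qi, ki, vi = split(self.qkv_img(img))
        qt, kt, vt = split(self.qkv_txt(txt))
        # concatenate along the sequence axis so every token attends to both modalities
        q, k, v = (torch.cat(p, dim=2) for p in ((qi, qt), (ki, kt), (vi, vt)))
        o = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).reshape(b, n_img + n_txt, d)
        return self.out_img(o[:, :n_img]), self.out_txt(o[:, n_img:])

img_out, txt_out = JointAttentionSketch()(torch.randn(2, 64, 256), torch.randn(2, 16, 256))
```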

Key differences vs SDXL and PixArt-α
  • MMDiT double-stream architecture with separate weights per modality and bidirectional information flow (vs single-stream cross-attention)
  • Integrated rectified flow training with perceptually-biased noise sampling (vs standard diffusion formulation)
  • Combined CLIP + T5-XXL text encoding for enhanced text comprehension and typography
  • First comprehensive scaling study demonstrating predictable trends for text-to-image DiTs
Stable Diffusion 3 Architecture
Stable Diffusion 3 MMDiT architecture. Image Credit: .
Simplified Architecture Illustration of Stable Diffusion 3.5 MM-DiT block. Image Source: Stability AI Blog.

Flux.1-Dev (2024/08/02)

Flux.1-Dev, developed by former Stability AI core members, aims to scale beyond previous models and achieve superior image quality with more accurate text-to-image synthesis . Representing a significant scaling effort, the model features a massive 12 billion parameter generator combined with a 4.7 billion parameter text encoder, marking substantial growth compared to predecessors and establishing new benchmarks in AI-driven image generation capabilities.

Architecturally, Flux.1-Dev advances beyond SD3’s MMDiT by implementing a hybrid architecture that combines both single-stream and double-stream Multi-Modal Diffusion Transformers, enhancing the model’s ability to process complex visual-textual relationships. Like SD3, it incorporates T5 text encoding and integrates rectified flow techniques for more stable and efficient training, while conducting a comprehensive scaling study that optimizes performance across the substantially larger parameter space.

Key differences vs SD3
  • Hybrid single-stream + double-stream MMDiT architecture (vs purely double-stream MMDiT)
  • Massive scaling to 12B generator + 4.7B text encoder parameters (vs smaller SD3 variants)
  • Enhanced rectified flow implementation optimized for larger scale training
  • Comprehensive scaling study specifically designed for multi-billion parameter DiTs
Flux.1-Dev Architecture
Flux.1-Dev MMDiT architecture. Image Credit: .

CogView3 & CogView3-Plus (2024/10/13)

CogView3 Architecture.
(left) The pipeline of CogView3. User prompts are rewritten by a text-expansion language model. The base stage model generates 512 × 512 images, and the second stage subsequently performs relaying super-resolution. (right) Formulation of relaying super-resolution in the latent space. Image Credit: .

CogView3 introduces a relay diffusion approach that generates low-resolution images first, then refines them through super-resolution to achieve 2048×2048 outputs. This multi-stage process reduces computational costs while improving quality—CogView3 outperformed SDXL by 77% in human evaluations while using only one-tenth the inference time. The model employs a text-expansion language model to rewrite user prompts, with a base stage generating 512×512 images followed by relaying super-resolution in the latent space.

CogView3-Plus upgrades to DiT architecture with Zero-SNR scheduling and joint text-image attention for further efficiency gains. This architectural evolution represents a significant step in the CogView series, transitioning from traditional approaches to transformer-based diffusion models while maintaining the efficiency advantages of the relay diffusion framework.

Hunyuan-DiT (2024/12/01)

Hunyuan-DiT, developed by Tencent’s Hunyuan team, aims to create a powerful multi-resolution diffusion transformer capable of fine-grained understanding of both English and Chinese languages, addressing the need for state-of-the-art Chinese-to-image generation with culturally relevant and multilingual capabilities . The model establishes a comprehensive data pipeline with iterative optimization, employing a Multimodal Large Language Model to refine image captions and enhance alignment between textual descriptions and generated images, particularly for intricate Chinese characters and cultural nuances.

Architecturally, Hunyuan-DiT builds upon PixArt-$\alpha$ by incorporating both single-stream and double-stream Multi-Modal Diffusion Transformer (MM-DiT) blocks similar to SD3, enabling efficient handling of complex image generation tasks across multiple resolutions. The model integrates dual text encoders—CLIP for understanding overall semantic content and T5 for nuanced language comprehension including complex sentence structures—combined with enhanced positional encoding to maintain spatial information across different resolutions, facilitating robust multi-resolution generation capabilities.

Key differences vs PixArt-$\alpha$
  • Single-stream + double-stream MM-DiT blocks for enhanced multi-modal processing (vs single-stream cross-attention)
  • Dual text encoders (CLIP + T5) for semantic and nuanced language understanding (vs T5 only)
  • Multi-resolution diffusion transformer with enhanced positional encoding for robust resolution handling
  • Multimodal LLM-refined captions with fine-grained bilingual (English + Chinese) understanding
Hunyuan-DiT Architecture
Hunyuan-DiT multi-resolution architecture. Image Credit: .

SANA (2025/01/11)

SANA, developed by NVIDIA, aims to enable efficient high-resolution image synthesis up to 4096×4096 pixels while maintaining deployment feasibility on consumer hardware, generating 1024×1024 images in under a second on a 16GB laptop GPU . The model introduces innovations to reduce computational requirements dramatically: DC-AE (deep compression autoencoder) achieves 32× image compression reducing latent tokens significantly, efficient caption labeling and selection accelerate convergence, and Flow-DPM-Solver reduces sampling steps for faster generation.

Architecturally, SANA advances beyond PixArt-$\Sigma$ by replacing traditional self-attention mechanisms with Linear Diffusion Transformer (Linear DiT) blocks, enhancing computational efficiency at high resolutions without compromising quality. The model adopts a decoder-only small language model as the text encoder, employing complex human instructions with in-context learning to improve text-image alignment compared to conventional CLIP or T5 encoders. The compact 0.6B parameter model achieves competitive performance with substantially larger models like Flux-12B while being 20 times smaller and over 100 times faster in throughput.
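
Linear attention replaces the softmax over an $N \times N$ score matrix with a kernel feature map, so keys and values can be aggregated once in O(N); here is a hedged sketch using a ReLU feature map (SANA's actual block adds further components beyond this core operation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttentionSketch(nn.Module):
    """O(N) attention: aggregate phi(K)^T V once, then read out per query."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).reshape(b, n, 3, self.heads, d // self.heads).permute(2, 0, 3, 1, 4)
        q, k = F.relu(q), F.relu(k)                                  # non-negative feature map phi
        kv = torch.einsum("bhnd,bhne->bhde", k, v)                   # per-head (d x d) summary, O(N)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)  # normalizer
        o = torch.einsum("bhnd,bhde->bhne", q, kv) * z.unsqueeze(-1)
        return self.out(o.transpose(1, 2).reshape(b, n, d))

out = LinearAttentionSketch()(torch.randn(1, 4096, 256))  # e.g., 64x64 latent tokens
```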

Key differences vs PixArt-$\Sigma$
  • Linear DiT replacing traditional self-attention for O(n) complexity vs O(n²) at high resolutions
  • DC-AE with 32× compression reducing latent tokens and memory requirements dramatically
  • Decoder-only language model as text encoder with in-context learning (vs T5)
  • 0.6B parameters achieving competitive quality with 12B models while 100× faster throughput
SANA Architecture
SANA Linear DiT architecture for efficient high-resolution generation. Image Credit: .

Lumina-Image 2.0 (2025/01/22)

Lumina-Image 2.0 aims to provide a unified and efficient image generative framework that excels in generating high-quality images with strong text-image alignment across diverse generation and editing tasks . Building upon the Lumina series’ foundation, the model consolidates multiple generation tasks into a cohesive framework, optimizing performance and efficiency to cater to a wide range of image generation applications while achieving competitive scores across multiple benchmarks including FID and CLIP metrics.

Architecturally, Lumina-Image 2.0 advances beyond Lumina-Next-T2I by introducing a unified Next-DiT architecture that seamlessly integrates text-to-image generation and image editing capabilities within a shared framework. The model maintains the Lumina series’ architectural strengths including 3D RoPE , frequency-aware scaling, and flow-based formulation, while enhancing the framework to support both generation and editing operations efficiently. This unified approach enables the model to leverage shared representations and training strategies across different image generation modalities.

Key differences vs Lumina-Next-T2I
  • Unified Next-DiT framework seamlessly integrating generation and editing (vs generation-only focus)
  • Enhanced multi-task architecture supporting diverse image generation applications within single model
  • Optimized training paradigm leveraging shared representations across generation modalities
  • Competitive performance across FID and CLIP benchmarks with improved efficiency
Lumina-Image 2.0 Architecture
Lumina-Image 2.0 Unified Next-DiT architecture. Image Credit: .

SANA 1.5 (2025/03/21)

SANA 1.5 aims to push the boundaries of efficient high-resolution image synthesis established by SANA, offering improved performance and scalability through larger model sizes and advanced inference scaling techniques . The model introduces inference scaling via VISA (a specialized NVILA-2B model) that scores and selects top images from large candidate sets, significantly boosting GenEval performance scores—for instance, improving SANA-1.5-4.8B from 81 to 96. This approach demonstrates that post-generation selection can dramatically enhance quality metrics without architectural changes.
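
The inference-scaling recipe amounts to sampling many candidates per prompt and keeping the best according to a judge model; below is a hedged sketch of the selection loop, where the `generate` and `score` callables stand in for the diffusion pipeline and the VISA/NVILA judge (neither is reproduced here):

```python
from typing import Any, Callable, List, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str, int], Any],   # (prompt, seed) -> candidate image
              score: Callable[[str, Any], float],    # judge model: higher is better
              n: int = 16,
              top_k: int = 1) -> List[Tuple[float, Any]]:
    """Sample n candidates for one prompt and keep the top_k by judge score."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    ranked = sorted(((score(prompt, img), img) for img in candidates),
                    key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]

# usage (placeholders): best = best_of_n("a red cube on a blue sphere", my_pipeline, my_judge, n=64)
```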

Architecturally, SANA 1.5 builds upon the original SANA by incorporating an enhanced DC-AE (deep compression autoencoder) to handle higher resolutions and more complex generation tasks, along with advanced Linear DiT blocks featuring more sophisticated linear attention mechanisms to boost efficiency and quality in high-resolution synthesis. The model scales to 4.8B parameters compared to SANA’s 0.6B, providing a robust solution for generating high-quality images with strong text-image alignment suitable for diverse professional applications requiring both quality and computational efficiency.

Key differences vs SANA
  • Inference scaling with VISA model for candidate selection dramatically improving GenEval scores (81→96)
  • Enhanced DC-AE handling higher resolutions and more complex generation tasks
  • Advanced Linear DiT with more sophisticated linear attention mechanisms
  • Scaled to 4.8B parameters providing improved quality while maintaining efficiency advantages
SANA 1.5 Architecture
SANA 1.5 improved Linear DiT architecture. Image Credit: .

HiDream-I1-Dev (2025/04/06)

HiDream-I1, developed by HiDream.ai, addresses the critical trade-off between quality improvements and computational complexity in image generative foundation models, aiming to achieve state-of-the-art image generation quality within seconds while maintaining high efficiency . With 17 billion parameters, the model introduces a sparse Diffusion Transformer structure that enables efficient inference suitable for professional-grade design needs, supporting 4K ultra-high-definition image generation with advanced text comprehension, multi-style adaptation, and precise detail control while optimizing computational requirements through sparsity.

Architecturally, HiDream-I1 advances beyond large dense DiTs such as Flux.1-Dev by implementing a novel sparse DiT structure where only subsets of transformer blocks are activated for each forward pass, dramatically reducing computational costs while maintaining generation quality. The sparse architecture enables the massive 17B parameter model to achieve practical inference speeds comparable to smaller dense models, with efficient diffusion mechanisms supporting multimodal input and providing fine-grained control over generation. This sparse approach represents a paradigm shift in scaling DiT models, demonstrating that architectural efficiency through sparsity can rival the quality of substantially denser models.

Key differences vs Flux.1-Dev and other large DiTs
  • Sparse DiT structure activating only subsets of blocks per forward pass for efficient 17B parameter model
  • 4K ultra-high-definition generation support with optimized inference speed despite massive scale
  • Advanced sparse attention mechanisms maintaining quality while dramatically reducing computational costs
  • Multimodal input support and fine-grained control optimized for professional-grade design applications
HiDream-I1-Dev Architecture
HiDream-I1-Dev Sparse DiT architecture. Image Credit: .

CogView4-6B (2025/05/03)

CogView4-6B represents the latest advancement in the CogView series, featuring a sophisticated CogView4Transformer2DModel architecture that excels in Chinese text rendering and multilingual image generation. The model demonstrates exceptional performance in text accuracy evaluation, achieving precision of 0.6969, recall of 0.5532, and F1 score of 0.6168 on Chinese text benchmarks.

CogView4-6B leverages GLM-based text encoding and advanced transformer blocks with RoPE (Rotary Position Embedding) for enhanced spatial understanding and text-image alignment. This architectural sophistication enables the model to achieve superior text rendering capabilities, particularly for complex Chinese characters and multilingual content, setting new standards for text-to-image generation in non-Latin scripts. Available on Hugging Face under Apache 2.0 license.

Qwen-Image (2025/08/04)

Qwen-Image represents a monumental scaling achievement in text-to-image synthesis, establishing a new state-of-the-art with its massive 28.85 billion parameter architecture . Developed by Alibaba’s Qwen team, this flagship model aims to push the boundaries of generation quality, text-image alignment, and multimodal understanding through unprecedented scale. The model excels at generating highly detailed, photorealistic images that accurately reflect complex textual prompts, setting new benchmarks for fidelity and coherence in the field.

Qwen-Image Architecture
Qwen-Image massively scaled MMDiT architecture. Image Credit: .

Architecturally, Qwen-Image employs a massively scaled Multi-Modal Diffusion Transformer (MMDiT) that builds upon the hybrid single- and double-stream designs seen in models like Flux.1-Dev. The generator model alone comprises over 20 billion parameters, combined with a powerful 8.29 billion parameter text encoder for unparalleled language comprehension. This dual-stream approach allows for sophisticated interaction between text and image modalities, enabling precise control over generated content. The model integrates advanced training techniques, including rectified flow and large-scale data curation, to ensure stable and efficient convergence despite its enormous size.

Key differences vs HiDream-I1-Dev
  • Massive dense scaling to 28.85B parameters (vs HiDream's 17B sparse architecture)
  • Focus on state-of-the-art quality through sheer scale (vs HiDream's focus on efficiency via sparsity)
  • Extremely large 8.29B text encoder for superior text-image alignment
  • Represents the pinnacle of the dense DiT scaling paradigm before potential shifts to new architectures

Experiments and Case Studies

To comprehensively evaluate the capabilities of different text-to-image diffusion models, we propose a systematic evaluation framework spanning tasks of varying complexity. This section will present case studies of text-to-image generation visualizations using existing checkpoints, assessing their performance across a spectrum of increasingly challenging tasks.

Implementation Details

For the commercial model, we use the ChatGPT web UI (GPT-5 Instant) with the same prompt for each case study, generating images at its default size of 1024 × 1024. The remaining checkpoints use the following settings:

| Parameter | Value |
|---|---|
| Precision | bfloat16 |
| Scheduler | default |
| Steps | 50 |
| Guidance Scale | 7.5 |
| Resolution | 512×512 |

Summary of Results
  • There is no strong correlation between image model size and image aesthetics (See case study 4).
  • There is no strong correlation between text model size and prompt following (See case study 5).
  • Larger models generally work better, but this is not always the case.
  • U-Net-based models perform comparatively worse than DiTs of similar size, for instance SDXL vs. SANA and Kandinsky 3 vs. CogView4.
  • Stable Diffusion 3.x, which was continually trained at higher resolutions (e.g., 1024px), tends to generate cropped results.
  • Not all models are capable of handling multilingual prompts (see case study 2).
  • Commercial models such as GPT-Image work extremely well in aesthetics, prompt following, counting, text rendering, and spatial reasoning.

Why Scaling Favors Attention

As diffusion models scaled in data and compute, the active bottleneck shifted from local fidelity to global semantic alignment, and the community moved accordingly: from U-Nets that hard-wire translation equivariance via convolution to Diffusion Transformers that learn equivariances through self-attention. Let $\mathcal{C}^{\mathrm{conv}}_{G}$ be the class of translation-equivariant, finite-support Toeplitz operators (U-Net convolutional kernels) and $\mathcal{A}^{\mathrm{attn}}$ the class of self-attention kernels with relative positional structure (DiTs). Write $\sqsubseteq^{\mathrm{bias}}$ as “is a constrained instance of (via inductive-bias constraints)”.

\[\boxed{ \mathcal{C}^{\mathrm{conv}}_{G}\ \sqsubseteq^{\mathrm{bias}}\ \mathcal{A}^{\mathrm{attn}} }\]

In plain terms, convolution is a simplified, efficient expression of attention obtained by enforcing fixed translation symmetry, parameter tying, and locality; removing these constraints yields attention without a hard-coded translation prior, allowing DiTs to learn which symmetries and long-range relations matter at scale. This inclusion explains the empirical shift under modern hardware and datasets: attention strictly generalizes convolution while retaining it as an efficient special case, delivering smoother scaling laws and higher semantic “bandwidth” per denoising step. In practice, this is also a story of hardware path dependence: attention’s dense-matrix primitives align with contemporary accelerators and compiler stacks, effectively “winning” the hardware lottery . And, echoing the Bitter Lesson, as data and compute grow, general methods with fewer hand-engineered priors dominate—making attention’s strict generalization of convolution the natural backbone at scale.
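
To make the inclusion concrete, a convolution with kernel support $\Delta$ can be recovered from multi-head self-attention by hard-wiring each head's attention pattern to one relative offset (following the standard construction relating self-attention and convolution):

\[
\mathrm{Conv}(X)_i = \sum_{\delta \in \Delta} K_{\delta}\, x_{i+\delta}
\quad\text{equals}\quad
\mathrm{Attn}(X)_i = \sum_{h}\sum_{j} \alpha^{(h)}_{ij}\, W_V^{(h)} x_j
\quad\text{when}\quad
\alpha^{(h)}_{ij} = \mathbf{1}\!\left[j = i + \delta_h\right],\ \ W_V^{(h)} = K_{\delta_h},
\]

i.e., one head per kernel offset with frozen, position-only attention weights; learned attention simply relaxes these constraints.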

Further Discussion

From Text-to-Image Generation to Real-World Applications

Text-to-image is now genuinely strong; the next wave is about conditioning existing pixels rather than generating from scratch—turning models into reliable editors that honor what must stay and change only what’s asked. This means prioritizing downstream tasks like image editing, inpainting/outpainting, image-to-image restyling, and structure- or reference-guided synthesis (edges, depth, layout, style, identity). The practical focus shifts from unconstrained novelty to controllable, faithful rewrites with tight mask adherence, robust subject/style preservation, and interactive latencies, so these systems plug cleanly into real creative, design, and industrial workflows.
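
As a small illustration of the editing workflow this shift implies, the diffusers inpainting pipeline conditions generation on an existing image plus a mask; below is a hedged usage sketch (the model ID, file names, and settings are examples):

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# example checkpoint; any inpainting-finetuned SD model can be substituted
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("room.png").convert("RGB")     # pixels that must be preserved
mask_image = Image.open("mask.png").convert("RGB")     # white = region to rewrite

edited = pipe(
    prompt="a mid-century armchair by the window",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
edited.save("room_edited.png")
```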

Diffusion Models vs. Auto-regressive Models

Diffusion models and autoregressive (AR) models represent two fundamentally different approaches to image generation, with the key distinction being that autoregressive models operate on discrete image tokens while diffusion models work with continuous representations. Autoregressive models like DALL-E , CogView , and CogView2 treat image generation as a sequence modeling problem, encoding images into discrete tokens using VQ-VAE or similar vector quantization methods, then autoregressively predicting the next token given previous tokens. This approach offers sequential generation with precise control and natural language integration, but suffers from slow generation, error accumulation, and discrete representation loss. In contrast, diffusion models operate directly on continuous pixel or latent representations, learning to reverse a gradual noise corruption process, which enables parallel generation, high-quality outputs, and flexible conditioning, though at the cost of computational overhead and less direct control.

Recent advances have significantly improved autoregressive approaches: VAR redefines autoregressive learning as coarse-to-fine “next-scale prediction” and achieves superior performance compared to diffusion transformers, while Infinity demonstrates effective scaling of bitwise autoregressive modeling for high-resolution synthesis. Additionally, MAR bridges the gap between paradigms by adopting diffusion loss for autoregressive models, enabling continuous-valued autoregressive generation without vector quantization.

Recent work has also explored hybrid approaches that combine both paradigms: HunyuanImage 3.0 and BLIP3-o demonstrate unified multimodal models within autoregressive frameworks while incorporating diffusion-inspired techniques, while OmniGen and OmniGen2 use diffusion models as backbones for unified generation capabilities.