Image Generation Case Study

About This Project

This repository provides a comprehensive case study of existing open-source diffusion models' capabilities in text-to-image generation and image editing. It offers a unified interface for comparing and experimenting with 14 state-of-the-art text-to-image models, along with support for closed-source API services.

Whether you're a researcher exploring the latest in generative AI, a developer integrating image generation into your applications, or an enthusiast experimenting with different models, this project provides an easy-to-use platform for text-to-image generation with comprehensive model support and flexible deployment options.

Key Features

🎨 Multi-model Comparison: Generate images from multiple models in a single run for side-by-side comparison
🔓 14 Open-Source Models: Including Stable Diffusion variants, FLUX.1, SDXL, CogView, PixArt, and more
🔒 Closed-Source API Integration: Support for OpenAI DALL-E, Google Imagen, Bytedance Cloud, and Kling AI
🖥️ Gradio Web UI: User-friendly interface for interactive image generation
⚙️ Configurable Parameters: Full control over inference steps, guidance scale, image size, and seed
💾 Auto-save Organization: Images automatically saved with timestamp folders and generation config JSON
🚀 Multi-GPU Support: Automatic device mapping for utilizing multiple GPUs efficiently
📊 Memory Efficient: Models generate one after another rather than concurrently, which limits peak VRAM usage
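
The auto-save feature above (timestamped folders plus a generation config JSON) could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names `make_run_dir` and `save_generation_config`, the `config.json` file name, and the timestamp format are all assumptions.

```python
import json
from datetime import datetime
from pathlib import Path


def make_run_dir(root, now=None):
    """Create and return a timestamped output folder under `root`.

    Hypothetical helper: the folder naming scheme is an assumption.
    """
    now = now or datetime.now()
    run_dir = Path(root) / now.strftime("%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def save_generation_config(run_dir, config):
    """Write the generation settings next to the images as JSON.

    Hypothetical helper: the file name `config.json` is an assumption.
    """
    config_path = Path(run_dir) / "config.json"
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config_path
```

Storing the config beside the images means a run can be reproduced later from the folder alone, provided the seed is recorded.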

Supported Models

Open-Source Text-to-Image Models (14 Total)

From lightweight 3 GB models to state-of-the-art 16 GB models:

⚡ Fast & Efficient

Stable Diffusion 2.1 (~4 GB) - Classic, reliable
PixArt-XL 2 (~4 GB) - Fast generation
Sana 600M (~3 GB) - Lightweight

🎯 High Quality

Stable Diffusion XL (~7 GB) - Higher quality
FLUX.1 Dev (~16 GB) - State-of-the-art
Stable Diffusion 3 (~9 GB) - Latest SD3

🌏 Multilingual

CogView3 Plus 3B (~6 GB) - Multilingual
CogView4 6B (~11 GB) - Latest CogView
HunyuanDiT v1.2 (~10 GB) - Chinese + English

🔬 Specialized

Stable Cascade (~10 GB) - Multi-stage
Qwen Image (~8 GB) - Multimodal
UniDiffuser v1 (~5 GB) - Unified model
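
To pick models for a given machine, the approximate sizes listed above can be treated as a rough budget. The sketch below is illustrative only: the `APPROX_SIZE_GB` table simply copies the figures from this list, `models_within_budget` is a hypothetical helper, and actual VRAM use at inference time may differ from these numbers.

```python
# Approximate sizes in GB, copied from the list above (not measured here).
APPROX_SIZE_GB = {
    "Stable Diffusion 2.1": 4,
    "PixArt-XL 2": 4,
    "Sana 600M": 3,
    "Stable Diffusion XL": 7,
    "FLUX.1 Dev": 16,
    "Stable Diffusion 3": 9,
    "CogView3 Plus 3B": 6,
    "CogView4 6B": 11,
    "HunyuanDiT v1.2": 10,
    "Stable Cascade": 10,
    "Qwen Image": 8,
    "UniDiffuser v1": 5,
}


def models_within_budget(budget_gb, sizes=APPROX_SIZE_GB):
    """Return model names whose listed size fits the budget, smallest first."""
    fitting = [(size, name) for name, size in sizes.items() if size <= budget_gb]
    # Sorting the (size, name) pairs orders by size, then alphabetically.
    return [name for _size, name in sorted(fitting)]


if __name__ == "__main__":
    print(models_within_budget(8))
```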

Closed-Source API Services

OpenAI DALL-E: DALL-E 2 and DALL-E 3, with quality and style controls on DALL-E 3 (up to 1792x1024 or 1024x1792)
Google Imagen: Vertex AI Imagen for photorealistic generation (up to 1536x1536)
Bytedance Cloud: Volcano Engine text-to-image API (up to 2048x2048)
Kling AI: High-quality generation models (up to 2048x2048)

Quick Start

Installation

# Clone the repository
git clone https://github.com/Bili-Sakura/image-generation-case-study.git
cd image-generation-case-study

# Install dependencies
pip install -r requirements.txt

# Optional: Install API dependencies for closed-source models
pip install -r requirements_api.txt

Usage

Option 1: Gradio Web UI (Recommended)

python run.py

This launches the Gradio web UI at http://localhost:7860 for interactive text-to-image generation.

Option 2: Python API

from src.model_manager import get_model_manager
from src.inference import generate_image

# Load model
manager = get_model_manager()
manager.load_model("stabilityai/stable-diffusion-2-1")

# Generate
image, filepath, seed = generate_image(
    model_id="stabilityai/stable-diffusion-2-1",
    prompt="A fantasy landscape with mountains and rivers",
    num_inference_steps=50,
    guidance_scale=7.5,
    seed=42
)
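
A side-by-side comparison across several models could be driven by a simple sequential loop. The sketch below is an assumption about how one might structure it, not the repository's implementation: `compare_models` is a hypothetical helper, and the generation function is passed in as `generate_fn` so the loop does not depend on any particular backend. In this project, `generate_fn` would presumably be `generate_image` from the example above, called after loading each model.

```python
def compare_models(model_ids, prompt, generate_fn, **params):
    """Run one prompt through several models, one at a time.

    `generate_fn` is called as generate_fn(model_id=..., prompt=..., **params).
    A failure in one model is recorded and does not stop the others.
    """
    results = {}
    for model_id in model_ids:
        try:
            results[model_id] = generate_fn(
                model_id=model_id, prompt=prompt, **params
            )
        except Exception as exc:  # keep going so the comparison still completes
            results[model_id] = exc
    return results
```

Passing the same `seed` to every model keeps each model's own output repeatable across runs, though different models will still produce different images from the same seed.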

Generation Parameters

Inference Steps: 10-100 (default: 50) - More steps = higher quality but slower
Guidance Scale: 1.0-20.0 (default: 7.5) - Higher values = stronger prompt adherence
Image Sizes: 512px to 1280px with multiple presets
Seed Control: Set a fixed seed for reproducibility, or use -1 for a random seed
Negative Prompts: Supported on compatible models
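
The ranges and defaults above can be checked before a generation call. The following is a minimal sketch under stated assumptions: `check_params` and `resolve_seed` are hypothetical helpers, and the upper bound used for random seeds (`2**32 - 1`) is an assumption rather than something this project specifies.

```python
import random

STEPS_RANGE = (10, 100)       # inference steps, per the list above
GUIDANCE_RANGE = (1.0, 20.0)  # guidance scale, per the list above
MAX_SEED = 2**32 - 1          # assumed upper bound for random seeds


def resolve_seed(seed, rng=random):
    """Return the seed unchanged, or draw a random one when seed is -1."""
    if seed == -1:
        return rng.randint(0, MAX_SEED)
    return seed


def check_params(num_inference_steps=50, guidance_scale=7.5, seed=-1):
    """Validate parameters against the documented ranges and resolve the seed."""
    lo, hi = STEPS_RANGE
    if not lo <= num_inference_steps <= hi:
        raise ValueError(f"num_inference_steps must be in [{lo}, {hi}]")
    lo, hi = GUIDANCE_RANGE
    if not lo <= guidance_scale <= hi:
        raise ValueError(f"guidance_scale must be in [{lo}, {hi}]")
    return {
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "seed": resolve_seed(seed),
    }
```

Resolving -1 to a concrete number before generating means the seed that was actually used can be saved and reused later.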

Citation

If you find this repository useful, please cite it as:

@misc{bili_sakura_image_generation_case_study,
  author       = {Bili-Sakura},
  title        = {Image Generation Case Study},
  year         = {2025},
  howpublished = {\url{https://github.com/Bili-Sakura/image-generation-case-study}}
}