Asynchronous server and parallel execution of models

Example/demo server that keeps a single model in memory while safely running parallel inference requests. It does this by creating lightweight per-request views of the pipeline and cloning only the small, stateful components (schedulers, RNG state, small mutable attributes). Works with StableDiffusion3 pipelines. For good throughput we recommend running 10 to 50 inferences in parallel, with individual requests typically completing in roughly 25-30 seconds up to 1 minute 30 seconds. (This is only recommended on a GPU with 35 GB of VRAM or more; otherwise keep it to one or two parallel inferences to avoid decoding or saving errors caused by running out of memory.)

⚠️ IMPORTANT

Necessary components

All the components needed to create the inference server are in the current directory:

server-async/
├── utils/
│   ├── __init__.py
│   ├── scheduler.py              # BaseAsyncScheduler wrapper and async_retrieve_timesteps for safe async inference
│   ├── requestscopedpipeline.py  # RequestScopedPipeline for inference with a single in-memory model
│   └── utils.py                  # Image/video saving utilities and service configuration
├── Pipelines.py                  # Pipeline loader classes (SD3)
├── serverasync.py                # FastAPI app with lifespan management and async inference endpoints
├── test.py                       # Client test script for inference requests
├── requirements.txt              # Dependencies
└── README.md                     # This documentation

What diffusers-async adds / Why we needed it

Core problem: a naive server that calls pipe.__call__ concurrently can hit race conditions (e.g., scheduler.set_timesteps mutates shared state), and the obvious workaround, deep-copying the whole pipeline per request, explodes memory instead.

diffusers-async / this example addresses that by creating a lightweight per-request view of the pipeline that shares the heavy weights and clones only the small, stateful components (scheduler, RNG state, small mutable attributes), so each request mutates its own copies instead of shared state.
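
As a rough conceptual sketch of that idea (this is not the project's actual code; the real implementation selects what to clone more carefully via RequestScopedPipeline and BaseAsyncScheduler):

import copy

def run_isolated(pipe, prompt, num_inference_steps=30, generator=None):
    # Shallow view: heavy modules (transformer, VAE, text encoders) stay shared.
    pipe_view = copy.copy(pipe)
    # Deep-copy only the small, stateful scheduler so set_timesteps() mutates
    # a per-request copy instead of shared state.
    pipe_view.scheduler = copy.deepcopy(pipe.scheduler)
    return pipe_view(
        prompt=prompt,
        num_inference_steps=num_inference_steps,
        generator=generator,
    )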

How the server works (high-level flow)

  1. A single model instance is loaded into memory (GPU/MPS) when the server starts.
  2. On each HTTP inference request, the server builds a lightweight per-request view of that model: heavy weights are shared, while small mutable components (scheduler, RNG state, small attributes) are cloned for that request only (see the sketch after this list).
  3. Result: inference completes, images are moved to CPU & saved (if requested), internal buffers are freed (GC + torch.cuda.empty_cache()).
  4. Multiple requests can run in parallel while sharing the heavy weights and isolating mutable state.
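
A minimal sketch of that flow, assuming RequestScopedPipeline is importable from utils/requestscopedpipeline.py as laid out above and exposes a generate(...) entry point (the actual method name and output type may differ; check the file):

import asyncio
import gc
import torch
from diffusers import StableDiffusion3Pipeline
from utils.requestscopedpipeline import RequestScopedPipeline

# 1. Load the single model instance once, at startup.
base = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16
).to("cuda")
scoped = RequestScopedPipeline(base)

async def handle_request(prompt: str):
    # 2. Each request runs against its own lightweight view of the shared model.
    #    `generate` is an assumed entry point for this sketch.
    result = await asyncio.to_thread(scoped.generate, prompt=prompt, num_inference_steps=30)
    images = list(result.images)  # assuming a standard pipeline-style output with .images
    # 3. Free internal buffers once the request is done.
    gc.collect()
    torch.cuda.empty_cache()
    return images

async def main():
    # 4. Several requests can be in flight at once against the same in-memory model.
    prompts = ["a red fox in the snow", "a futuristic cityscape, vibrant colors"]
    await asyncio.gather(*(handle_request(p) for p in prompts))

# asyncio.run(main())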

How to set up and run the server

1) Install dependencies

Recommended: create a virtualenv / conda environment.

pip install diffusers
pip install -r requirements.txt

2) Start the server

Run the serverasync.py file, which already includes everything you need:

python serverasync.py

The server will start on http://localhost:8500 by default, serving the FastAPI app with lifespan management and the async inference endpoints.

3) Test the server

Use the included test script:

python test.py

Or send a manual request:

POST /api/diffusers/inference with JSON body:

{
  "prompt": "A futuristic cityscape, vibrant colors",
  "num_inference_steps": 30,
  "num_images_per_prompt": 1
}

Response example:

{
  "response": ["http://localhost:8500/images/img123.png"]
}
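
As an alternative to test.py, a minimal Python client for the request/response shapes shown above could look like this (a sketch, not part of the repository):

import requests

payload = {
    "prompt": "A futuristic cityscape, vibrant colors",
    "num_inference_steps": 30,
    "num_images_per_prompt": 1,
}

# Send the inference request.
resp = requests.post("http://localhost:8500/api/diffusers/inference", json=payload, timeout=600)
resp.raise_for_status()

# The server answers with a list of image URLs under "response"; download each one.
for i, url in enumerate(resp.json()["response"]):
    img = requests.get(url, timeout=60)
    img.raise_for_status()
    with open(f"result_{i}.png", "wb") as f:
        f.write(img.content)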

4) Server endpoints

The endpoints exercised above are:

POST /api/diffusers/inference: accepts a JSON body like the one shown and runs the inference.
GET /images/<filename>: serves the generated images referenced by the URLs in the response.

Advanced Configuration

RequestScopedPipeline Parameters

RequestScopedPipeline(
    pipeline,                          # Base pipeline to wrap
    mutable_attrs=None,                # Custom list of attributes to clone
    auto_detect_mutables=True,         # Enable automatic detection of mutable attributes
    tensor_numel_threshold=1_000_000,  # Tensor size threshold for cloning
    tokenizer_lock=None,               # Custom threading lock for tokenizers
    wrap_scheduler=True                # Auto-wrap scheduler in BaseAsyncScheduler
)
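
For example, wrapping an already-loaded SD3 pipeline with a custom tokenizer lock and an explicit attribute list might look like this (a sketch; the parameter names come from the signature above, while the specific values, the string form of mutable_attrs, and the checkpoint are illustrative assumptions):

import threading
import torch
from diffusers import StableDiffusion3Pipeline
from utils.requestscopedpipeline import RequestScopedPipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16
).to("cuda")

scoped = RequestScopedPipeline(
    pipe,
    mutable_attrs=["scheduler"],       # assumed to be attribute names to clone per request
    auto_detect_mutables=True,         # keep auto-detection for other small mutable state
    tensor_numel_threshold=1_000_000,  # size cutoff used when deciding which tensors to clone
    tokenizer_lock=threading.Lock(),   # serialize tokenizer access across concurrent requests
    wrap_scheduler=True,               # wrap the scheduler in BaseAsyncScheduler
)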

BaseAsyncScheduler Features

Server Configuration

The server configuration can be modified in serverasync.py through the ServerConfigModels dataclass:

@dataclass
class ServerConfigModels:
    model: str = 'stabilityai/stable-diffusion-3.5-medium'  
    type_models: str = 't2im'  
    host: str = '0.0.0.0' 
    port: int = 8500
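
For example, to bind the server to localhost only and serve on a different port, you could adjust the defaults like this (a sketch; field names follow the dataclass above, and 't2im' is read here as text-to-image):

from dataclasses import dataclass

@dataclass
class ServerConfigModels:
    model: str = 'stabilityai/stable-diffusion-3.5-medium'  # any SD3-family checkpoint
    type_models: str = 't2im'                               # text-to-image
    host: str = '127.0.0.1'                                 # bind to localhost only instead of 0.0.0.0
    port: int = 8600                                        # serve on a different port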

Troubleshooting (quick)