# Compile and offloading quantized models

Optimizing models often involves trade-offs between inference speed and memory usage. For instance, while caching can boost inference speed, it also increases memory consumption because it needs to store the outputs of intermediate attention layers. A more balanced optimization strategy combines quantizing a model, torch.compile, and various offloading methods.

> [!TIP]
> Check the torch.compile guide to learn more about compilation and how it can be applied here. For example, regional compilation can significantly reduce compilation time without giving up any speedups.
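
As a rough sketch of what regional compilation looks like in practice (an illustration, not this guide's code), each repeated transformer block can be compiled individually instead of compiling the whole model, so one compiled graph is reused across all blocks. The `transformer_blocks` attribute below is an assumption that holds for Flux-style transformers in Diffusers; other models may name their block lists differently.

```py
# Hedged sketch of regional compilation. Assumes `pipeline` is loaded as in
# the examples below and that the model stores its repeated blocks in
# `transformer_blocks` (true for Flux-style transformers). Because the blocks
# share one structure, torch.compile reuses a single compiled graph for all
# of them, reducing cold-start compilation time.
for block in pipeline.transformer.transformer_blocks:
    block.compile()
```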

For image generation, combining quantization and model offloading often gives the best trade-off between quality, speed, and memory. Group offloading is not as effective for image generation because it is usually not possible to fully overlap data transfer if the compute kernel finishes faster. This results in some communication overhead between the CPU and GPU.

For video generation, combining quantization and group offloading tends to be better because video models are more compute-bound.

The table below compares combinations of optimization strategies and their impact on latency and memory usage for Flux.

| combination | latency (s) | memory usage (GB) |
|---|---|---|
| quantization | 32.602 | 14.9453 |
| quantization, torch.compile | 25.847 | 14.9448 |
| quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |

These results were benchmarked on Flux with an RTX 4090. The transformer and text_encoder components are quantized. Refer to the benchmarking script if you're interested in evaluating your own model.
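
If you want a quick way to produce numbers like these for your own setup, the sketch below is a minimal, hypothetical benchmark (not the benchmarking script referenced above). It warms up the pipeline once so compilation time isn't measured, then times a generation and reads peak GPU memory.

```py
import time
import torch

# Minimal benchmark sketch (hypothetical, not the referenced benchmarking
# script). Assumes `pipeline` is already loaded on "cuda" as in the
# examples below.
prompt = "cinematic film still of a cat sipping a margarita in a pool"

pipeline(prompt)  # warmup run so compilation isn't included in the timing

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
pipeline(prompt)
torch.cuda.synchronize()

print(f"latency: {time.perf_counter() - start:.3f} s")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1024**3:.4f} GB")
```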

This guide will show you how to compile and offload a quantized model with bitsandbytes. Make sure you are using PyTorch nightly and the latest version of bitsandbytes.

```bash
pip install -U bitsandbytes
```

## Quantization and torch.compile

Start by quantizing a model to reduce the memory required for storage, and compiling it to accelerate inference.

Configure the Dynamo `capture_dynamic_output_shape_ops = True` setting to handle dynamic outputs when compiling bitsandbytes models.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
pipeline(
    "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
).images[0]
```
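
Note that the first pipeline call triggers compilation and is noticeably slower; subsequent calls with the same shapes reuse the compiled graph and run at full speed.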

## Quantization, torch.compile, and offloading

In addition to quantization and torch.compile, try offloading if you need to reduce memory usage further. Offloading moves various layers or model components from the CPU to the GPU as needed for computation.

Configure the Dynamo `cache_size_limit` during offloading to avoid excessive recompilation, and set `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.

[Model CPU offloading](./memory#model-offloading) moves an individual pipeline component, such as the transformer model, to the GPU when it is needed for computation. Otherwise, it is offloaded to the CPU.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

torch._dynamo.config.cache_size_limit = 1000
torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# model CPU offloading
pipeline.enable_model_cpu_offload()

# compile
pipeline.transformer.compile()
pipeline(
    "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
).images[0]
```

[Group offloading](./memory#group-offloading) moves the internal layers of an individual pipeline component, such as the transformer model, to the GPU for computation and offloads them when they aren't needed. At the same time, it uses the [CUDA stream](./memory#cuda-stream) feature to prefetch the next layer for execution.

By overlapping computation and data transfer, it is faster than model CPU offloading while also saving memory.

```py
# pip install ftfy
import torch
from diffusers import AutoModel, DiffusionPipeline
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import UMT5EncoderModel

torch._dynamo.config.cache_size_limit = 1000
torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder"],
)
text_encoder = UMT5EncoderModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16
)
pipeline = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# group offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True,
)
pipeline.vae.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True,
)
apply_group_offloading(
    pipeline.text_encoder,
    onload_device=onload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True,
)

# compile
pipeline.transformer.compile()

prompt = """
The camera rushes from far to near in a low-angle shot, revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

output = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```