HunyuanVideo-1.5
HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models.
You can find all the original HunyuanVideo checkpoints under the Tencent organization.
[!TIP]
Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.
The examples below use a checkpoint from hunyuanvideo-community because the weights are stored in a layout compatible with Diffusers.
The example below demonstrates how to generate a video optimized for memory or inference speed.
Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.
```py
import torch
from diffusers import AutoModel, HunyuanVideo15Pipeline
from diffusers.utils import export_to_video
pipeline = HunyuanVideo15Pipeline.from_pretrained(
"HunyuanVideo-1.5-Diffusers-480p_t2v",
torch_dtype=torch.bfloat16,
)
# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()
prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```
## Notes
- HunyuanVideo1.5 use attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently.
- **H100/H800:** `_flash_3_hub` or `_flash_3_varlen_hub`
- **A100/A800/RTX 4090:** `flash_hub` or `flash_varlen_hub`
- **Other GPUs:** `sage_hub`
Refer to the [Attention backends](../../optimization/attention_backends) guide for more details about using a different backend.
```py
pipe.transformer.set_attention_backend("flash_hub") # or your preferred backend
```
- [`HunyuanVideo15Pipeline`] use guider and does not take `guidance_scale` parameter at runtime.
You can check the default guider configuration using `pipe.guider`:
```py
>>> pipe.guider
ClassifierFreeGuidance {
"_class_name": "ClassifierFreeGuidance",
"_diffusers_version": "0.36.0.dev0",
"enabled": true,
"guidance_rescale": 0.0,
"guidance_scale": 6.0,
"start": 0.0,
"stop": 1.0,
"use_original_formulation": false
}
State:
step: None
num_inference_steps: None
timestep: None
count_prepared: 0
enabled: True
num_conditions: 2
```
To update guider configuration, you can run `pipe.guider = pipe.guider.new(...)`
```py
pipe.guider = pipe.guider.new(guidance_scale=5.0)
```
Read more on Guider [here](../../modular_diffusers/guiders).
## HunyuanVideo15Pipeline
[[autodoc]] HunyuanVideo15Pipeline
- all
- __call__
## HunyuanVideo15ImageToVideoPipeline
[[autodoc]] HunyuanVideo15ImageToVideoPipeline
- all
- __call__
## HunyuanVideo15PipelineOutput
[[autodoc]] pipelines.hunyuan_video1_5.pipeline_output.HunyuanVideo15PipelineOutput