Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This lets you load larger models that normally wouldn’t fit into memory, and it speeds up inference.
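
To make the idea concrete, here is a minimal sketch of symmetric absmax int8 quantization in plain Python. It illustrates one common scheme only and is not the implementation used by any of the backends documented on this page; the function names `quantize_absmax_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to int8 codes using symmetric absmax scaling (illustrative)."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any nonzero scale would work.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# Largest absolute difference between original and restored weights.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value is stored as a 1-byte code instead of a 4-byte float32, plus one shared scale, at the cost of a rounding error of at most about half the scale per value.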
> [!TIP]
> Learn how to quantize models in the Quantization guide.
[[autodoc]] quantizers.PipelineQuantizationConfig
[[autodoc]] quantizers.quantization_config.BitsAndBytesConfig
[[autodoc]] quantizers.quantization_config.GGUFQuantizationConfig
[[autodoc]] quantizers.quantization_config.QuantoConfig
[[autodoc]] quantizers.quantization_config.TorchAoConfig
[[autodoc]] quantizers.base.DiffusersQuantizer