About LLM Compressor
LLM Compressor is an easy-to-use library for optimizing large language models for deployment with vLLM, enabling up to 5X faster, cheaper inference. It provides a comprehensive toolkit for:
- Applying a wide variety of compression algorithms, including weight and activation quantization, pruning, and more
- Seamlessly integrating with Hugging Face Transformers, Models, and Datasets
- Using a `safetensors`-based file format for compressed model storage that is compatible with vLLM
- Supporting performant compression of large models via `accelerate`
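
As a rough end-to-end sketch of that workflow (the model ID, calibration dataset, and output directory below are placeholders, and the exact `oneshot` arguments may differ between versions), a typical one-shot compression run looks like this:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# A recipe is a list of modifiers applied in order: SmoothQuant migrates
# activation outliers into the weights, then GPTQ quantizes weights and
# activations to INT8 (W8A8). The lm_head is left unquantized.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# oneshot pulls the model and calibration data from the Hugging Face Hub,
# applies the recipe, and saves a compressed safetensors checkpoint that
# vLLM can load directly.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # placeholder calibration set
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```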

Recent Updates
Llama4 Quantization Support
Quantize a Llama4 model to W4A16 or NVFP4. The checkpoint produced can seamlessly run in vLLM.
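As a rough sketch of the W4A16 path (the Llama4 example in the repository includes additional model-specific handling, such as for MoE modules, that is omitted here), the recipe reduces to a single GPTQ modifier passed to `oneshot()` as in the sketch above:

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# W4A16: 4-bit weights, 16-bit activations. The lm_head (and, in the full
# Llama4 example, MoE-specific modules) is excluded from quantization.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
```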
Large Model Support with Sequential Onloading
As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see Big Modeling with Sequential Onloading as well as the DeepSeek-R1 Example.
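As a rough sketch (the checkpoint path, dataset, and recipe are placeholders), the key point is to load the model into CPU memory rather than spreading it across devices, then let `oneshot` move one layer at a time onto the GPU:

```python
import torch
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Load the whole model into CPU memory (no device_map). During oneshot,
# layers are onloaded to the GPU one at a time, calibrated and quantized,
# then offloaded again, so a single GPU suffices.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/very-large-model",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
)

oneshot(
    model=model,
    dataset="open_platypus",  # placeholder calibration dataset
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    output_dir="very-large-model-W4A16",
)
```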
Preliminary FP4 Quantization Support
Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 configuration. See the examples of weight-only quantization and FP4 activation support. Support is currently preliminary; additional support for MoEs will be added.
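A minimal sketch of the FP4 path (the scheme name follows the NVFP4 example in the repository; model and dataset identifiers are placeholders), using a quantization modifier that is passed to `oneshot()` as in the earlier sketch:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize both weights and activations to FP4 following the NVFP4
# configuration; activation scales are calibrated on a small sample set,
# and lm_head stays unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
```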
Updated AWQ Support
Improved support for MoEs with better handling of larger models
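For reference, an AWQ recipe looks much like the GPTQ ones above; the scheme name and ignore list here are illustrative assumptions:

```python
from llmcompressor.modifiers.awq import AWQModifier

# AWQ uses calibration activations to rescale salient weight channels before
# quantizing weights to 4 bits; the recipe is passed to oneshot() as above.
recipe = AWQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
```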
Axolotl Sparse Finetuning Integration
Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create fast sparse open-source models with Axolotl and LLM Compressor. See also the Axolotl integration docs.
For more information, check out the latest release on GitHub.
Key Features
- Weight and Activation Quantization: Reduce model size and improve inference performance for general and server-based applications with the latest research.
    - Supported Algorithms: GPTQ, AWQ, SmoothQuant, RTN
    - Supported Formats: INT W8A8, FP W8A8
- Weight-Only Quantization: Reduce model size and improve inference performance for latency-sensitive applications with the latest research.
    - Supported Algorithms: GPTQ, AWQ, RTN
    - Supported Formats: INT W4A16, INT W8A16
- Weight Pruning: Reduce model size and improve inference performance for all use cases with the latest research; a pruning sketch follows this list.
    - Supported Algorithms: SparseGPT, Magnitude, Sparse Finetuning
    - Supported Formats: 2:4 (semi-structured), unstructured
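
As a rough sketch of the pruning path (the import path, modifier arguments, model ID, and dataset are assumptions that may vary by version), a 2:4 semi-structured pruning run looks like:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier

# Prune 50% of weights in a 2:4 (semi-structured) pattern using SparseGPT.
recipe = SparseGPTModifier(sparsity=0.5, mask_structure="2:4")

oneshot(
    model="path/to/model-or-hub-id",  # placeholder model
    dataset="open_platypus",          # placeholder calibration dataset
    recipe=recipe,
    output_dir="model-2of4-sparse",
)
```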
Key Sections
- Getting Started: Install LLM Compressor and learn how to apply your first optimization recipe.
- Guides: Detailed guides covering compression schemes, algorithms, and advanced usage patterns.
- Examples: Step-by-step examples for different compression techniques and model types.
- Developer Resources: Information for contributors and developers extending LLM Compressor.