# Quantization

Quantization techniques focus on representing data with less information while trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.
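To make the savings concrete, here is a minimal back-of-the-envelope sketch (the 7B parameter count is an illustrative assumption, not a specific checkpoint) of how weight precision translates into memory for the weights alone:

```python
# Rough memory estimate for the weights of a hypothetical 7B-parameter model
# at different precisions (weights only; activations and KV cache not included).
num_params = 7_000_000_000

for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = num_params * bytes_per_param / 1024**3
    print(f"{name:>8}: ~{gib:.1f} GiB")

# float32: ~26.1 GiB
# float16: ~13.0 GiB
#    int8: ~6.5 GiB
#    int4: ~3.3 GiB
```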

Interested in adding a new quantization method to Transformers? Read the HfQuantizer guide to learn how!

If you are new to the quantization field, we recommend checking out these beginner-friendly courses about quantization, created in collaboration with DeepLearning.AI:

## When to use what?

The community has developed many quantization methods for various use cases. With Transformers, you can run any of these integrated methods depending on your use case, because each method has its own pros and cons.

For example, some quantization methods require calibrating the model with a dataset for more accurate and "extreme" compression (down to 1-2 bits), while other methods work out of the box with on-the-fly quantization.
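As a minimal sketch of the on-the-fly path, the snippet below quantizes weights to 4-bit with bitsandbytes while loading (the checkpoint name is only an example; calibration-based methods such as GPTQ or AWQ instead take a dataset at quantization time):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the weights to 4-bit on the fly while loading -- no calibration dataset needed.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example checkpoint; any causal LM works here
    quantization_config=quantization_config,
    device_map="auto",
)
```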

Another factor to consider is compatibility with your target device: do you want to quantize on a CPU, a GPU, or Apple silicon?

In short, supporting a wide range of quantization methods lets you pick the best one for your specific use case.

Use the table below to help you decide which quantization method to use.

| Quantization method     | On the fly quantization | CPU  | CUDA GPU | RoCm GPU (AMD) | Metal (Apple Silicon)              | torch.compile() support | Number of bits | Supports fine-tuning (through PEFT) | Serializable with 🤗 transformers | 🤗 transformers support | Link to library |
|-------------------------|-------------------------|------|----------|----------------|------------------------------------|-------------------------|----------------|-------------------------------------|-----------------------------------|-------------------------|-----------------|
| AQLM                    | 🔴                      | 🟢   | 🟢       | 🔴             | 🔴                                 | 🟢                      | 1 / 2          | 🟢                                  | 🟢                                | 🟢                      | https://github.com/Vahe1994/AQLM |
| AWQ                     | 🔴                      | 🔴   | 🟢       | 🟢             | 🔴                                 | ?                       | 4              | 🟢                                  | 🟢                                | 🟢                      | https://github.com/casper-hansen/AutoAWQ |
| bitsandbytes            | 🟢                      | 🟡 * | 🟢       | 🟡 *           | 🔴 **                              | 🔴 (soon!)              | 4 / 8          | 🟢                                  | 🟢                                | 🟢                      | https://github.com/bitsandbytes-foundation/bitsandbytes |
| compressed-tensors      | 🔴                      | 🟢   | 🟢       | 🟢             | 🔴                                 | 🔴                      | 1 - 8          | 🟢                                  | 🟢                                | 🟢                      | https://github.com/neuralmagic/compressed-tensors |
| EETQ                    | 🟢                      | 🔴   | 🟢       | 🔴             | 🔴                                 | ?                       | 8              | 🟢                                  | 🟢                                | 🟢                      | https://github.com/NetEase-FuXi/EETQ |
| GGUF / GGML (llama.cpp) | 🟢                      | 🟢   | 🟢       | 🔴             | 🟢                                 | 🔴                      | 1 - 8          | 🔴                                  | See GGUF section                  | See GGUF section        | https://github.com/ggerganov/llama.cpp |
| GPTQ                    | 🔴                      | 🔴   | 🟢       | 🟢             | 🔴                                 | 🔴                      | 2 - 3 - 4 - 8  | 🟢                                  | 🟢                                | 🟢                      | https://github.com/AutoGPTQ/AutoGPTQ |
| HQQ                     | 🟢                      | 🟢   | 🟢       | 🔴             | 🔴                                 | 🟢                      | 1 - 8          | 🟢                                  | 🔴                                | 🟢                      | https://github.com/mobiusml/hqq/ |
| Quanto                  | 🟢                      | 🟢   | 🟢       | 🔴             | 🟢                                 | 🟢                      | 2 / 4 / 8      | 🔴                                  | 🔴                                | 🟢                      | https://github.com/huggingface/quanto |
| FBGEMM_FP8              | 🟢                      | 🔴   | 🟢       | 🔴             | 🔴                                 | 🔴                      | 8              | 🔴                                  | 🟢                                | 🟢                      | https://github.com/pytorch/FBGEMM |
| torchao                 | 🟢                      |      | 🟢       | 🔴             | partial support (int4 weight only) |                         | 4 / 8          |                                     | 🟢🔴                              | 🟢                      | https://github.com/pytorch/ao |
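The "Serializable with 🤗 transformers" column indicates whether a quantized model can be written back to disk with the usual APIs. As a minimal sketch, assuming `model` was loaded with a method marked 🟢 in that column (for example, the bitsandbytes snippet above) and `./my-quantized-model` is just a placeholder path:

```python
from transformers import AutoModelForCausalLM

# The quantization config is stored alongside the saved weights, so the model
# reloads in its quantized form without re-running quantization.
model.save_pretrained("./my-quantized-model")

reloaded = AutoModelForCausalLM.from_pretrained("./my-quantized-model", device_map="auto")
```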

\* bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit this link.

We value your feedback to help identify bugs before the full release! Check out these docs for more details and feedback links.

\*\* bitsandbytes is seeking contributors to help develop and lead the Apple Silicon backend. Interested? Contact them directly via their repo. Stipends may be available through sponsorships.