mirror of https://github.com/huggingface/transformers.git synced 2025-07-05 13:50:13 +06:00

Enable BNB multi-backend support (#31098)

* enable cpu bnb path

* fix style

* fix code style

* fix 4 bit path

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* add multi backend refactor tests

* fix style

* tweak 4bit quantizer + fix corresponding tests

* tweak 8bit quantizer + *try* fixing corresponding tests

* fix dequant bnb 8bit

* account for Intel CPU in variability of expected outputs

* enable cpu and xpu device map

* further tweaks to account for Intel CPU

* fix autocast to work with both cpu + cuda

* fix comments

* fix comments

* switch to testing_utils.torch_device

* allow for xpu in multi-gpu tests

* fix tests 4bit for CPU NF4

* fix bug with is_torch_xpu_available needing to be called as func

* avoid issue where test reports attr err due to other failure

* fix formatting

* fix typo from resolving of merge conflict

* polish based on last PR review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* fix CI

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix error log

* fix error msg

* add \n in error log

* make quality

* rm bnb cuda restriction in doc

* cpu model don't need dispatch

* fix doc

* fix style

* check cuda avaliable in testing

* fix tests

* Update docs/source/en/model_doc/chameleon.md

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix doc

* fix check multibackends

* fix import sort

* remove check torch in bnb

* docs: update bitsandbytes references with multi-backend info

* docs: fix small mistakes in bnb paragraph

* run formatting

* reveret bnb check

* move bnb multi-backend check to import_utils

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix bnb check

* minor fix for bnb

* check lib first

* fix code style

* Revert "run formatting"

This reverts commit ac108c6d6b.

* fix format

* give warning when bnb version is low and no cuda found]

* fix device assignment check to be multi-device capable

* address akx feedback on get_avlbl_dev fn

* revert partially, as we don't want the function that public, as docs would be too much (enforced)

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

2024-09-24 03:40:56 -06:00


Quantization

Quantization techniques represent data with less information while trying to preserve as much accuracy as possible. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating-point values and are quantized to 16-bit floating-point values, the model size is halved, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because calculations with fewer bits take less time.
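To make the size claim concrete, here is a minimal sketch that uses only the Python standard library. It serializes a handful of made-up weights (the values are an assumption for illustration, not taken from any model) as 32-bit and as 16-bit floats, then compares the byte counts and the rounding error introduced by the narrower type.

```python
import struct

# Hypothetical toy "weights"; real models hold millions or billions of values.
weights = [0.1234567, -1.5, 3.1415926, 0.0003, 42.0]
count = len(weights)

# Serialize every weight as a 32-bit float ("f") and as a 16-bit float ("e").
fp32_bytes = struct.pack(f"<{count}f", *weights)
fp16_bytes = struct.pack(f"<{count}e", *weights)

# Read the 16-bit values back to measure how much precision was lost.
restored = struct.unpack(f"<{count}e", fp16_bytes)
max_abs_error = max(abs(w - r) for w, r in zip(weights, restored))

print(len(fp32_bytes), len(fp16_bytes))  # 20 10
print(max_abs_error)
```

The 16-bit buffer is exactly half the size of the 32-bit one, and the nonzero error is the accuracy cost that quantization methods try to keep small.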

Interested in adding a new quantization method to Transformers? Read the HfQuantizer guide to learn how!

If you are new to quantization, we recommend the beginner-friendly courses on quantization created in collaboration with DeepLearning.AI.

When to use what?

The community has developed many quantization methods for various use cases. Each method has its own pros and cons, and with Transformers you can run whichever of the integrated methods best fits your use case.

For example, some quantization methods require calibrating the model with a dataset for more accurate and "extreme" compression (down to 1-2 bit quantization), while other methods work out of the box with on-the-fly quantization.
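As a rough illustration of what "out of the box" can mean, the sketch below quantizes a list of floats to 8-bit integers using only the values themselves, with no calibration dataset. It assumes symmetric absmax scaling, a common textbook scheme chosen here for simplicity; it is not presented as the algorithm of any library in this guide, and the function names are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.1, 0.25, 2.0]  # hypothetical toy weights
q, scale = quantize_absmax_int8(weights)
approx = dequantize(q, scale)

print(q)       # small integers that fit in one byte each
print(approx)  # close to, but not exactly, the original weights
```

The scale is derived from the largest absolute weight, so the scheme needs no extra data. Calibration-based methods instead use sample inputs to choose their quantization parameters, which is what allows them to reach lower bit widths at acceptable accuracy.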

Another parameter to consider is compatibility with your target device. Do you want to quantize on a CPU, GPU, or Apple silicon?

In short, supporting a wide range of quantization methods allows you to pick the best quantization method for your specific use case.

Use the table below to help you decide which quantization method to use.

Quantization method	On-the-fly quantization	CPU	CUDA GPU	ROCm GPU (AMD)	Metal (Apple Silicon)	torch.compile() support	Number of bits	Supports fine-tuning (through PEFT)	Serializable with 🤗 transformers	🤗 transformers support	Link to library
AQLM	🔴	🟢	🟢	🔴	🔴	🟢	1 / 2	🟢	🟢	🟢	https://github.com/Vahe1994/AQLM
AWQ	🔴	🔴	🟢	🟢	🔴	?	4	🟢	🟢	🟢	https://github.com/casper-hansen/AutoAWQ
bitsandbytes	🟢	🟡 *	🟢	🟡 *	🔴 **	🔴 (soon!)	4 / 8	🟢	🟢	🟢	https://github.com/bitsandbytes-foundation/bitsandbytes
EETQ	🟢	🔴	🟢	🔴	🔴	?	8	🟢	🟢	🟢	https://github.com/NetEase-FuXi/EETQ
GGUF / GGML (llama.cpp)	🟢	🟢	🟢	🔴	🟢	🔴	1 - 8	🔴	See GGUF section	See GGUF section	https://github.com/ggerganov/llama.cpp
GPTQ	🔴	🔴	🟢	🟢	🔴	🔴	2 - 3 - 4 - 8	🟢	🟢	🟢	https://github.com/AutoGPTQ/AutoGPTQ
HQQ	🟢	🟢	🟢	🔴	🔴	🟢	1 - 8	🟢	🔴	🟢	https://github.com/mobiusml/hqq/
Quanto	🟢	🟢	🟢	🔴	🟢	🟢	2 / 4 / 8	🔴	🔴	🟢	https://github.com/huggingface/quanto
FBGEMM_FP8	🟢	🔴	🟢	🔴	🔴	🔴	8	🔴	🟢	🟢	https://github.com/pytorch/FBGEMM
torchao	🟢		🟢	🔴	partial support (int4 weight only)		4 / 8		🟢🔴	🟢	https://github.com/pytorch/ao

* bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit this link.

We value your feedback to help identify bugs before the full release! Check out these docs for more details and feedback links.

** bitsandbytes is seeking contributors to help develop and lead the Apple Silicon backend. Interested? Contact them directly via their repo. Stipends may be available through sponsorships.
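One way to use the table above programmatically is to encode the columns you care about and filter on them. The sketch below copies only the on-the-fly, CPU, and CUDA GPU columns from the table; the dictionary layout and the `candidates` helper are hypothetical names introduced for illustration and are not part of Transformers.

```python
# Subset of the table above: on-the-fly, CPU, and CUDA GPU support per method.
# "yes" = 🟢, "no" = 🔴, "partial" = 🟡, None = cell left empty in the table.
SUPPORT = {
    "AQLM": {"on_the_fly": "no", "cpu": "yes", "cuda": "yes"},
    "AWQ": {"on_the_fly": "no", "cpu": "no", "cuda": "yes"},
    "bitsandbytes": {"on_the_fly": "yes", "cpu": "partial", "cuda": "yes"},
    "EETQ": {"on_the_fly": "yes", "cpu": "no", "cuda": "yes"},
    "GGUF / GGML (llama.cpp)": {"on_the_fly": "yes", "cpu": "yes", "cuda": "yes"},
    "GPTQ": {"on_the_fly": "no", "cpu": "no", "cuda": "yes"},
    "HQQ": {"on_the_fly": "yes", "cpu": "yes", "cuda": "yes"},
    "Quanto": {"on_the_fly": "yes", "cpu": "yes", "cuda": "yes"},
    "FBGEMM_FP8": {"on_the_fly": "yes", "cpu": "no", "cuda": "yes"},
    "torchao": {"on_the_fly": "yes", "cpu": None, "cuda": "yes"},
}


def candidates(device, on_the_fly=False, allow_partial=False):
    """Return method names that support `device` ("cpu" or "cuda")."""
    accepted = {"yes", "partial"} if allow_partial else {"yes"}
    names = [
        name
        for name, row in SUPPORT.items()
        if row[device] in accepted
        and (not on_the_fly or row["on_the_fly"] == "yes")
    ]
    return sorted(names, key=str.lower)


print(candidates("cpu", on_the_fly=True))
print(candidates("cpu", on_the_fly=True, allow_partial=True))
```

Treating 🟡 as opt-in keeps backends that are still maturing, such as the bitsandbytes CPU path described in the first footnote, out of the default result.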
