transformers/docs/source/en/quantization/fbgemm_fp8.md
Marc Sun 96a074fa7e
Add new quant method (#32047)
2024-07-22 20:21:59 +02:00

FBGEMM FP8

With the FBGEMM FP8 quantization method, you can quantize your model to FP8 (W8A8):

  • the weights are quantized to 8-bit (FP8) per channel
  • the activations are quantized to 8-bit (FP8) per token
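
To make the two granularities concrete, here is a minimal pure-Python sketch of row-wise scaling. It is not FBGEMM's kernel. It assumes the E4M3 flavor of FP8 (largest finite value 448) and simulates FP8 rounding in software. The helper names `round_to_e4m3`, `quantize_rowwise` and `dequantize_rowwise` are hypothetical and exist only for this illustration.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (format assumed for this sketch)


def round_to_e4m3(x):
    """Simulate rounding a float to the nearest FP8 E4M3 value (illustrative only)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)   # saturate instead of overflowing
    _, e = math.frexp(a)            # a = m * 2**e with 0.5 <= m < 1
    step = 2.0 ** max(e - 4, -9)    # 4 significant bits; fixed step for subnormals
    return sign * round(a / step) * step


def quantize_rowwise(matrix):
    """Quantize every row with its own scale; returns (quantized rows, scales)."""
    q_rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        q_rows.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rowwise(q_rows, scales):
    """Multiply each quantized row by its scale to recover approximate values."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Weights: one row per output channel. Activations: one row per token.
weights = [[0.02, -0.5, 0.25], [3.0, 0.001, -1.5]]
activations = [[10.0, -0.1, 0.3], [0.004, 0.002, -0.001]]

w_q, w_scales = quantize_rowwise(weights)
a_q, a_scales = quantize_rowwise(activations)

print("weight scales (per channel):", w_scales)
print("activation scales (per token):", a_scales)
print("dequantized weights:", dequantize_rowwise(w_q, w_scales))
```

Because each row has its own scale, the second token, whose values are thousands of times smaller than the first token's, still uses the full FP8 range. With a single scale for the whole tensor it would collapse toward zero.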

It relies on the FBGEMM library, which provides efficient low-precision general matrix multiplication for small batch sizes and supports accuracy-loss-minimizing techniques such as row-wise quantization and outlier-aware quantization.

Tip

You need a GPU with compute capability >= 9.0 (e.g. H100).
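
One way to check this requirement before loading a model is sketched below. The helper `meets_min_capability` is a hypothetical name introduced for this example. The GPU query uses PyTorch's `torch.cuda.get_device_capability()` and is skipped if PyTorch or a CUDA device is unavailable.

```python
def meets_min_capability(capability, minimum=(9, 0)):
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


try:
    import torch

    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()  # (major, minor) of the current device
        status = "sufficient" if meets_min_capability(cap) else "too low for FBGEMM FP8"
        print(f"compute capability {cap[0]}.{cap[1]}: {status}")
    else:
        print("No CUDA device is visible to PyTorch.")
except ImportError:
    print("PyTorch is not installed; cannot query the GPU.")
```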

Before you begin, make sure the latest versions of the following libraries are installed:

pip install --upgrade accelerate fbgemm-gpu torch

If you have issues with the fbgemm-gpu and torch libraries, you might need to install the nightly release. You can follow the instructions here

from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = FbgemmFp8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
# The tokenizer returns input_ids and attention_mask, so name the result accordingly
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))

A quantized model can be saved via "save_pretrained" and reused via "from_pretrained".

quant_path = "/path/to/save/quantized/model"
# Save the quantized model created above, then reload it from disk
quantized_model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")