# Fine-grained FP8

With the fine-grained FP8 quantization method, you can quantize your model to FP8 (W8A8):

- The weights are quantized to 8 bits (FP8) per 2D block (e.g. `weight_block_size=(128, 128)`), a scheme inspired by the DeepSeek implementation (see the sketch after this list).
- The activations are quantized to 8 bits (FP8) per group per token, with the group size along the input channels matching that of the weights (128 by default).
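
The sketch below illustrates the idea of per-block weight quantization, assuming PyTorch's `torch.float8_e4m3fn` type; it is an illustration only, not the library's internal implementation:

```py
import torch

# Illustrative per-block FP8 weight quantization (a sketch, not transformers' internal code).
def quantize_weight_per_block(weight: torch.Tensor, block_size=(128, 128)):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    qweight = torch.empty_like(weight, dtype=torch.float8_e4m3fn)
    scales = torch.empty(weight.shape[0] // block_size[0], weight.shape[1] // block_size[1])
    for i in range(0, weight.shape[0], block_size[0]):
        for j in range(0, weight.shape[1], block_size[1]):
            block = weight[i:i + block_size[0], j:j + block_size[1]]
            scale = block.abs().max() / fp8_max  # one scale per 2D block
            scales[i // block_size[0], j // block_size[1]] = scale
            qweight[i:i + block_size[0], j:j + block_size[1]] = (block / scale).to(torch.float8_e4m3fn)
    return qweight, scales
```

Dequantization multiplies each block back by its stored scale; activations are handled analogously, but with one scale per token per group of 128 input channels.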

This method was implemented to add support for the DeepSeek-V3 and DeepSeek-R1 models; see the DeepSeek-V3 paper for the details of the quantization scheme.

> [!TIP]
> You need a GPU with compute capability >= 9 (e.g. H100).
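
You can check the compute capability of your GPU with PyTorch:

```py
import torch

# An H100 (Hopper) reports (9, 0)
print(torch.cuda.get_device_capability())
```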

Before you begin, make sure the following libraries are installed with their latest versions:

```bash
pip install --upgrade accelerate torch
```

> [!TIP]
> You need to install a torch version that is compatible with your GPU's CUDA version.
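
You can verify which CUDA version your installed torch build was compiled against:

```py
import torch

print(torch.__version__)          # installed torch version
print(torch.version.cuda)         # CUDA version this torch build was compiled with
print(torch.cuda.is_available())  # True if torch can see your GPU
```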

By default, the weights are loaded in full precision (`torch.float32`) regardless of the actual data type the weights are stored in, such as `torch.float16`. Set `torch_dtype="auto"` to load the weights in the data type defined in the model's `config.json` file, which automatically selects the most memory-optimal data type.

```py
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = FineGrainedFP8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
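
To confirm that the linear layers were quantized, you can check the model's memory footprint and a layer's weight dtype (the module path below assumes a Llama-style architecture and is given for illustration only):

```py
print(f"{quantized_model.get_memory_footprint() / 1e9:.2f} GB")

# Module path assumes a Llama-style model; adjust for other architectures.
print(quantized_model.model.layers[0].mlp.gate_proj.weight.dtype)  # an FP8 dtype such as torch.float8_e4m3fn
```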

A quantized model can be saved with `save_pretrained` and reused later with `from_pretrained`.

```py
quant_path = "/path/to/save/quantized/model"
quantized_model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
```
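
If you want to share the quantized checkpoint, `push_to_hub` works the same way (the repository name below is a placeholder):

```py
# Uploads the quantized weights and config to the Hugging Face Hub (placeholder repo id).
quantized_model.push_to_hub("your-username/Meta-Llama-3-8B-FP8")
tokenizer.push_to_hub("your-username/Meta-Llama-3-8B-FP8")
```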