fix gptq nits (#25500)

* fix nits

* fix docstring

* fix doc

* fix damp_percent

* fix doc
Marc Sun 2023-08-14 11:43:38 -04:00 committed by GitHub
parent 80f29a25a7
commit 06a1d75bd5
2 changed files with 12 additions and 11 deletions


@@ -18,11 +18,11 @@ rendered properly in your Markdown viewer.
## `AutoGPTQ` Integration
-🤗 Transformers has integrated `optimum` API to perform GPTQ quantization on language models. You can load and quantize your model in 8,6,4 or even 2 bits without a big drop of performance and faster inference speed! This is supported by most GPU hardwares.
+🤗 Transformers has integrated the `optimum` API to perform GPTQ quantization on language models. You can load and quantize your model in 8, 4, 3 or even 2 bits without a big drop in performance and with faster inference speed! This is supported on most GPU hardware.
To learn more about the quantization method, check out:
- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
-<!-- - the `optimum` [guide]() on GPTQ quantization -->
+- the `optimum` [guide](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization) on GPTQ quantization
- the [`AutoGPTQ`](https://github.com/PanQiWei/AutoGPTQ) library used as the backend
### Requirements
@@ -40,11 +40,12 @@ You need to have the following requirements installed to run the code below:
- Install latest `accelerate` library
`pip install --upgrade accelerate`
-GPTQ integration supports for now only text models and you may encounter unexpected behaviour for vision, speech or multi-modal models.
+Note that GPTQ integration currently supports only text models; you may encounter unexpected behaviour with vision, speech or multi-modal models.
### Load and quantize a model
-GPTQ is a quantization method that requires weights calibration before using the quantized models. If you want to quantize transformers model from scratch, it might take some time before producing the quantized model (~10 min on a Google colab for `facebook/opt-350m` model.
+GPTQ is a quantization method that requires weight calibration before using the quantized models. If you want to quantize a transformers model from scratch, it can take some time to produce the quantized model (~5 min on a Google colab for the `facebook/opt-350m` model).
Hence, there are two different scenarios in which you would use GPTQ-quantized models: the first is to load a model that has already been quantized by another user and is available on the Hub; the second is to quantize your own model from scratch and save it or push it to the Hub so that other users can also use it.
#### GPTQ Configuration
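
The "Load and quantize a model" passage above describes two workflows: quantizing a model from scratch and loading an already-quantized checkpoint. A minimal sketch of both, using the `GPTQConfig` class touched in the second file of this commit, could look like the following; the 4-bit setting, the `"c4"` calibration dataset and the local save path are illustrative choices, not part of the diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize from scratch: calibration runs while the model loads, so expect a
# few minutes on a Colab GPU for a model of this size.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)

# Save locally (or push to the Hub) so other users can reuse the quantized weights.
quantized_model.save_pretrained("opt-350m-gptq")  # illustrative path
tokenizer.save_pretrained("opt-350m-gptq")

# Load an already-quantized checkpoint: the quantization config is read from
# the checkpoint itself, so no GPTQConfig is needed here.
reloaded = AutoModelForCausalLM.from_pretrained("opt-350m-gptq", device_map="auto")
```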


@@ -317,9 +317,9 @@ class GPTQConfig(QuantizationConfigMixin):
original datasets used in GPTQ paper ['wikitext2','c4','c4-new','ptb','ptb-new']
group_size (`int`, *optional*, defaults to 128):
The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
-damp_percent (`float`, *optional*, defaults to 0.01):
-The percent of the average Hessian diagonal to use for dampening. Recommended value is 0.01.
-desc_act (`bool`, *optional*, defaults to `True`):
+damp_percent (`float`, *optional*, defaults to 0.1):
+The percent of the average Hessian diagonal to use for dampening. Recommended value is 0.1.
+desc_act (`bool`, *optional*, defaults to `False`):
Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly
speed up inference but the perplexity may become slightly worse. Also known as act-order.
sym (`bool`, *optional*, defaults to `True`):
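
As a quick illustration of the parameters whose documented defaults are corrected in this hunk (a sketch for readers, not part of the diff), a config that spells them out explicitly would be:

```python
from transformers import GPTQConfig

config = GPTQConfig(
    bits=4,
    group_size=128,    # recommended value; -1 means per-column quantization
    damp_percent=0.1,  # fraction of the average Hessian diagonal used for dampening
    desc_act=False,    # act-order off: faster inference, possibly slightly worse perplexity
    sym=True,          # symmetric quantization
)
```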
@@ -350,8 +350,8 @@ class GPTQConfig(QuantizationConfigMixin):
tokenizer: Any = None,
dataset: Optional[Union[List[str], str]] = None,
group_size: int = 128,
-damp_percent: float = 0.01,
-desc_act: bool = True,
+damp_percent: float = 0.1,
+desc_act: bool = False,
sym: bool = True,
true_sequential: bool = True,
use_cuda_fp16: bool = False,
@@ -391,8 +391,8 @@ class GPTQConfig(QuantizationConfigMixin):
r"""
Safety checker that arguments are correct
"""
-if self.bits not in [2, 4, 6, 8]:
-raise ValueError(f"Only support quantization to [2,4,6,8] bits but found {self.bits}")
+if self.bits not in [2, 3, 4, 8]:
+raise ValueError(f"Only support quantization to [2,3,4,8] bits but found {self.bits}")
if self.group_size != -1 and self.group_size <= 0:
raise ValueError("group_size must be greater than 0 or equal to -1")
if not (0 < self.damp_percent < 1):
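
To make the corrected check concrete, here is a small sketch (assuming the class behaves as in the hunk above, with this safety checker running on construction) of what now passes and fails validation:

```python
from transformers import GPTQConfig

# 3-bit quantization is now accepted by the check...
GPTQConfig(bits=3)

# ...while 6 bits, which the old list wrongly advertised, is rejected.
try:
    GPTQConfig(bits=6)
except ValueError as err:
    print(err)  # Only support quantization to [2,3,4,8] bits but found 6

# damp_percent must lie strictly between 0 and 1, per the final check above.
try:
    GPTQConfig(bits=4, damp_percent=1.5)
except ValueError as err:
    print(err)
```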