* Support `AOPerModuleConfig` and include_embedding
Summary:
This PR adds support per module configuration for torchao
Also added per module quantization examples:
1. Quantizing different layers with different quantization configs
2. Skip quantization for certain layers
Test Plan:
python tests/quantization/torchao_integration/test_torchao.py -k test_include_embedding
python tests/quantization/torchao_integration/test_torchao.py -k test_per_module_config_skip
Reviewers:
Subscribers:
Tasks:
Tags:
* format
* format
* inlcude embedding remove input embedding from module not to convert
* more docs
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Update src/transformers/quantizers/quantizer_torchao.py
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Update src/transformers/quantizers/quantizer_torchao.py
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
---------
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Restructure torchao quantization examples
Summary:
Mainly structured the examples by hardwares and then listed
the recommended quantization methods for each hardware H100 GPU, A100 GPU and CPU
Also added example for push_to_hub
Test Plan:
not required
Reviewers:
Subscribers:
Tasks:
Tags:
* update
* drop float8 cpu
* address comments and simplify
* small update
* link update
* minor update
* refactor docs
* add serialization
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* reorder
* add link
* change automatic to autoquant
Co-authored-by: DerekLiu35 <91234588+DerekLiu35@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* nits
* refactor
* add colab
* update
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: DerekLiu35 <91234588+DerekLiu35@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* initial draft
* make documentation simpler
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/selecting.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* turn pros and cons into tables
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* add links to each quant method page
* separate calibration vs no calibration methods
* add calibration time estimates
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Disable inductor config setter by default
This is hard to debug and should be off by default
* remove default settings in autoquant too
* Add info to torchao.md about recommended settings
* satisfying Ruff format
Summary:
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* update awq doc
* Update docs/source/en/quantization/awq.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/awq.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/awq.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/quantization/awq.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* add note for inference
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update torchao.md: use auto-compilation
* Update torchao.md: indicate updating transformers to the latest
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* rebasing changes
* fixing style
* adding some doc to functions
* remove bitblas
* change dtype
* fixing check_code_quality
* fixing import order
* adding doc to tree
* Small update on BitLinear
* adding some tests
* sorting imports
* small update
* reformatting
* reformatting
* reformatting with ruff
* adding assert
* changes after review
* update disk offloading
* adapting after review
* Update after review
* add is_serializable back
* fixing style
* adding serialization test
* make style
* small updates after review
* enable cpu awq ipex linear
* add doc for cpu awq with ipex kernel
* add tests for cpu awq
* fix code style
* fix doc and tests
* Update docs/source/en/quantization/awq.md
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update tests/quantization/autoawq/test_awq.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* fix comments
* fix log
* fix log
* fix style
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* HQQ model serialization attempt
* fix hqq dispatch and unexpected keys
* style
* remove check_old_param
* revert to check HQQLinear in quantizer_hqq.py
* revert to check HQQLinear in quantizer_hqq.py
* update HqqConfig default params
* make ci happy
* make ci happy
* revert to HQQLinear check in quantizer_hqq.py
* check hqq_min version 0.2.0
* set axis=1 as default in quantization_config.py
* validate_env with hqq>=0.2.0 version message
* deprecated hqq kwargs message
* make ci happy
* remove run_expected_keys_check hack + bump to 0.2.1 min hqq version
* fix unexpected_keys hqq update
* add pre_quantized check
* add update_expected_keys to base quantizerr
* ci base.py fix?
* ci base.py fix?
* fix "quantization typo" src/transformers/utils/quantization_config.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix post merge
---------
Co-authored-by: Marc Sun <marc@huggingface.co>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Enable non-safetensor serialization and deserialization for TorchAoConfig quantized model
Summary:
After https://github.com/huggingface/huggingface_hub/pull/2440 we added non-safetensor serialization and deserialization
in huggingface, with this we can now add the support in transformers
Note that we don't plan to add safetensor serialization due to different goals of wrapper tensor subclass and safetensor
see README for more details
Test Plan:
tested locally
Reviewers:
Subscribers:
Tasks:
Tags:
* formatting
* formatting
* minor fix
* formatting
* address comments
* comments
* minor fix
* update doc
* refactor compressed tensor quantizer
* Add compressed-tensors HFQuantizer implementation
* flag serializable as False
* run
* revive lines deleted by ruff
* fixes to load+save from sparseml, edit config to quantization_config, and load back
* address satrat comment
* compressed_tensors to compressed-tensors and revert back is_serializable
* rename quant_method from sparseml to compressed-tensors
* tests
* edit tests
* clean up tests
* make style
* cleanup
* cleanup
* add test skip for when compressed tensors is not installed
* remove pydantic import + style
* delay torch import in test
* initial docs
* update main init for compressed tensors config
* make fix-copies
* docstring
* remove fill_docstring
* Apply suggestions from code review
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* review comments
* review comments
* comments - suppress warnings on state dict load, tests, fixes
* bug-fix - remove unnecessary call to apply quant lifecycle
* run_compressed compatability
* revert changes not needed for compression
* no longer need unexpected keys fn
* unexpected keys not needed either
* Apply suggestions from code review
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* add to_diff_dict
* update docs and expand testing
* Update _toctree.yml with compressed-tensors
* Update src/transformers/utils/quantization_config.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* update doc
* add note about saving a loaded model
---------
Co-authored-by: George Ohashi <george@neuralmagic.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Sara Adkins <sara@neuralmagic.com>
Co-authored-by: Sara Adkins <sara.adkins65@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Dipika <dipikasikka1@gmail.com>
* enable cpu bnb path
* fix style
* fix code style
* fix 4 bit path
* Update src/transformers/utils/import_utils.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* add multi backend refactor tests
* fix style
* tweak 4bit quantizer + fix corresponding tests
* tweak 8bit quantizer + *try* fixing corresponding tests
* fix dequant bnb 8bit
* account for Intel CPU in variability of expected outputs
* enable cpu and xpu device map
* further tweaks to account for Intel CPU
* fix autocast to work with both cpu + cuda
* fix comments
* fix comments
* switch to testing_utils.torch_device
* allow for xpu in multi-gpu tests
* fix tests 4bit for CPU NF4
* fix bug with is_torch_xpu_available needing to be called as func
* avoid issue where test reports attr err due to other failure
* fix formatting
* fix typo from resolving of merge conflict
* polish based on last PR review
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* fix CI
* Update src/transformers/integrations/integration_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/integrations/integration_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix error log
* fix error msg
* add \n in error log
* make quality
* rm bnb cuda restriction in doc
* cpu model don't need dispatch
* fix doc
* fix style
* check cuda avaliable in testing
* fix tests
* Update docs/source/en/model_doc/chameleon.md
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update docs/source/en/model_doc/llava_next.md
Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update tests/quantization/bnb/test_4bit.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update tests/quantization/bnb/test_4bit.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* fix doc
* fix check multibackends
* fix import sort
* remove check torch in bnb
* docs: update bitsandbytes references with multi-backend info
* docs: fix small mistakes in bnb paragraph
* run formatting
* reveret bnb check
* move bnb multi-backend check to import_utils
* Update src/transformers/utils/import_utils.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* fix bnb check
* minor fix for bnb
* check lib first
* fix code style
* Revert "run formatting"
This reverts commit ac108c6d6b.
* fix format
* give warning when bnb version is low and no cuda found]
* fix device assignment check to be multi-device capable
* address akx feedback on get_avlbl_dev fn
* revert partially, as we don't want the function that public, as docs would be too much (enforced)
---------
Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Fixed typo: insted to instead
* Fixed typo: relase to release
* Fixed typo: nighlty to nightly
* Fixed typos: versatible, benchamarks, becnhmark to versatile, benchmark, benchmarks
* Fixed typo in comment: quantizd to quantized
* Fixed typo: architecutre to architecture
* Fixed typo: contibution to contribution
* Fixed typo: Presequities to Prerequisites
* Fixed typo: faste to faster
* Fixed typo: extendeding to extending
* Fixed typo: segmetantion_maps to segmentation_maps
* Fixed typo: Alternativelly to Alternatively
* Fixed incorrectly defined variable: output to output_disabled
* Fixed typo in library name: tranformers.onnx to transformers.onnx
* Fixed missing import: import tensorflow as tf
* Fixed incorrectly defined variable: token_tensor to tokens_tensor
* Fixed missing import: import torch
* Fixed incorrectly defined variable and typo: uromaize to uromanize
* Fixed incorrectly defined variable and typo: uromaize to uromanize
* Fixed typo in function args: numpy.ndarry to numpy.ndarray
* Fixed Inconsistent Library Name: Torchscript to TorchScript
* Fixed Inconsistent Class Name: OneformerProcessor to OneFormerProcessor
* Fixed Inconsistent Class Named Typo: TFLNetForMultipleChoice to TFXLNetForMultipleChoice
* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch
* Fixed Inconsistent Function Name Typo: captureWarning to captureWarnings
* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch
* Fixed Inconsistent Class Name Typo: TrainingArgument to TrainingArguments
* Fixed Inconsistent Model Name Typo: Swin2R to Swin2SR
* Fixed Inconsistent Model Name Typo: EART to BERT
* Fixed Inconsistent Library Name Typo: TensorFLow to TensorFlow
* Fixed Broken Link for Speech Emotion Classification with Wav2Vec2
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed Punctuation: Two commas
* Fixed Punctuation: No Space between XLM-R and is
* Fixed Punctuation: No Space between [~accelerate.Accelerator.backward] and method
* Added backticks to display model.fit() in codeblock
* Added backticks to display openai-community/gpt2 in codeblock
* Fixed Minor Typo: will to with
* Fixed Minor Typo: is to are
* Fixed Minor Typo: in to on
* Fixed Minor Typo: inhibits to exhibits
* Fixed Minor Typo: they need to it needs
* Fixed Minor Typo: cast the load the checkpoints To load the checkpoints
* Fixed Inconsistent Class Name Typo: TFCamembertForCasualLM to TFCamembertForCausalLM
* Fixed typo in attribute name: outputs.last_hidden_states to outputs.last_hidden_state
* Added missing verbosity level: fatal
* Fixed Minor Typo: take To takes
* Fixed Minor Typo: heuristic To heuristics
* Fixed Minor Typo: setting To settings
* Fixed Minor Typo: Content To Contents
* Fixed Minor Typo: millions To million
* Fixed Minor Typo: difference To differences
* Fixed Minor Typo: while extract To which extracts
* Fixed Minor Typo: Hereby To Here
* Fixed Minor Typo: addition To additional
* Fixed Minor Typo: supports To supported
* Fixed Minor Typo: so that benchmark results TO as a consequence, benchmark
* Fixed Minor Typo: a To an
* Fixed Minor Typo: a To an
* Fixed Minor Typo: Chain-of-though To Chain-of-thought
* Add TorchAOHfQuantizer
Summary:
Enable loading torchao quantized model in huggingface.
Test Plan:
local test
Reviewers:
Subscribers:
Tasks:
Tags:
* Fix a few issues
* style
* Added tests and addressed some comments about dtype conversion
* fix torch_dtype warning message
* fix tests
* style
* TorchAOConfig -> TorchAoConfig
* enable offload + fix memory with multi-gpu
* update torchao version requirement to 0.4.0
* better comments
* add torch.compile to torchao README, add perf number link
---------
Co-authored-by: Marc Sun <marc@huggingface.co>
* Change in quantization docs
* Update overview.md
* Update docs/source/en/quantization/overview.md
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>