<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Efficient Inference on a Single GPU

This document will be completed soon with information on how to infer on a single GPU. In the meantime you can check out [the guide for training on a single GPU](perf_train_gpu_one) and [the guide for inference on CPUs](perf_infer_cpu).
## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition

Note that this feature can also be used in a multi-GPU setup.

From the paper [`LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339), we support Hugging Face integration for all models in the Hub with just a few lines of code.
The method reduces the size of `nn.Linear` layers by a factor of 2 for `float16` and `bfloat16` weights, and by a factor of 4 for `float32` weights, with close to no impact on quality, by operating on the outliers in half precision.

Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature-outlier stream that is matrix multiplied in fp16 (about 0.01% of the values), and (2) a regular stream of int8 matrix multiplication (the remaining 99.9%). With this method, int8 inference with no predictive degradation is possible for very large models.
For more details regarding the method, check out the [paper](https://arxiv.org/abs/2208.07339) or our [blogpost about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration).
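To make the decomposition more concrete, here is a small, purely illustrative sketch of the idea in plain PyTorch. It is not the `bitsandbytes` implementation (which relies on custom CUDA kernels); the function name is made up for this example, and the 6.0 threshold simply mirrors the default outlier threshold used by the integration:

```py
import torch

def int8_mixed_precision_matmul(x, w, threshold=6.0):
    """Toy illustration of the LLM.int8() decomposition, not the bitsandbytes kernels."""
    # (1) outlier stream: input feature columns that contain large magnitudes
    #     (roughly 0.01% of values in practice) stay in higher precision
    outliers = (x.abs() > threshold).any(dim=0)
    out_outliers = x[:, outliers] @ w[outliers, :]

    # (2) regular stream: everything else is quantized to int8 with absmax
    #     scales, multiplied, and then dequantized
    x_reg, w_reg = x[:, ~outliers], w[~outliers, :]
    x_scale = x_reg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127
    w_scale = w_reg.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127
    x_q = (x_reg / x_scale).round().to(torch.int8)
    w_q = (w_reg / w_scale).round().to(torch.int8)
    out_regular = (x_q.float() @ w_q.float()) * (x_scale * w_scale)

    # the two partial results are summed to recover the full matmul
    return out_outliers + out_regular
```
For random inputs the result stays close to the plain full-precision matmul; in the real implementation the regular stream uses weights stored in 8 bits, which is where the memory savings come from.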
Note that you need a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure you have enough GPU memory to store a quarter of the model (or half of it, if the model weights are in half precision) before using this feature.
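As a quick back-of-the-envelope check, you can estimate the memory needed for the weights alone from the parameter count: roughly 4 bytes per parameter in `float32`, 2 in `float16`/`bfloat16`, and 1 in int8. The parameter count below is only a placeholder; activations, the CUDA context and the fp16 outlier computations add overhead on top:

```py
num_params = 2_500_000_000  # plug in your model's parameter count (2.5B used purely as an example)

print(f"fp32: ~{num_params * 4 / 1e9:.1f} GB")  # 4 bytes per weight
print(f"fp16: ~{num_params * 2 / 1e9:.1f} GB")  # 2 bytes per weight
print(f"int8: ~{num_params * 1 / 1e9:.1f} GB")  # 1 byte per weight
```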
Below are some notes to help you use this module, or follow the demos on [Google Colab](#colab-demos).
### Requirements

- Make sure that you run on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere or newer architectures, e.g. T4, RTX20s, RTX30s, A40-A100). You can verify this with the short snippet after this list.
- Install the correct version of `bitsandbytes` by running:
`pip install bitsandbytes>=0.31.5`
- Install `accelerate` by running:
`pip install accelerate>=0.12.0`
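If you are unsure whether your GPU qualifies, you can query its compute capability with PyTorch; Turing corresponds to compute capability 7.5, Ampere to 8.0 and higher. This is only a quick sanity check, not part of the `bitsandbytes` API:

```py
import torch

if torch.cuda.is_available():
    # 8-bit tensor cores require Turing (compute capability 7.5) or newer
    major, minor = torch.cuda.get_device_capability(0)
    supported = (major, minor) >= (7, 5)
    print(f"Compute capability {major}.{minor} - 8-bit tensor cores supported: {supported}")
else:
    print("No CUDA GPU detected")
```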
### Running mixed-int8 models - single GPU setup

After installing the required libraries, the way to load your mixed 8-bit model is as follows:
```py
from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
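Once loaded, the model can be used like any other Transformers model. As a quick sanity check, here is a minimal generation sketch continuing from the snippet above; the prompt and generation settings are arbitrary examples:

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")

# generation works exactly as with a non-quantized model
outputs = model_8bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```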
### Running mixed-int8 models - multi GPU setup

The way to load your mixed 8-bit model on multiple GPUs is as follows (it is the same command as for the single GPU setup):
```py
from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
But you can control the GPU RAM you want to allocate on each GPU using `accelerate`. Use the `max_memory` argument as follows:

```py
from transformers import AutoModelForCausalLM

max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```
In this example, the first GPU will use 1GB of memory and the second one 2GB.
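To sanity-check where the weights ended up and how much memory the quantized model actually uses, you can inspect the loaded model. The snippet below assumes a recent version of Transformers where `hf_device_map` (set when `device_map="auto"` is used) and `get_memory_footprint()` are available:

```py
# continuing from the snippet above
print(model_8bit.hf_device_map)           # which device each group of layers was placed on
print(model_8bit.get_memory_footprint())  # size of the loaded weights, in bytes
```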
### Colab demos

With this method, you can run inference on models that it was previously not possible to run on Google Colab.
Check out the demo for running T5-11b (42GB in fp32) with 8-bit quantization on Google Colab:

[Open the T5-11b demo in Colab](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)

Or this demo for BLOOM-3B:

[Open the BLOOM-3B demo in Colab](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)