***********************************************************************************************************************
Exporting transformers models
***********************************************************************************************************************

ONNX / ONNXRuntime
=======================================================================================================================

Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT) <https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field
to provide a unified and community-driven format to store and, by extension, efficiently execute neural networks,
leveraging a variety of hardware and dedicated optimizations.

Starting with transformers v2.10.0, we partnered with ONNX Runtime to provide an easy export of transformers models to
the ONNX format. You can read more about this effort in our joint blog post `Accelerate your NLP pipelines using
Hugging Face Transformers and ONNX Runtime <https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333>`_.

Exporting a model is done through the script ``convert_graph_to_onnx.py`` at the root of the transformers sources. The
following command shows how easy it is to export a BERT model from the library; simply run:

.. code-block:: bash

    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased bert-base-cased.onnx

The conversion tool works for both PyTorch and TensorFlow models and ensures:

* The model and its weights are correctly initialized from the Hugging Face model hub or a local checkpoint.
* The inputs and outputs are correctly mapped to their ONNX counterparts.
* The generated model can be correctly loaded through onnxruntime.

.. note::
    Currently, inputs and outputs are always exported with dynamic sequence axes, preventing some optimizations
    in ONNX Runtime. If you would like to see support for fixed-length inputs/outputs, please open an issue
    on the transformers repository.

The conversion tool also supports different options which let you tune the behavior of the generated model (see the
example after this list):

* **Change the target opset version of the generated model.** (A more recent opset generally supports more operators and enables faster inference.)

* **Export pipeline-specific prediction heads.** (Allows exporting the model along with its task-specific prediction head(s).)

* **Use the external data format (PyTorch only).** (Lets you export models larger than 2 GB; `More info <https://github.com/pytorch/pytorch/pull/33062>`_.)

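As a rough sketch of how such options are typically passed on the command line (the flag names ``--opset`` and
``--pipeline`` below are assumptions for illustration; run ``python convert_graph_to_onnx.py --help`` for the
authoritative list of options and accepted values):

.. code-block:: bash

    # Assumed flags: export with a specific opset version and a task-specific prediction head.
    python convert_graph_to_onnx.py --framework pt --model bert-base-cased \
        --opset 11 --pipeline sentiment-analysis bert-base-cased.onnx
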
Optimizations
-----------------------------------------------------------------------------------------------------------------------

ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph. Below
are some of the transformations which can be enabled to speed up inference through ONNXRuntime (*see note below*):

* Constant folding
* Attention layer fusion
* Skip connection LayerNormalization fusion
* FastGeLU approximation

Some of the optimizations performed by ONNX Runtime can be hardware specific and thus lead to different performance
if used on a machine with a different hardware configuration than the one used for exporting the model.
For this reason, when using ``convert_graph_to_onnx.py``, optimizations are not enabled,
ensuring the model can be easily exported to a variety of hardware.
Optimizations can then be enabled when loading the model through ONNX Runtime for inference.

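As a minimal sketch of what that load-time step can look like with the ``onnxruntime`` Python API (the file name is
the one produced by the export command above):

.. code-block:: python

    import onnxruntime as ort

    # Ask ONNX Runtime to apply all graph optimizations available for the current hardware.
    options = ort.SessionOptions()
    options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

    # Create the inference session on the exported (non-optimized) graph.
    session = ort.InferenceSession("bert-base-cased.onnx", options)
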
.. note::
    When quantization is enabled (see below), the ``convert_graph_to_onnx.py`` script enables optimizations on the
    model, because quantization modifies the underlying graph, making it impossible for ONNX Runtime to perform the
    optimizations afterwards.

.. note::
    For more information about the optimizations enabled by ONNXRuntime, please have a look at the
    `ONNXRuntime GitHub <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_.

Quantization
-----------------------------------------------------------------------------------------------------------------------

The ONNX exporter supports generating a quantized version of the model to allow efficient inference.

Quantization works by converting the memory representation of the parameters in the neural network to a compact
integer format. By default, the weights of a neural network are stored as single-precision floats (`float32`), which
can express a wide range of floating-point numbers with decent precision. These properties are especially interesting
at training time, where you want a fine-grained representation.

On the other hand, after the training phase, it has been shown that one can greatly reduce the range and the precision
of `float32` numbers without changing the performance of the neural network.

More technically, `float32` parameters are converted to a type requiring fewer bits to represent each number, thus
reducing the overall size of the model. Here, we map `float32` values to `int8` values (a non-floating-point,
single-byte number representation) according to the following formula:

.. math::

    y_{float32} = scale * x_{int8} - zero\_point

.. note::
    The quantization process will infer the parameters `scale` and `zero_point` from the neural network parameters.

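As a purely illustrative sketch of this mapping (not the actual ONNX Runtime quantization code; `scale` and
`zero_point` are hand-picked here, whereas the real quantizer infers them from the parameters):

.. code-block:: python

    # Illustrative only: map one float32 value to int8 and back, following the formula above.
    scale, zero_point = 0.05, 0.0   # hand-picked for the example; normally inferred
    x_float = 1.2345

    # Quantize: invert y = scale * x_int8 - zero_point and clamp to the int8 range.
    x_int8 = max(-128, min(127, round((x_float + zero_point) / scale)))

    # Dequantize: apply the formula above; the difference to x_float is the quantization error.
    y_float = scale * x_int8 - zero_point
    print(x_int8, y_float)  # 25 1.25
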
Leveraging tiny integers has numerous advantages when it comes to inference:

* Storing 8 bits instead of 32 bits for each `float32` parameter reduces the size of the model and makes it load faster.
* Integer operations execute an order of magnitude faster on modern hardware.
* Integer operations require less power to do the computations.

In order to convert a transformers model to ONNX IR with quantized weights, you just need to specify ``--quantize``
when using ``convert_graph_to_onnx.py``. You can also have a look at the ``quantize()`` utility method in the same
script file.

Example of quantized BERT model export:

.. code-block:: bash

    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased --quantize bert-base-cased.onnx

.. note::
    Quantization support requires ONNX Runtime >= 1.4.0.

.. note::
    When exporting a quantized model, you will end up with two different ONNX files. The one specified at the end of
    the above command will contain the original ONNX model storing `float32` weights. The second one, with the
    ``-quantized`` suffix, will hold the quantized parameters.

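A minimal sketch of running the quantized graph with ``onnxruntime`` (both the file name and the input names are
assumptions here; use the path printed by the script and ``session.get_inputs()`` to find the actual values):

.. code-block:: python

    import numpy as np
    import onnxruntime as ort

    # Load the quantized graph (file name assumed from the "-quantized" suffix mentioned above).
    session = ort.InferenceSession("bert-base-cased-quantized.onnx")

    # Inspect the expected inputs before feeding anything.
    print([graph_input.name for graph_input in session.get_inputs()])

    # Dummy batch of 8 token ids; in practice, feed the tokenizer's output converted to numpy int64 arrays.
    feed = {
        "input_ids": np.ones((1, 8), dtype=np.int64),
        "attention_mask": np.ones((1, 8), dtype=np.int64),
        "token_type_ids": np.zeros((1, 8), dtype=np.int64),
    }
    outputs = session.run(None, feed)
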
TorchScript
=======================================================================================================================

.. note::
    This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities with
    variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming releases,
    with more code examples, a more flexible implementation, and benchmarks comparing Python-based code with compiled
    TorchScript.

According to PyTorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch
code". PyTorch's two modules `JIT and TRACE <https://pytorch.org/docs/stable/jit.html>`_ allow developers to export
their models to be re-used in other programs, such as efficiency-oriented C++ programs.

We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can be
reused in a different environment than a PyTorch-based Python program. Here we explain how to export and use our
models using TorchScript.

Exporting a model requires two things:

* a forward pass with dummy inputs.
* model instantiation with the ``torchscript`` flag.

These necessities imply several things developers should be careful about, detailed below.

Implications
-----------------------------------------------------------------------------------------------------------------------

TorchScript flag and tied weights
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied
weights, so it is necessary to untie and clone the weights beforehand.

This implies that models instantiated with the ``torchscript`` flag have their ``Embedding`` layer and ``Decoding``
layer separate, which means that they should not be trained down the line. Training would de-synchronize the two
layers, leading to unexpected results.

This is not the case for models that do not have a Language Model head, as those do not have tied weights. These
models can be safely exported without the ``torchscript`` flag.

Dummy inputs and standard lengths
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The dummy inputs are used to do a model forward pass. While the inputs' values are propagated through the layers,
PyTorch keeps track of the different operations executed on each tensor. These recorded operations are then used to
create the "trace" of the model.

The trace is created relative to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
input, and will not work for any other sequence length or batch size. When trying with a different size, an error such
as:

``The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2``

will be raised. It is therefore recommended to trace the model with a dummy input size at least as large as the
largest input that will be fed to the model during inference. Padding can be performed to fill the missing values.
However, as the model will have been traced with a large input size, the dimensions of the different matrices will be
large as well, resulting in more calculations.

It is recommended to be careful of the total number of operations done on each input and to follow performance closely
when exporting models with varying sequence lengths.

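As a minimal sketch of this recommendation (the maximum length of 128 and the ``padding="max_length"`` argument are
assumptions chosen for the example, not values prescribed by the library):

.. code-block:: python

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
    model.eval()

    # Pad the dummy input up to the largest sequence length expected at inference time.
    encoded = tokenizer.encode_plus(
        "Who was Jim Henson?", max_length=128, padding="max_length", truncation=True
    )
    input_ids = torch.tensor([encoded["input_ids"]])
    attention_mask = torch.tensor([encoded["attention_mask"]])

    # Trace with the padded dummy input; shorter real inputs should then be padded to the same length.
    traced_model = torch.jit.trace(model, (input_ids, attention_mask))
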
Using TorchScript in Python
-----------------------------------------------------------------------------------------------------------------------

Below is an example showing how to save and load models, as well as how to use the trace for inference.

Saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated according
to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``.

.. code-block:: python

    from transformers import BertModel, BertTokenizer, BertConfig
    import torch

    enc = BertTokenizer.from_pretrained("bert-base-uncased")

    # Tokenizing input text
    text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
    tokenized_text = enc.tokenize(text)

    # Masking one of the input tokens
    masked_index = 8
    tokenized_text[masked_index] = '[MASK]'
    indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

    # Creating a dummy input
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    dummy_input = [tokens_tensor, segments_tensors]

    # Initializing the model with the torchscript flag
    # Flag set to True even though it is not necessary as this model does not have an LM Head.
    config = BertConfig(vocab_size=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)

    # Instantiating the model
    model = BertModel(config)

    # The model needs to be in evaluation mode
    model.eval()

    # If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
    model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

    # Creating the trace
    traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
    torch.jit.save(traced_model, "traced_bert.pt")

Loading a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This snippet shows how to load the ``BertModel`` that was previously saved to disk under the name ``traced_bert.pt``.
We are re-using the previously initialised ``dummy_input``.

.. code-block:: python

    loaded_model = torch.jit.load("traced_bert.pt")
    loaded_model.eval()

    all_encoder_layers, pooled_output = loaded_model(*dummy_input)

Using a traced model for inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the traced model for inference is as simple as using its ``__call__`` dunder method:

.. code-block:: python

    traced_model(tokens_tensor, segments_tensors)