[docs] Doc TOC updates (#23049)

* first draft of toc restructure
* polishing based on feedback
2025-08-02 19:21:31 +06:00 · 2023-04-28 09:24:28 -04:00
commit 521a8ffa53
parent 4893d919f1
3 changed files with 65 additions and 544 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -8,45 +8,22 @@
  title: Get started
 - sections:
  - local: pipeline_tutorial
-    title: Pipelines for inference
+    title: Run inference with pipelines
  - local: autoclass_tutorial
-    title: Load pretrained instances with an AutoClass
+    title: Write portable code with AutoClass
  - local: preprocessing
-    title: Preprocess
+    title: Preprocess data
  - local: training
    title: Fine-tune a pretrained model
+  - local: run_scripts
+    title: Train with a script
  - local: accelerate
-    title: Distributed training with 🤗 Accelerate
+    title: Set up distributed training with 🤗 Accelerate
  - local: model_sharing
-    title: Share a model
+    title: Share your model
  title: Tutorials
 - sections:
  - sections:
-    - local: create_a_model
-      title: Create a custom architecture
-    - local: custom_models
-      title: Sharing custom models
-    - local: run_scripts
-      title: Train with a script
-    - local: sagemaker
-      title: Run training on Amazon SageMaker
-    - local: converting_tensorflow_models
-      title: Converting from TensorFlow checkpoints
-    - local: serialization
-      title: Export to ONNX
-    - local: torchscript
-      title: Export to TorchScript
-    - local: troubleshooting
-      title: Troubleshoot
-    title: General usage
-  - sections:
-    - local: fast_tokenizers
-      title: Use tokenizers from 🤗 Tokenizers
-    - local: multilingual
-      title: Inference for multilingual models
-    - local: generation_strategies
-      title: Text generation strategies
-    - sections:
      - local: tasks/sequence_classification
        title: Text classification
      - local: tasks/token_classification
@@ -63,38 +40,67 @@
        title: Summarization
      - local: tasks/multiple_choice
        title: Multiple choice
-      title: Task guides
-      isExpanded: false
    title: Natural Language Processing
+    isExpanded: false
  - sections:
-    - local: tasks/audio_classification
-      title: Audio classification
-    - local: tasks/asr
-      title: Automatic speech recognition
+      - local: tasks/audio_classification
+        title: Audio classification
+      - local: tasks/asr
+        title: Automatic speech recognition
    title: Audio
+    isExpanded: false
  - sections:
-    - local: tasks/image_classification
-      title: Image classification
-    - local: tasks/semantic_segmentation
-      title: Semantic segmentation
-    - local: tasks/video_classification
-      title: Video classification
-    - local: tasks/object_detection
-      title: Object detection
-    - local: tasks/zero_shot_object_detection
-      title: Zero-shot object detection
-    - local: tasks/zero_shot_image_classification
-      title: Zero-shot image classification
-    - local: tasks/monocular_depth_estimation
-      title: Depth estimation
+      - local: tasks/image_classification
+        title: Image classification
+      - local: tasks/semantic_segmentation
+        title: Semantic segmentation
+      - local: tasks/video_classification
+        title: Video classification
+      - local: tasks/object_detection
+        title: Object detection
+      - local: tasks/zero_shot_object_detection
+        title: Zero-shot object detection
+      - local: tasks/zero_shot_image_classification
+        title: Zero-shot image classification
+      - local: tasks/monocular_depth_estimation
+        title: Depth estimation
    title: Computer Vision
+    isExpanded: false
  - sections:
-    - local: tasks/image_captioning
-      title: Image captioning
-    - local: tasks/document_question_answering
-      title: Document Question Answering
+      - local: tasks/image_captioning
+        title: Image captioning
+      - local: tasks/document_question_answering
+        title: Document Question Answering
    title: Multimodal
-  - sections:
+    isExpanded: false
+  title: Task Guides
+- sections:
+    - local: fast_tokenizers
+      title: Use fast tokenizers from 🤗 Tokenizers
+    - local: multilingual
+      title: Run inference with multilingual models
+    - local: generation_strategies
+      title: Customize text generation strategy
+    - local: create_a_model
+      title: Use model-specific APIs
+    - local: custom_models
+      title: Share a custom model
+    - local: sagemaker
+      title: Run training on Amazon SageMaker
+    - local: serialization
+      title: Export to ONNX
+    - local: torchscript
+      title: Export to TorchScript
+    - local: benchmarks
+      title: Benchmarks
+    - local: notebooks
+      title: Notebooks with examples
+    - local: community
+      title: Community resources
+    - local: troubleshooting
+      title: Troubleshoot
+  title: Developer guides
+- sections:
    - local: performance
      title: Overview
    - local: perf_train_gpu_one
@@ -129,8 +135,8 @@
      title: Hyperparameter Search using Trainer API
    - local: tf_xla
      title: XLA Integration for TensorFlow Models
-    title: Performance and scalability
-  - sections:
+  title: Performance and scalability
+- sections:
    - local: contributing
      title: How to contribute to transformers?
    - local: add_new_model
@@ -143,16 +149,8 @@
      title: Testing
    - local: pr_checks
      title: Checks on a Pull Request
-    title: Contribute
-  - local: notebooks
-    title: 🤗 Transformers Notebooks
-  - local: community
-    title: Community resources
-  - local: benchmarks
-    title: Benchmarks
-  - local: migration
-    title: Migrating from previous packages
-  title: How-to guides
+  title: Contribute
+
 - sections:
  - local: philosophy
    title: Philosophy
--- a/docs/source/en/converting_tensorflow_models.mdx
+++ b/docs/source/en/converting_tensorflow_models.mdx
@@ -1,162 +0,0 @@
-<!--Copyright 2020 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Converting From Tensorflow Checkpoints
-
-A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints to models
-that can be loaded using the `from_pretrained` methods of the library.
-
-<Tip>
-
-Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) available in any
-transformers >= 2.3.0 installation.
-
-The documentation below reflects the **transformers-cli convert** command format.
-
-</Tip>
-
-## BERT
-
-You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) in a PyTorch save file by using the
-[convert_bert_original_tf_checkpoint_to_pytorch.py](https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert/convert_bert_original_tf_checkpoint_to_pytorch.py) script.
-
-This CLI takes as input a TensorFlow checkpoint (three files starting with `bert_model.ckpt`) and the associated
-configuration file (`bert_config.json`), and creates a PyTorch model for this configuration, loads the weights from
-the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can
-be imported using `from_pretrained()` (see example in [quicktour](quicktour) , [run_glue.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_glue.py) ).
-
-You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
-checkpoint (the three files starting with `bert_model.ckpt`) but be sure to keep the configuration file (\
-`bert_config.json`) and the vocabulary file (`vocab.txt`) as these are needed for the PyTorch model too.
-
-To run this specific conversion script you will need to have TensorFlow and PyTorch installed (`pip install tensorflow`). The rest of the repository only requires PyTorch.
-
-Here is an example of the conversion process for a pre-trained `BERT-Base Uncased` model:
-
-```bash
-export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
-
-transformers-cli convert --model_type bert \
-  --tf_checkpoint $BERT_BASE_DIR/bert_model.ckpt \
-  --config $BERT_BASE_DIR/bert_config.json \
-  --pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
-```
-
-You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models).
-
-## ALBERT
-
-Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
-[convert_albert_original_tf_checkpoint_to_pytorch.py](https://github.com/huggingface/transformers/tree/main/src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py) script.
-
-The CLI takes as input a TensorFlow checkpoint (three files starting with `model.ckpt-best`) and the accompanying
-configuration file (`albert_config.json`), then creates and saves a PyTorch model. To run this conversion you will
-need to have TensorFlow and PyTorch installed.
-
-Here is an example of the conversion process for the pre-trained `ALBERT Base` model:
-
-```bash
-export ALBERT_BASE_DIR=/path/to/albert/albert_base
-
-transformers-cli convert --model_type albert \
-  --tf_checkpoint $ALBERT_BASE_DIR/model.ckpt-best \
-  --config $ALBERT_BASE_DIR/albert_config.json \
-  --pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin
-```
-
-You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/albert#pre-trained-models).
-
-## OpenAI GPT
-
-Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint
-save as the same format than OpenAI pretrained model (see [here](https://github.com/openai/finetune-transformer-lm)\
-)
-
-```bash
-export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights
-
-transformers-cli convert --model_type gpt \
-  --tf_checkpoint $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
-  --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
-  [--config OPENAI_GPT_CONFIG] \
-  [--finetuning_task_name OPENAI_GPT_FINETUNED_TASK] \
-```
-
-## OpenAI GPT-2
-
-Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see [here](https://github.com/openai/gpt-2))
-
-```bash
-export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights
-
-transformers-cli convert --model_type gpt2 \
-  --tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \
-  --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
-  [--config OPENAI_GPT2_CONFIG] \
-  [--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK]
-```
-
-## Transformer-XL
-
-Here is an example of the conversion process for a pre-trained Transformer-XL model (see [here](https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models))
-
-```bash
-export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint
-
-transformers-cli convert --model_type transfo_xl \
-  --tf_checkpoint $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
-  --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
-  [--config TRANSFO_XL_CONFIG] \
-  [--finetuning_task_name TRANSFO_XL_FINETUNED_TASK]
-```
-
-## XLNet
-
-Here is an example of the conversion process for a pre-trained XLNet model:
-
-```bash
-export TRANSFO_XL_CHECKPOINT_PATH=/path/to/xlnet/checkpoint
-export TRANSFO_XL_CONFIG_PATH=/path/to/xlnet/config
-
-transformers-cli convert --model_type xlnet \
-  --tf_checkpoint $TRANSFO_XL_CHECKPOINT_PATH \
-  --config $TRANSFO_XL_CONFIG_PATH \
-  --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
-  [--finetuning_task_name XLNET_FINETUNED_TASK] \
-```
-
-## XLM
-
-Here is an example of the conversion process for a pre-trained XLM model:
-
-```bash
-export XLM_CHECKPOINT_PATH=/path/to/xlm/checkpoint
-
-transformers-cli convert --model_type xlm \
-  --tf_checkpoint $XLM_CHECKPOINT_PATH \
-  --pytorch_dump_output $PYTORCH_DUMP_OUTPUT
- [--config XML_CONFIG] \
- [--finetuning_task_name XML_FINETUNED_TASK]
-```
-
-## T5
-
-Here is an example of the conversion process for a pre-trained T5 model:
-
-```bash
-export T5=/path/to/t5/uncased_L-12_H-768_A-12
-
-transformers-cli convert --model_type t5 \
-  --tf_checkpoint $T5/t5_model.ckpt \
-  --config $T5/t5_config.json \
-  --pytorch_dump_output $T5/pytorch_model.bin
-```
--- a/docs/source/en/migration.mdx
+++ b/docs/source/en/migration.mdx
@@ -1,315 +0,0 @@
-<!---
-Copyright 2020 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Migrating from previous packages
-
-## Migrating from transformers `v3.x` to `v4.x`
-
-A couple of changes were introduced when the switch from version 3 to version 4 was done. Below is a summary of the
-expected changes:
-
-#### 1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default.
-
-The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set.
-
-This introduces two breaking changes:
- The handling of overflowing tokens between the python and rust tokenizers is different.
- The rust tokenizers do not accept integers in the encoding methods.
-
-##### How to obtain the same behavior as v3.x in v4.x
-
- The pipelines now contain additional features out of the box. See the [token-classification pipeline with the `grouped_entities` flag](main_classes/pipelines#transformers.TokenClassificationPipeline).
- The auto-tokenizers now return rust tokenizers. In order to obtain the python tokenizers instead, the user may use the `use_fast` flag by setting it to `False`:
-
-In version `v3.x`:
-```py
-from transformers import AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
-```
-to obtain the same in version `v4.x`:
-```py
-from transformers import AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
-```
-
-#### 2. SentencePiece is removed from the required dependencies
-
-The requirement on the SentencePiece dependency has been lifted from the `setup.py`. This is done so that we may have a channel on anaconda cloud without relying on `conda-forge`. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard `transformers` installation.
-
-This includes the **slow** versions of:
- `XLNetTokenizer`
- `AlbertTokenizer`
- `CamembertTokenizer`
- `MBartTokenizer`
- `PegasusTokenizer`
- `T5Tokenizer`
- `ReformerTokenizer`
- `XLMRobertaTokenizer`
-
-##### How to obtain the same behavior as v3.x in v4.x
-
-In order to obtain the same behavior as version `v3.x`, you should install `sentencepiece` additionally:
-
-In version `v3.x`:
-```bash
-pip install transformers
-```
-to obtain the same in version `v4.x`:
-```bash
-pip install transformers[sentencepiece]
-```
-or
-```bash
-pip install transformers sentencepiece
-```
-#### 3. The architecture of the repo has been updated so that each model resides in its folder
-
-The past and foreseeable addition of new models means that the number of files in the directory `src/transformers` keeps growing and becomes harder to navigate and understand. We made the choice to put each model and the files accompanying it in their own sub-directories.
-
-This is a breaking change as importing intermediary layers using a model's module directly needs to be done via a different path.
-
-##### How to obtain the same behavior as v3.x in v4.x
-
-In order to obtain the same behavior as version `v3.x`, you should update the path used to access the layers.
-
-In version `v3.x`:
-```bash
-from transformers.modeling_bert import BertLayer
-```
-to obtain the same in version `v4.x`:
-```bash
-from transformers.models.bert.modeling_bert import BertLayer
-```
-
-#### 4. Switching the `return_dict` argument to `True` by default
-
-The [`return_dict` argument](main_classes/output) enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. This object is self-documented as keys can be used to retrieve values, while also behaving as a tuple as users may retrieve objects by index or by slice.
-
-This is a breaking change as the limitation of that tuple is that it cannot be unpacked: `value0, value1 = outputs` will not work.
-
-##### How to obtain the same behavior as v3.x in v4.x
-
-In order to obtain the same behavior as version `v3.x`, you should specify the `return_dict` argument to `False`, either in the model configuration or during the forward pass.
-
-In version `v3.x`:
-```bash
-model = BertModel.from_pretrained("bert-base-cased")
-outputs = model(**inputs)
-```
-to obtain the same in version `v4.x`:
-```bash
-model = BertModel.from_pretrained("bert-base-cased")
-outputs = model(**inputs, return_dict=False)
-```
-or
-```bash
-model = BertModel.from_pretrained("bert-base-cased", return_dict=False)
-outputs = model(**inputs)
-```
-
-#### 5. Removed some deprecated attributes
-
-Attributes that were deprecated have been removed if they had been deprecated for at least a month. The full list of deprecated attributes can be found in [#8604](https://github.com/huggingface/transformers/pull/8604).
-
-Here is a list of these attributes/methods/arguments and what their replacements should be:
-
-In several models, the labels become consistent with the other models:
- `masked_lm_labels` becomes `labels` in `AlbertForMaskedLM` and `AlbertForPreTraining`.
- `masked_lm_labels` becomes `labels` in `BertForMaskedLM` and `BertForPreTraining`.
- `masked_lm_labels` becomes `labels` in `DistilBertForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `ElectraForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `LongformerForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `MobileBertForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `RobertaForMaskedLM`.
- `lm_labels` becomes `labels` in `BartForConditionalGeneration`.
- `lm_labels` becomes `labels` in `GPT2DoubleHeadsModel`.
- `lm_labels` becomes `labels` in `OpenAIGPTDoubleHeadsModel`.
- `lm_labels` becomes `labels` in `T5ForConditionalGeneration`.
-
-In several models, the caching mechanism becomes consistent with the other models:
- `decoder_cached_states` becomes `past_key_values` in all BART-like, FSMT and T5 models.
- `decoder_past_key_values` becomes `past_key_values` in all BART-like, FSMT and T5 models.
- `past` becomes `past_key_values` in all CTRL models.
- `past` becomes `past_key_values` in all GPT-2 models.
-
-Regarding the tokenizer classes:
- The tokenizer attribute `max_len` becomes `model_max_length`.
- The tokenizer attribute `return_lengths` becomes `return_length`.
- The tokenizer encoding argument `is_pretokenized` becomes `is_split_into_words`.
-
-Regarding the `Trainer` class:
- The `Trainer` argument `tb_writer` is removed in favor of the callback `TensorBoardCallback(tb_writer=...)`.
- The `Trainer` argument `prediction_loss_only` is removed in favor of the class argument `args.prediction_loss_only`.
- The `Trainer` attribute `data_collator` should be a callable.
- The `Trainer` method `_log` is deprecated in favor of `log`.
- The `Trainer` method `_training_step` is deprecated in favor of `training_step`.
- The `Trainer` method `_prediction_loop` is deprecated in favor of `prediction_loop`.
- The `Trainer` method `is_local_master` is deprecated in favor of `is_local_process_zero`.
- The `Trainer` method `is_world_master` is deprecated in favor of `is_world_process_zero`.
-
-Regarding the `TFTrainer` class:
- The `TFTrainer` argument `prediction_loss_only` is removed in favor of the class argument `args.prediction_loss_only`.
- The `Trainer` method `_log` is deprecated in favor of `log`.
- The `TFTrainer` method `_prediction_loop` is deprecated in favor of `prediction_loop`.
- The `TFTrainer` method `_setup_wandb` is deprecated in favor of `setup_wandb`.
- The `TFTrainer` method `_run_model` is deprecated in favor of `run_model`.
-
-Regarding the `TrainingArguments` class:
- The `TrainingArguments` argument `evaluate_during_training` is deprecated in favor of `evaluation_strategy`.
-
-Regarding the Transfo-XL model:
- The Transfo-XL configuration attribute `tie_weight` becomes `tie_words_embeddings`.
- The Transfo-XL modeling method `reset_length` becomes `reset_memory_length`.
-
-Regarding pipelines:
- The `FillMaskPipeline` argument `topk` becomes `top_k`.
-
-
-
-## Migrating from pytorch-transformers to 🤗 Transformers
-
-Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to 🤗 Transformers.
-
-### Positional order of some models' keywords inputs (`attention_mask`, `token_type_ids`...) changed
-
-To be able to use Torchscript (see #1010, #1204 and #1195) the specific order of some models **keywords inputs** (`attention_mask`, `token_type_ids`...) has been changed.
-
-If you used to call the models with keyword names for keyword arguments, e.g. `model(inputs_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)`, this should not cause any change.
-
-If you used to call the models with positional inputs for keyword arguments, e.g. `model(inputs_ids, attention_mask, token_type_ids)`, you may have to double check the exact order of input arguments.
-
-## Migrating from pytorch-pretrained-bert
-
-Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to 🤗 Transformers
-
-### Models always output `tuples`
-
-The main breaking change when migrating from `pytorch-pretrained-bert` to 🤗 Transformers is that the models forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
-
-The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
-
-In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
-
-Here is a `pytorch-pretrained-bert` to 🤗 Transformers conversion example for a `BertForSequenceClassification` classification model:
-
-```python
-# Let's load our model
-model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
-
-# If you used to have this line in pytorch-pretrained-bert:
-loss = model(input_ids, labels=labels)
-
-# Now just use this line in 🤗 Transformers to extract the loss from the output tuple:
-outputs = model(input_ids, labels=labels)
-loss = outputs[0]
-
-# In 🤗 Transformers you can also have access to the logits:
-loss, logits = outputs[:2]
-
-# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
-model = BertForSequenceClassification.from_pretrained("bert-base-uncased", output_attentions=True)
-outputs = model(input_ids, labels=labels)
-loss, logits, attentions = outputs
-```
-
-### Serialization
-
-Breaking change in the `from_pretrained()`method:
-
-1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
-
-2. The additional `*inputs` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute first which can break derived model classes build based on the previous `BertForSequenceClassification` examples. More precisely, the positional arguments `*inputs` provided to `from_pretrained()` are directly forwarded the model `__init__()` method while the keyword arguments `**kwargs` (i) which match configuration class attributes are used to update said attributes (ii) which don't match any configuration class attributes are forwarded to the model `__init__()` method.
-
-Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
-
-Here is an example:
-
-```python
-### Let's load a model and tokenizer
-model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
-tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
-
-### Do some stuff to our model and tokenizer
-# Ex: add new tokens to the vocabulary and embeddings of our model
-tokenizer.add_tokens(["[SPECIAL_TOKEN_1]", "[SPECIAL_TOKEN_2]"])
-model.resize_token_embeddings(len(tokenizer))
-# Train our model
-train(model)
-
-### Now let's save our model and tokenizer to a directory
-model.save_pretrained("./my_saved_model_directory/")
-tokenizer.save_pretrained("./my_saved_model_directory/")
-
-### Reload the model and the tokenizer
-model = BertForSequenceClassification.from_pretrained("./my_saved_model_directory/")
-tokenizer = BertTokenizer.from_pretrained("./my_saved_model_directory/")
-```
-
-### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
-
-The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer which has a few differences:
-
- it only implements weights decay correction,
- schedules are now externals (see below),
- gradient clipping is now also external (see below).
-
-The new optimizer `AdamW` matches PyTorch `Adam` optimizer API and let you use standard PyTorch or apex methods for the schedule and clipping.
-
-The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.
-
-Here is a conversion examples from `BertAdam` with a linear warmup and decay schedule to `AdamW` and the same schedule:
-
-```python
-# Parameters:
-lr = 1e-3
-max_grad_norm = 1.0
-num_training_steps = 1000
-num_warmup_steps = 100
-warmup_proportion = float(num_warmup_steps) / float(num_training_steps)  # 0.1
-
-### Previously BertAdam optimizer was instantiated like this:
-optimizer = BertAdam(
-    model.parameters(),
-    lr=lr,
-    schedule="warmup_linear",
-    warmup=warmup_proportion,
-    num_training_steps=num_training_steps,
-)
-### and used like this:
-for batch in train_data:
-    loss = model(batch)
-    loss.backward()
-    optimizer.step()
-
-### In 🤗 Transformers, optimizer and schedules are split and instantiated like this:
-optimizer = AdamW(
-    model.parameters(), lr=lr, correct_bias=False
-)  # To reproduce BertAdam specific behavior set correct_bias=False
-scheduler = get_linear_schedule_with_warmup(
-    optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
-)  # PyTorch scheduler
-### and used like this:
-for batch in train_data:
-    loss = model(batch)
-    loss.backward()
-    torch.nn.utils.clip_grad_norm_(
-        model.parameters(), max_grad_norm
-    )  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
-    optimizer.step()
-    scheduler.step()
-```