Fixed Majority of the Typos in transformers[en] Documentation (#33350)

* Fixed typo: insted to instead

* Fixed typo: relase to release

* Fixed typo: nighlty to nightly

* Fixed typos: versatible, benchamarks, becnhmark to versatile, benchmarks, benchmark

* Fixed typo in comment: quantizd to quantized

* Fixed typo: architecutre to architecture

* Fixed typo: contibution to contribution

* Fixed typo: Presequities to Prerequisites

* Fixed typo: faste to faster

* Fixed typo: extendeding to extending

* Fixed typo: segmetantion_maps to segmentation_maps

* Fixed typo: Alternativelly to Alternatively

* Fixed incorrectly defined variable: output to output_disabled

* Fixed typo in library name: tranformers.onnx to transformers.onnx

* Fixed missing import: import tensorflow as tf

* Fixed incorrectly defined variable: token_tensor to tokens_tensor

* Fixed missing import: import torch

* Fixed incorrectly defined variable and typo: uromaize to uromanize

* Fixed incorrectly defined variable and typo: uromaize to uromanize

* Fixed typo in function args: numpy.ndarry to numpy.ndarray

* Fixed Inconsistent Library Name: Torchscript to TorchScript

* Fixed Inconsistent Class Name: OneformerProcessor to OneFormerProcessor

* Fixed Inconsistent Class Named Typo: TFLNetForMultipleChoice to TFXLNetForMultipleChoice

* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch

* Fixed Inconsistent Function Name Typo: captureWarning to captureWarnings

* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch

* Fixed Inconsistent Class Name Typo: TrainingArgument to TrainingArguments

* Fixed Inconsistent Model Name Typo: Swin2R to Swin2SR

* Fixed Inconsistent Model Name Typo: EART to BERT

* Fixed Inconsistent Library Name Typo: TensorFLow to TensorFlow

* Fixed Broken Link for Speech Emotion Classification with Wav2Vec2

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed Punctuation: Two commas

* Fixed Punctuation: No Space between XLM-R and is

* Fixed Punctuation: No Space between [~accelerate.Accelerator.backward] and method

* Added backticks to display model.fit() in codeblock

* Added backticks to display openai-community/gpt2 in codeblock

* Fixed Minor Typo: will to with

* Fixed Minor Typo: is to are

* Fixed Minor Typo: in to on

* Fixed Minor Typo: inhibits to exhibits

* Fixed Minor Typo: they need to it needs

* Fixed Minor Typo: cast the load the checkpoints To load the checkpoints

* Fixed Inconsistent Class Name Typo: TFCamembertForCasualLM to TFCamembertForCausalLM

* Fixed typo in attribute name: outputs.last_hidden_states to outputs.last_hidden_state

* Added missing verbosity level: fatal

* Fixed Minor Typo: take To takes

* Fixed Minor Typo: heuristic To heuristics

* Fixed Minor Typo: setting To settings

* Fixed Minor Typo: Content To Contents

* Fixed Minor Typo: millions To million

* Fixed Minor Typo: difference To differences

* Fixed Minor Typo: while extract To which extracts

* Fixed Minor Typo: Hereby To Here

* Fixed Minor Typo: addition To additional

* Fixed Minor Typo: supports To supported

* Fixed Minor Typo: so that benchmark results TO as a consequence, benchmark

* Fixed Minor Typo: a To an

* Fixed Minor Typo: a To an

* Fixed Minor Typo: Chain-of-though To Chain-of-thought
Author: Nilay Bhatnagar
Date: 2024-09-09 14:17:24 +05:30 (committed by GitHub)
Parent: 489cbfd6d3
Commit: eedd21b9e7
61 changed files with 74 additions and 71 deletions

@@ -46,7 +46,7 @@ The next step is to pass all the relevant training objects to the [`~accelerate.
 ## Backward
-The last addition is to replace the typical `loss.backward()` in your training loop with 🤗 Accelerate's [`~accelerate.Accelerator.backward`]method:
+The last addition is to replace the typical `loss.backward()` in your training loop with 🤗 Accelerate's [`~accelerate.Accelerator.backward`] method:
 ```py
 >>> for epoch in range(num_epochs):
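For orientation, a minimal self-contained sketch of the pattern this hunk documents; the toy linear model and random dataset stand in for whatever the surrounding guide actually trains:

```py
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy setup so the loop below actually runs.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))
train_dataloader = DataLoader(dataset, batch_size=8)

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for epoch in range(2):
    for inputs, labels in train_dataloader:
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        accelerator.backward(loss)  # replaces the usual loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```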

@@ -110,7 +110,7 @@ Now you can access the `feature_maps` object from the first stage of the backbon
 ## AutoFeatureExtractor
-For audio tasks, a feature extractor processes the audio signal the correct input format.
+For audio tasks, a feature extractor processes the audio signal into the correct input format.
 Load a feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

@@ -35,7 +35,7 @@ The classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] allow to flexibly b
 <Tip>
-Hereby, _inference_ is defined by a single forward pass, and _training_ is defined by a single forward pass and
+Here, _inference_ is defined by a single forward pass, and _training_ is defined by a single forward pass and
 backward pass.
 </Tip>
@@ -368,7 +368,7 @@ This section lists a couple of best practices one should be aware of when benchm
 memory measurement it is recommended to run each memory benchmark in a separate process by making sure
 `no_multi_processing` is set to `True`.
 - One should always state the environment information when sharing the results of a model benchmark. Results can vary
-heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very
+heavily between different GPU devices, library versions, etc., as a consequence, benchmark results on their own are not very
 useful for the community.

@@ -37,5 +37,5 @@ help people access the inner representations, mainly adapted from the great work
 - retrieving heads output values and gradients to be able to compute head importance score and prune head as explained
 in https://arxiv.org/abs/1905.10650.
-To help you understand and use these features, we have added a specific example script: [bertology.py](https://github.com/huggingface/transformers/tree/main/examples/research_projects/bertology/run_bertology.py) while extract information and prune a model pre-trained on
+To help you understand and use these features, we have added a specific example script: [bertology.py](https://github.com/huggingface/transformers/tree/main/examples/research_projects/bertology/run_bertology.py) which extracts information and prune a model pre-trained on
 GLUE.

@@ -63,7 +63,7 @@ This page regroups resources around 🤗 Transformers developed by the community
 | [Evaluate LUKE on TACRED, a relation extraction dataset](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | How to evaluate *LukeForEntityPairClassification* on the TACRED dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) |
 | [Evaluate LUKE on CoNLL-2003, an important NER benchmark](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | How to evaluate *LukeForEntitySpanClassification* on the CoNLL-2003 dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) |
 | [Evaluate BigBird-Pegasus on PubMed dataset](https://github.com/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | How to evaluate *BigBirdPegasusForConditionalGeneration* on PubMed dataset | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) |
-| [Speech Emotion Classification with Wav2Vec2](https://github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | How to leverage a pretrained Wav2Vec2 model for Emotion Classification on the MEGA dataset | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) |
+| [Speech Emotion Classification with Wav2Vec2](https://github.com/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | How to leverage a pretrained Wav2Vec2 model for Emotion Classification on the MEGA dataset | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) |
 | [Detect objects in an image with DETR](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | How to use a trained *DetrForObjectDetection* model to detect objects in an image and visualize attention | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) |
 | [Fine-tune DETR on a custom object detection dataset](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | How to fine-tune *DetrForObjectDetection* on a custom object detection dataset | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) |
 | [Finetune T5 for Named Entity Recognition](https://github.com/ToluClassics/Notebooks/blob/main/T5_Ner_Finetuning.ipynb) | How to fine-tune *T5* on a Named Entity Recognition Task | [Ogundepo Odunayo](https://github.com/ToluClassics) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing) |

@@ -203,7 +203,7 @@ This feature can be used with any `nn.Module`-based model.
 </Tip>
-If you start getting `loss=NaN` or the model inhibits some other abnormal behavior due to `inf` or `nan` in
+If you start getting `loss=NaN` or the model exhibits some other abnormal behavior due to `inf` or `nan` in
 activations or weights one needs to discover where the first underflow or overflow happens and what led to it. Luckily
 you can accomplish that easily by activating a special module that will do the detection automatically.

@@ -456,7 +456,7 @@ just like in multinomial sampling. However, in assisted decoding, reducing the t
 ['Alice and Bob, a couple of friends of mine, who are both in the same office as']
 ```
-Alternativelly, you can also set the `prompt_lookup_num_tokens` to trigger n-gram based assisted decoding, as opposed
+Alternatively, you can also set the `prompt_lookup_num_tokens` to trigger n-gram based assisted decoding, as opposed
 to model based assisted decoding. You can read more about it [here](https://twitter.com/joao_gante/status/1747322413006643259).
 ### DoLa Decoding

@@ -71,7 +71,7 @@ pip install 'transformers[tf-cpu]'
 M1 / ARM Users
-You will need to install the following before installing TensorFLow 2.0
+You will need to install the following before installing TensorFlow 2.0
 ```bash
 brew install cmake
 brew install pkg-config

@@ -30,7 +30,7 @@ transformers.logging.set_verbosity_info()
 ```
 You can also use the environment variable `TRANSFORMERS_VERBOSITY` to override the default verbosity. You can set it
-to one of the following: `debug`, `info`, `warning`, `error`, `critical`. For example:
+to one of the following: `debug`, `info`, `warning`, `error`, `critical`, `fatal`. For example:
 ```bash
 TRANSFORMERS_VERBOSITY=error ./myprogram.py
@@ -65,7 +65,7 @@ verbose to the most verbose), those levels (with their corresponding int values
 critical errors.
 - `transformers.logging.ERROR` (int value, 40): only report errors.
 - `transformers.logging.WARNING` or `transformers.logging.WARN` (int value, 30): only reports error and
-warnings. This the default level used by the library.
+warnings. This is the default level used by the library.
 - `transformers.logging.INFO` (int value, 20): reports error, warnings and basic information.
 - `transformers.logging.DEBUG` (int value, 10): report all information.
@@ -77,10 +77,10 @@ Python has two logging systems that are often used in conjunction: `logging`, wh
 which allows further classification of warnings in specific buckets, e.g., `FutureWarning` for a feature or path
 that has already been deprecated and `DeprecationWarning` to indicate an upcoming deprecation.
-We use both in the `transformers` library. We leverage and adapt `logging`'s `captureWarning` method to allow
+We use both in the `transformers` library. We leverage and adapt `logging`'s `captureWarnings` method to allow
 management of these warning messages by the verbosity setters above.
-What does that mean for developers of the library? We should respect the following heuristic:
+What does that mean for developers of the library? We should respect the following heuristics:
 - `warnings` should be favored for developers of the library and libraries dependent on `transformers`
 - `logging` should be used for end-users of the library using it in every-day projects
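A short sketch of the two ways to pick a verbosity level that this page describes; the environment variable is only read when `transformers` is imported, and the `fatal` value added above is assumed here to behave like `critical`:

```py
import os

os.environ["TRANSFORMERS_VERBOSITY"] = "error"  # read once, at import time

import transformers

print(transformers.logging.get_verbosity())  # 40 == ERROR, picked up from the env var
transformers.logging.set_verbosity_info()    # switch levels later through the API
print(transformers.logging.get_verbosity())  # 20 == INFO
```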

@@ -38,7 +38,7 @@ The `.optimization` module provides:
 ## Schedules
-### Learning Rate Schedules (Pytorch)
+### Learning Rate Schedules (PyTorch)
 [[autodoc]] SchedulerType

@@ -42,7 +42,7 @@ an optional `attentions` attribute. Here we have the `loss` since we passed alon
 <Tip>
-When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_states` exactly.
+When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_state` exactly.
 However, this is not always the case. Some models apply normalization or subsequent process to the last hidden state when it's returned.
 </Tip>
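To make the corrected attribute name concrete, a small check; BERT is used here only because its last hidden state is returned unmodified, so the two tensors coincide for this model:

```py
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# True for BERT; other models may post-process the last hidden state before returning it.
print(torch.allclose(outputs.hidden_states[-1], outputs.last_hidden_state))
```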

@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
 The [`Trainer`] class provides an API for feature-complete training in PyTorch, and it supports distributed training on multiple GPUs/TPUs, mixed precision for [NVIDIA GPUs](https://nvidia.github.io/apex/), [AMD GPUs](https://rocm.docs.amd.com/en/latest/rocm.html), and [`torch.amp`](https://pytorch.org/docs/stable/amp.html) for PyTorch. [`Trainer`] goes hand-in-hand with the [`TrainingArguments`] class, which offers a wide range of options to customize how a model is trained. Together, these two classes provide a complete training API.
-[`Seq2SeqTrainer`] and [`Seq2SeqTrainingArguments`] inherit from the [`Trainer`] and [`TrainingArgument`] classes and they're adapted for training models for sequence-to-sequence tasks such as summarization or translation.
+[`Seq2SeqTrainer`] and [`Seq2SeqTrainingArguments`] inherit from the [`Trainer`] and [`TrainingArguments`] classes and they're adapted for training models for sequence-to-sequence tasks such as summarization or translation.
 <Tip warning={true}>

@@ -106,7 +106,7 @@ as the information relative to the inputs and outputs.
 [[autodoc]] TFCamembertModel
-## TFCamembertForCasualLM
+## TFCamembertForCausalLM
 [[autodoc]] TFCamembertForCausalLM

@@ -19,7 +19,7 @@ rendered properly in your Markdown viewer.
 ## Overview
 The CLIPSeg model was proposed in [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke
-and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen [CLIP](clip) model for zero- and one-shot image segmentation.
+and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen [CLIP](clip) model for zero-shot and one-shot image segmentation.
 The abstract from the paper is the following:

@@ -34,7 +34,7 @@ This model was contributed by [ArthurZucker](https://huggingface.co/ArthurZ). Th
 The `Llama2` family models, on which Code Llama is based, were trained using `bfloat16`, but the original inference uses `float16`. Let's look at the different precisions:
-* `float32`: PyTorch convention on model initialization is to load models in `float32`, no matter with which `dtype` the model weights were stored. `transformers` also follows this convention for consistency with PyTorch. This will be picked by default. If you want the `AutoModel` API to cast the load the checkpoints with the storage weights type, you must specify `torch_dtype="auto"`, e.g. `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`.
+* `float32`: PyTorch convention on model initialization is to load models in `float32`, no matter with which `dtype` the model weights were stored. `transformers` also follows this convention for consistency with PyTorch. This will be picked by default. If you want the `AutoModel` API to load the checkpoints with the storage weights type, you must specify `torch_dtype="auto"`, e.g. `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`.
 * `bfloat16`: Code Llama was trained with this precision, so we recommend using it for further training or fine-tuning.
 * `float16`: We recommend running inference using this precision, as it's usually faster than `bfloat16`, and evaluation metrics show no discernible degradation with respect to `bfloat16`. You can also run inference using `bfloat16`, and we recommend you check inference results with both `float16` and `bfloat16` after fine-tuning.
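A minimal sketch of the `torch_dtype="auto"` behaviour described in the first bullet; the checkpoint id is only an example, and any Code Llama checkpoint you have access to works the same way:

```py
from transformers import AutoModelForCausalLM

# "auto" keeps the dtype the weights were stored in instead of upcasting
# to PyTorch's float32 default.
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    torch_dtype="auto",
)
print(model.dtype)  # the storage dtype of the checkpoint, bfloat16 for Code Llama
```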

@@ -27,7 +27,7 @@ Due to its architecture, FalconMamba is significantly faster at inference and re
 Tips:
-- FalconMamba is mostly based on Mamba architecutre, the same [tips and best practices](./mamba) would be relevant here.
+- FalconMamba is mostly based on Mamba architecture, the same [tips and best practices](./mamba) would be relevant here.
 The model has been trained on approximtely 6T tokens consisting a mixture of many data sources such as RefineWeb, Cosmopedia and Math data.

@@ -31,7 +31,7 @@ alt="drawing" width="600"/>
 <small> Hiera architecture. Taken from the <a href="https://arxiv.org/abs/2306.00989">original paper.</a> </small>
-This model was a joint contibution by [EduardoPacheco](https://huggingface.co/EduardoPacheco) and [namangarg110](https://huggingface.co/namangarg110). The original code can be found [here] (https://github.com/facebookresearch/hiera).
+This model was a joint contribution by [EduardoPacheco](https://huggingface.co/EduardoPacheco) and [namangarg110](https://huggingface.co/namangarg110). The original code can be found [here] (https://github.com/facebookresearch/hiera).
 ## Resources

@@ -33,7 +33,7 @@ alt="drawing" width="600"/>
 ## Usage
-### Presequities
+### Prerequisites
 Jamba requires you use `transformers` version 4.39.0 or higher:
 ```bash

@@ -61,7 +61,7 @@ print(processor.decode(predictions[0], skip_special_tokens=True))
 ## Fine-tuning
-To fine-tune MatCha, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faste convergence:
+To fine-tune MatCha, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faster convergence:
 ```python
 from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup

@@ -83,7 +83,7 @@ keyword, and target text format passed with the `text_label` keyword argument.
 ## Overview of MBart-50
 MBart-50 was introduced in the [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) paper by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav
-Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original *mbart-large-cc25* checkpoint by extendeding
+Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original *mbart-large-cc25* checkpoint by extending
 its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50
 languages.

@@ -31,7 +31,7 @@ Mixtral-8x7B is the second large language model (LLM) released by [mistral.ai](h
 Mixtral-8x7B is a decoder-only Transformer with the following architectural choices:
 - Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP, with a total of 45 billion parameters. To learn more about mixture-of-experts, refer to the [blog post](https://huggingface.co/blog/moe).
-- Despite the model having 45 billion parameters,, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length.
+- Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length.
 The following implementation details are shared with Mistral AI's first model [Mistral-7B](mistral):
 - Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens

@@ -242,7 +242,7 @@ export UROMAN=$(pwd)
 ```
 You can then pre-process the text input using the following code snippet. You can either rely on using the bash variable
-`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromaize` function:
+`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromanize` function:
 ```python
 import torch
@@ -270,9 +270,9 @@ def uromanize(input_string, uroman_path):
 return stdout.decode()[:-1]
 text = "이봐 무슨 일이야"
-uromaized_text = uromanize(text, uroman_path=os.environ["UROMAN"])
-inputs = tokenizer(text=uromaized_text, return_tensors="pt")
+uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"])
+inputs = tokenizer(text=uromanized_text, return_tensors="pt")
 set_seed(555) # make deterministic
 with torch.no_grad():

@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
 ## Overview
-The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and finetuned variants. The MPT models is a series of open source and commercially usable LLMs pre-trained on 1T tokens.
+The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and finetuned variants. The MPT models are a series of open source and commercially usable LLMs pre-trained on 1T tokens.
 MPT models are GPT-style decoder-only transformers with several improvements: performance-optimized layer implementations, architecture changes that provide greater training stability, and the elimination of context length limits by replacing positional embeddings with ALiBi.

@@ -39,7 +39,7 @@ This model was contributed by [Jitesh Jain](https://huggingface.co/praeclarumjj3
 - If you want to train the model in a distributed environment across multiple nodes, then one should update the
 `get_num_masks` function inside in the `OneFormerLoss` class of `modeling_oneformer.py`. When training on multiple nodes, this should be
 set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/SHI-Labs/OneFormer/blob/33ebb56ed34f970a30ae103e786c0cb64c653d9a/oneformer/modeling/criterion.py#L287).
-- One can use [`OneFormerProcessor`] to prepare input images and task inputs for the model and optional targets for the model. [`OneformerProcessor`] wraps [`OneFormerImageProcessor`] and [`CLIPTokenizer`] into a single instance to both prepare the images and encode the task inputs.
+- One can use [`OneFormerProcessor`] to prepare input images and task inputs for the model and optional targets for the model. [`OneFormerProcessor`] wraps [`OneFormerImageProcessor`] and [`CLIPTokenizer`] into a single instance to both prepare the images and encode the task inputs.
 - To get the final segmentation, depending on the task, you can call [`~OneFormerProcessor.post_process_semantic_segmentation`] or [`~OneFormerImageProcessor.post_process_instance_segmentation`] or [`~OneFormerImageProcessor.post_process_panoptic_segmentation`]. All three tasks can be solved using [`OneFormerForUniversalSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together.
 ## Resources

@@ -29,7 +29,7 @@ rendered properly in your Markdown viewer.
 OpenAI GPT model was proposed in [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
 by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer
-pre-trained using language modeling on a large corpus will long range dependencies, the Toronto Book Corpus.
+pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
 The abstract from the paper is the following:

@@ -54,7 +54,7 @@ This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The o
 <Tip>
-PhoBERT implementation is the same as BERT, except for tokenization. Refer to [EART documentation](bert) for information on
+PhoBERT implementation is the same as BERT, except for tokenization. Refer to [BERT documentation](bert) for information on
 configuration classes and their parameters. PhoBERT-specific tokenizer is documented below.
 </Tip>

@@ -27,7 +27,7 @@ The abstract from the paper is the following:
 Tips:
 - One can use [`SegGptImageProcessor`] to prepare image input, prompt and mask to the model.
 - One can either use segmentation maps or RGB images as prompt masks. If using the latter make sure to set `do_convert_rgb=False` in the `preprocess` method.
-- It's highly advisable to pass `num_labels` when using `segmetantion_maps` (not considering background) during preprocessing and postprocessing with [`SegGptImageProcessor`] for your use case.
+- It's highly advisable to pass `num_labels` when using `segmentation_maps` (not considering background) during preprocessing and postprocessing with [`SegGptImageProcessor`] for your use case.
 - When doing inference with [`SegGptForImageSegmentation`] if your `batch_size` is greater than 1 you can use feature ensemble across your images by passing `feature_ensemble=True` in the forward method.
 Here's how to use the model for one-shot semantic segmentation:

@@ -19,7 +19,7 @@ rendered properly in your Markdown viewer.
 ## Overview
 The Swin2SR model was proposed in [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte.
-Swin2R improves the [SwinIR](https://github.com/JingyunLiang/SwinIR/) model by incorporating [Swin Transformer v2](swinv2) layers which mitigates issues such as training instability, resolution gaps between pre-training
+Swin2SR improves the [SwinIR](https://github.com/JingyunLiang/SwinIR/) model by incorporating [Swin Transformer v2](swinv2) layers which mitigates issues such as training instability, resolution gaps between pre-training
 and fine-tuning, and hunger on data.
 The abstract from the paper is the following:

@@ -127,7 +127,7 @@ export UROMAN=$(pwd)
 ```
 You can then pre-process the text input using the following code snippet. You can either rely on using the bash variable
-`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromaize` function:
+`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromanize` function:
 ```python
 import torch
@@ -155,9 +155,9 @@ def uromanize(input_string, uroman_path):
 return stdout.decode()[:-1]
 text = "이봐 무슨 일이야"
-uromaized_text = uromanize(text, uroman_path=os.environ["UROMAN"])
-inputs = tokenizer(text=uromaized_text, return_tensors="pt")
+uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"])
+inputs = tokenizer(text=uromanized_text, return_tensors="pt")
 set_seed(555) # make deterministic
 with torch.no_grad():

@@ -43,7 +43,7 @@ low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% fo
 also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the
 trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource
 languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing
-per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We
+per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We
 will make XLM-R code, data, and models publicly available.*
 This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).

@@ -166,7 +166,7 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o
 [[autodoc]] TFXLNetForSequenceClassification
 - call
-## TFLNetForMultipleChoice
+## TFXLNetForMultipleChoice
 [[autodoc]] TFXLNetForMultipleChoice
 - call

@@ -47,7 +47,7 @@ As a result, you can load a specific model version with the `revision` parameter
 ... )
 ```
-Files are also easily edited in a repository, and you can view the commit history as well as the difference:
+Files are also easily edited in a repository, and you can view the commit history as well as the differences:
 ![vis_diff](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vis_diff.png)

@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
 Batched inputs are often different lengths, so they can't be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a special **padding token** to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences.
-In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well. However, the API supports more strategies if you need them. The three arguments you need to are: `padding`, `truncation` and `max_length`.
+In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well. However, the API supports more strategies if you need them. The three arguments you need to know are: `padding`, `truncation` and `max_length`.
 The `padding` argument controls padding. It can be a boolean or a string:
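A small sketch of the three arguments named above, using an arbitrary checkpoint for the tokenizer:

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any checkpoint works here

batch = tokenizer(
    ["A short sentence.", "A noticeably longer sentence that will dominate the batch length."],
    padding="longest",  # pad to the longest sequence in the batch
    truncation=True,    # cut anything above max_length
    max_length=16,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
```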

@@ -46,7 +46,7 @@ pip install git+https://github.com/huggingface/peft.git
 - [IA3](https://huggingface.co/docs/peft/conceptual_guides/ia3)
 - [AdaLoRA](https://arxiv.org/abs/2303.10512)
-If you want to use other PEFT methods, such as prompt learning or prompt tuning, or about the 🤗 PEFT library in general, please refer to the [documentation](https://huggingface.co/docs/peft/index).
+If you want to use other PEFT methods, such as prompt learning or prompt tuning, or learn about the 🤗 PEFT library in general, please refer to the [documentation](https://huggingface.co/docs/peft/index).
 ## Load a PEFT adapter
@@ -125,7 +125,7 @@ Now you can use [`~peft.PeftModel.set_adapter`] to set which adapter to use:
 ```py
 # use adapter_1
 model.set_adapter("adapter_1")
-output = model.generate(**inputs)
+output_disabled = model.generate(**inputs)
 print(tokenizer.decode(output_disabled[0], skip_special_tokens=True))
 # use adapter_2
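A sketch of the adapter-switching flow this hunk touches (requires `peft` installed); the base model id is an arbitrary small checkpoint and the two adapter repository ids are placeholders, not repositories referenced by the guide:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "facebook/opt-350m"  # stand-in base model
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Hypothetical PEFT adapters trained on top of the same base model.
model.load_adapter("your-org/opt-350m-adapter-1", adapter_name="adapter_1")
model.load_adapter("your-org/opt-350m-adapter-2", adapter_name="adapter_2")

inputs = tokenizer("Hello, my name is", return_tensors="pt")

model.set_adapter("adapter_1")
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))

model.set_adapter("adapter_2")
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```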

@@ -116,7 +116,7 @@ Each new generation provides a faster bandwidth, e.g. here is a quote from [Nvid
 So the higher `X` you get in the report of `NVX` in the output of `nvidia-smi topo -m` the better. The generation will depend on your GPU architecture.
-Let's compare the execution of an openai-community/gpt2 language model training over a small sample of wikitext.
+Let's compare the execution of an `openai-community/gpt2` language model training over a small sample of wikitext.
 The results are:

@@ -395,7 +395,7 @@ Choose which backend to use by specifying it via `torch_compile_backend` in the
 * `dynamo.optimize("aot_cudagraphs")` - cudagraphs with AotAutograd. [Read more](https://github.com/pytorch/torchdynamo/pull/757)
 **Inference-only backend**s:
-* `dynamo.optimize("ofi")` - Uses Torchscript optimize_for_inference. [Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html)
+* `dynamo.optimize("ofi")` - Uses TorchScript optimize_for_inference. [Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html)
 * `dynamo.optimize("fx2trt")` - Uses NVIDIA TensorRT for inference optimizations. [Read more](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html)
 * `dynamo.optimize("onnxrt")` - Uses ONNXRT for inference on CPU/GPU. [Read more](https://onnxruntime.ai/)
 * `dynamo.optimize("ipex")` - Uses IPEX for inference on CPU. [Read more](https://github.com/intel/intel-extension-for-pytorch)
@@ -413,7 +413,7 @@ For example with a vanilla AdamW, the memory requirement for the optimizer state
 * Momentum: 4 bytes/param
 * Variance: 4 bytes/param
-Suppose a model with 7B parameters and 200 millions parameters injected with [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora).
+Suppose a model with 7B parameters and 200 million parameters injected with [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora).
 The memory requirement for the optimizer state of the plain model would be 12 * 7 = 84 GB (assuming 7B trainable parameters).
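The arithmetic in that last line, written out as plain back-of-the-envelope numbers; the 12 bytes/param assumes the AdamW state breakdown the guide lists (fp32 parameter copy plus the momentum and variance bullets shown above):

```py
bytes_per_param = 4 + 4 + 4  # fp32 copy + momentum + variance for AdamW

full_finetune_gb = 7e9 * bytes_per_param / 1e9    # 84.0 GB for 7B trainable parameters
lora_finetune_gb = 200e6 * bytes_per_param / 1e9  # 2.4 GB for 200M adapter parameters
print(full_finetune_gb, lora_finetune_gb)
```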

@@ -158,5 +158,5 @@ There was a lot in here, so lets summarize with a quick checklist you can fol
 - Create your `TPUStrategy` and make sure dataset loading and model creation are inside the `strategy.scope()` (see [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb))
 - Dont forget to take `jit_compile=True` out again when you move to TPU!
 - 🙏🙏🙏🥺🥺🥺
-- Call model.fit()
+- Call `model.fit()`
 - You did it!

@@ -24,7 +24,7 @@ Training large transformer models and deploying them to production present vario
 During training, the model may require more GPU memory than available or exhibit slow training speed. In the deployment
 phase, the model can struggle to handle the required throughput in a production environment.
-This documentation aims to assist you in overcoming these challenges and finding the optimal setting for your use-case.
+This documentation aims to assist you in overcoming these challenges and finding the optimal settings for your use-case.
 The guides are divided into training and inference sections, as each comes with different challenges and solutions.
 Within each section you'll find separate guides for different hardware configurations, such as single GPU vs. multi-GPU
 for training or CPU vs. GPU for inference.

@@ -166,7 +166,7 @@ Note that instead of applying this to a whole class, you can apply it to the rel
 # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights
 ```
-Sometimes the copy is exactly the same except for names: for instance in `RobertaAttention`, we use `RobertaSelfAttention` insted of `BertSelfAttention` but other than that, the code is exactly the same. This is why `# Copied from` supports simple string replacements with the following syntax: `Copied from xxx with foo->bar`. This means the code is copied with all instances of `foo` being replaced by `bar`. You can see how it used [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86) in `RobertaAttention` with the comment:
+Sometimes the copy is exactly the same except for names: for instance in `RobertaAttention`, we use `RobertaSelfAttention` instead of `BertSelfAttention` but other than that, the code is exactly the same. This is why `# Copied from` supports simple string replacements with the following syntax: `Copied from xxx with foo->bar`. This means the code is copied with all instances of `foo` being replaced by `bar`. You can see how it used [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86) in `RobertaAttention` with the comment:
 ```py
 # Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta

@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
 [[open-in-colab]]
-Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, they need to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:
+Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:
 * Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
 * Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.

@@ -19,7 +19,7 @@ rendered properly in your Markdown viewer.
 > [!TIP]
 > Try AQLM on [Google Colab](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing)!
-Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and take advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.
+Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.
 Inference support for AQLM is realised in the `aqlm` library. Make sure to install it to run the models (note aqlm works only with python>=3.10):
 ```bash

@@ -274,7 +274,7 @@ For inference, the `bnb_4bit_quant_type` does not have a huge impact on performa
 ### Nested quantization
-Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an addition 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps.
+Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps.
 ```py
 from transformers import BitsAndBytesConfig
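A sketch of the double-quantization setup described above; the gated Llama-13b checkpoint from the text is kept only for illustration, and any causal LM you can access works the same way:

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # the second quantization pass, ~0.4 bits/param saved
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b",
    quantization_config=double_quant_config,
    device_map="auto",
)
```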


@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
The [EETQ](https://github.com/NetEase-FuXi/EETQ) library supports int8 per-channel weight-only quantization for NVIDIA GPUs. The high-performance GEMM and GEMV kernels are from FasterTransformer and TensorRT-LLM. It requires no calibration dataset and does not need to pre-quantize your model. Moreover, the accuracy degradation is negligible owing to the per-channel quantization.
Make sure you have `eetq` installed from the [release page](https://github.com/NetEase-FuXi/EETQ/releases):
```
pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl
```
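With `eetq` installed, quantizing a model on load is roughly the sketch below; it assumes the `EetqConfig` integration in recent transformers releases, and the model id is only an example:

```py
from transformers import AutoModelForCausalLM, EetqConfig

# "int8" requests per-channel weight-only int8 quantization
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map="auto", quantization_config=quantization_config)
```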


@@ -31,7 +31,7 @@ Before you begin, make sure the following libraries are installed with their lat
pip install --upgrade accelerate fbgemm-gpu torch
```
If you are having issues with the fbgemm-gpu and torch libraries, you might need to install the nightly release. You can follow the instructions [here](https://pytorch.org/FBGEMM/fbgemm_gpu-development/InstallationInstructions.html#fbgemm-gpu-install-libraries:~:text=found%20here.-,Install%20the%20FBGEMM_GPU%20Package,-Install%20through%20PyTorch)
```py
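# a sketch of fp8 quantization on load; it assumes the FbgemmFp8Config integration and an example model id
from transformers import AutoModelForCausalLM, FbgemmFp8Config

quantization_config = FbgemmFp8Config()
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", device_map="auto", quantization_config=quantization_config)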


@@ -64,6 +64,6 @@ model = transformers.AutoModelForCausalLM.from_pretrained(
## Optimized Runtime
HQQ supports various backends, including pure PyTorch and custom dequantization CUDA kernels. These backends are suitable for older GPUs and peft/QLoRA training.
For faster inference, HQQ supports 4-bit fused kernels (TorchAO and Marlin), reaching up to 200 tokens/sec on a single 4090.
For more details on how to use the backends, please refer to https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend
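As a rough sketch, a 4-bit HQQ setup through transformers can look like the following; the model id and group size are illustrative, and backend selection itself happens through the `hqq` package as described in the link above:

```py
from transformers import AutoModelForCausalLM, HqqConfig

# 4-bit weights, quantized in groups of 64
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="cuda", quantization_config=quant_config)
```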


@@ -55,7 +55,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cud
Note that serialization is not supported yet with transformers, but it is coming soon! If you want to save the model, you can use the quanto library instead.
The Quanto library uses a linear quantization algorithm. Even though this is a basic quantization technique, we get very good results! Have a look at the following benchmark (llama-2-7b on the perplexity metric). You can find more benchmarks [here](https://github.com/huggingface/quanto/tree/main/bench/generation).
<div class="flex gap-4">
<div>
@@ -63,4 +63,4 @@ Quanto library uses linear quantization algorithm for quantization. Even though
</div>
</div>
The library is versatile enough to be compatible with most PTQ optimization algorithms. The plan is to integrate the most popular algorithms (AWQ, SmoothQuant) as seamlessly as possible in the future.


@@ -33,7 +33,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# compile the quantized model to get speedup
import torchao
torchao.quantization.utils.recommended_inductor_config_setter()
quantized_model = torch.compile(quantized_model, mode="max-autotune")
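
# a usage sketch: run generation with the compiled model (max_new_tokens is an arbitrary choice)
output = quantized_model.generate(**input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))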


@@ -22,7 +22,7 @@ rendered properly in your Markdown viewer.
The documentation has been moved to [hf.co/docs/sagemaker](https://huggingface.co/docs/sagemaker). This page will be removed in `transformers` 5.0.
### Table of Contents
- [Train Hugging Face models on Amazon SageMaker with the SageMaker Python SDK](https://huggingface.co/docs/sagemaker/train)
- [Deploy Hugging Face models to Amazon SageMaker with the SageMaker Python SDK](https://huggingface.co/docs/sagemaker/inference)


@@ -153,11 +153,11 @@ directly.
<Tip warning={true}>
`transformers.onnx` is no longer maintained; please export models with 🤗 Optimum as described above. This section will be removed in future versions.
</Tip>
To export a 🤗 Transformers model to ONNX with `transformers.onnx`, install extra dependencies:
```bash
pip install transformers[onnx]
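
# then export a checkpoint, for example (the model id is only an illustration):
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/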


@@ -196,7 +196,7 @@ Now instantiate your `DataCollatorForCTCWithPadding`:
## Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [word error rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):
```py
>>> import evaluate
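>>> # load the word error rate metric described above
>>> wer = evaluate.load("wer")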


@@ -164,7 +164,7 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [
## Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):
```py
>>> import evaluate
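>>> # load the accuracy metric described above
>>> accuracy = evaluate.load("accuracy")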


@@ -204,7 +204,7 @@ for value in generator:
## Fit models in smaller hardware
VLMs are often large and need to be optimized to fit on smaller hardware. Transformers supports many model quantization libraries, and here we will only show int8 quantization with [Quanto](./quantization/quanto#quanto). int8 quantization offers memory improvements up to 75 percent (if all weights are quantized). However, it is no free lunch: since 8-bit is not a CUDA-native precision, the weights are quantized back and forth on the fly, which adds latency.
First, install dependencies.
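Once the dependencies are installed, an int8 load is roughly the following sketch; the model id is only an example:

```py
from transformers import LlavaForConditionalGeneration, QuantoConfig

# quantize the weights to int8 while loading
quantization_config = QuantoConfig(weights="int8")
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", device_map="cuda", quantization_config=quantization_config
)
```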


@@ -36,6 +36,7 @@ We can now initialize the pipeline with a [Swin2SR model](https://huggingface.co
```python
from transformers import pipeline
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pipe = pipeline(task="image-to-image", model="caidas/swin2SR-lightweight-x2-64", device=device)
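
# a usage sketch: upscale a local image (the file path is hypothetical)
from PIL import Image
image = Image.open("low_res_cat.jpg")
upscaled = pipe(image)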


@@ -159,7 +159,7 @@ def colorize(value, vmin=None, vmax=None, cmap='gray_r', invalid_val=-99, invali
"""Converts a depth map to a color image.
Args:
value (torch.Tensor, numpy.ndarray): Input depth map. Shape: (H, W) or (1, H, W) or (1, 1, H, W). All singular dimensions are squeezed.
vmin (float, optional): vmin-valued entries are mapped to start color of cmap. If None, value.min() is used. Defaults to None.
vmax (float, optional): vmax-valued entries are mapped to end color of cmap. If None, value.max() is used. Defaults to None.
cmap (str, optional): matplotlib colormap to use. Defaults to 'gray_r'.


@@ -290,7 +290,7 @@ Result: Modern tools often used to make gazpacho include
#### Reasoning
Reasoning is one of the most difficult tasks for LLMs, and achieving good results often requires applying advanced prompting techniques, like
[Chain-of-thought](#chain-of-thought).
Let's see if we can make a model reason about a simple arithmetic task with a basic prompt:


@@ -580,7 +580,7 @@ Load the model from the 🤗 Hub:
>>> model = SpeechT5ForTextToSpeech.from_pretrained("YOUR_ACCOUNT/speecht5_finetuned_voxpopuli_nl")
```
Pick an example from the test dataset to obtain a speaker embedding.
```py
>>> example = dataset["test"][304]
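>>> # a sketch, assuming the processed examples store the precomputed x-vector under "speaker_embeddings"
>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)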


@@ -191,7 +191,7 @@ You should probably TRAIN this model on a down-stream task to be able to use it
The warning is telling us we are throwing away some weights (e.g. the weights and bias of the `classifier` layer) and randomly initializing some others (the weights and bias of a new `classifier` layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.
**Note** that [this checkpoint](https://huggingface.co/MCG-NJU/videomae-base-finetuned-kinetics) leads to better performance on this task as the checkpoint was obtained by fine-tuning on a similar downstream task having considerable domain overlap. You can check out [this checkpoint](https://huggingface.co/sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset) which was obtained by fine-tuning `MCG-NJU/videomae-base-finetuned-kinetics`.
## Prepare the datasets for training


@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# How 🤗 Transformers solve tasks
In [What 🤗 Transformers can do](task_summary), you learned about natural language processing (NLP), speech and audio, computer vision tasks, and some important applications of them. This page will look closely at how models solve these tasks and explain what's happening under the hood. There are many ways to solve a given task; some models may implement certain techniques or even approach the task from a new angle, but for Transformer models, the general idea is the same. Owing to its flexible architecture, most models are a variant of an encoder, a decoder, or an encoder-decoder structure. In addition to Transformer models, our library also has several convolutional neural networks (CNNs), which are still used today for computer vision tasks. We'll also explain how a modern CNN works.
To explain how tasks are solved, we'll walk through what goes on inside the model to output useful predictions.


@@ -1226,6 +1226,8 @@ import numpy as np
np.random.seed(seed)
# tf RNG
import tensorflow as tf
tf.random.set_seed(seed)
```


@@ -219,7 +219,7 @@ You only need to modify the following line:
```diff
- torch.jit.trace(model, [tokens_tensor, segments_tensors])
+ torch.neuron.trace(model, [tokens_tensor, segments_tensors])
```
This enables the Neuron SDK to trace the model and optimize it for Inf1 instances.


@@ -299,7 +299,7 @@ trainer = trl.SFTTrainer(
trainer.train()
```
To pass extra arguments supported by GaLore, set `optim_args` accordingly, for example:
```python
import torch
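
# a sketch of passing GaLore-specific arguments through optim_args;
# the output_dir, target-module patterns, and rank/update_proj_gap/scale values are illustrative
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./galore-sft",
    optim="galore_adamw",
    optim_target_modules=[r".*attn.*", r".*mlp.*"],
    optim_args="rank=64, update_proj_gap=100, scale=0.10",
)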