From eedd21b9e7ef38f628240b9fe62e3fc28b662e04 Mon Sep 17 00:00:00 2001
From: Nilay Bhatnagar
Date: Mon, 9 Sep 2024 14:17:24 +0530
Subject: [PATCH] Fixed Majority of the Typos in `transformers[en]` Documentation (#33350)

* Fixed typo: insted to instead
* Fixed typo: relase to release
* Fixed typo: nighlty to nightly
* Fixed typos: versatible, benchamarks, becnhmark to versatile, benchmark, benchmarks
* Fixed typo in comment: quantizd to quantized
* Fixed typo: architecutre to architecture
* Fixed typo: contibution to contribution
* Fixed typo: Presequities to Prerequisites
* Fixed typo: faste to faster
* Fixed typo: extendeding to extending
* Fixed typo: segmetantion_maps to segmentation_maps
* Fixed typo: Alternativelly to Alternatively
* Fixed incorrectly defined variable: output to output_disabled
* Fixed typo in library name: tranformers.onnx to transformers.onnx
* Fixed missing import: import tensorflow as tf
* Fixed incorrectly defined variable: token_tensor to tokens_tensor
* Fixed missing import: import torch
* Fixed incorrectly defined variable and typo: uromaize to uromanize
* Fixed incorrectly defined variable and typo: uromaize to uromanize
* Fixed typo in function args: numpy.ndarry to numpy.ndarray
* Fixed Inconsistent Library Name: Torchscript to TorchScript
* Fixed Inconsistent Class Name: OneformerProcessor to OneFormerProcessor
* Fixed Inconsistent Class Name Typo: TFLNetForMultipleChoice to TFXLNetForMultipleChoice
* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch
* Fixed Inconsistent Function Name Typo: captureWarning to captureWarnings
* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch
* Fixed Inconsistent Class Name Typo: TrainingArgument to TrainingArguments
* Fixed Inconsistent Model Name Typo: Swin2R to Swin2SR
* Fixed Inconsistent Model Name Typo: EART to BERT
* Fixed Inconsistent Library Name Typo: TensorFLow to TensorFlow
* Fixed Broken Link for Speech Emotion Classification with Wav2Vec2
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed minor missing word Typo
* Fixed Punctuation: Two commas
* Fixed Punctuation: No Space between XLM-R and is
* Fixed Punctuation: No Space between [~accelerate.Accelerator.backward] and method
* Added backticks to display model.fit() in codeblock
* Added backticks to display openai-community/gpt2 in codeblock
* Fixed Minor Typo: will to with
* Fixed Minor Typo: is to are
* Fixed Minor Typo: in to on
* Fixed Minor Typo: inhibits to exhibits
* Fixed Minor Typo: they need to it needs
* Fixed Minor Typo: cast the load the checkpoints To load the checkpoints
* Fixed Inconsistent Class Name Typo: TFCamembertForCasualLM to TFCamembertForCausalLM
* Fixed typo in attribute name: outputs.last_hidden_states to outputs.last_hidden_state
* Added missing verbosity level: fatal
* Fixed Minor Typo: take To takes
* Fixed Minor Typo: heuristic To heuristics
* Fixed Minor Typo: setting To settings
* Fixed Minor Typo: Content To Contents
* Fixed Minor Typo: millions To million
* Fixed Minor Typo: difference To differences
* Fixed Minor Typo: while extract To which extracts
* Fixed Minor Typo: Hereby To Here
* Fixed Minor Typo: addition To additional
* Fixed Minor Typo: supports To supported
* Fixed Minor Typo: so that benchmark results TO as a consequence, benchmark
* Fixed Minor Typo: a To an
* Fixed Minor Typo: a To an
* Fixed Minor Typo: Chain-of-though To Chain-of-thought
---
 docs/source/en/accelerate.md                       | 2 +-
 docs/source/en/autoclass_tutorial.md               | 2 +-
 docs/source/en/benchmarks.md                       | 4 ++--
 docs/source/en/bertology.md                        | 2 +-
 docs/source/en/community.md                        | 2 +-
 docs/source/en/debugging.md                        | 2 +-
 docs/source/en/generation_strategies.md            | 2 +-
 docs/source/en/installation.md                     | 2 +-
 docs/source/en/main_classes/logging.md             | 8 ++++----
 docs/source/en/main_classes/optimizer_schedules.md | 2 +-
 docs/source/en/main_classes/output.md              | 2 +-
 docs/source/en/main_classes/trainer.md             | 2 +-
 docs/source/en/model_doc/camembert.md              | 2 +-
 docs/source/en/model_doc/clipseg.md                | 2 +-
 docs/source/en/model_doc/code_llama.md             | 2 +-
 docs/source/en/model_doc/falcon_mamba.md           | 2 +-
 docs/source/en/model_doc/hiera.md                  | 2 +-
 docs/source/en/model_doc/jamba.md                  | 2 +-
 docs/source/en/model_doc/matcha.md                 | 2 +-
 docs/source/en/model_doc/mbart.md                  | 2 +-
 docs/source/en/model_doc/mixtral.md                | 2 +-
 docs/source/en/model_doc/mms.md                    | 6 +++---
 docs/source/en/model_doc/mpt.md                    | 2 +-
 docs/source/en/model_doc/oneformer.md              | 2 +-
 docs/source/en/model_doc/openai-gpt.md             | 2 +-
 docs/source/en/model_doc/phobert.md                | 2 +-
 docs/source/en/model_doc/seggpt.md                 | 2 +-
 docs/source/en/model_doc/swin2sr.md                | 2 +-
 docs/source/en/model_doc/vits.md                   | 6 +++---
 docs/source/en/model_doc/xlm-roberta.md            | 2 +-
 docs/source/en/model_doc/xlnet.md                  | 2 +-
 docs/source/en/model_sharing.md                    | 2 +-
 docs/source/en/pad_truncation.md                   | 2 +-
 docs/source/en/peft.md                             | 4 ++--
 docs/source/en/perf_hardware.md                    | 2 +-
 docs/source/en/perf_train_gpu_one.md               | 4 ++--
 docs/source/en/perf_train_tpu_tf.md                | 2 +-
 docs/source/en/performance.md                      | 2 +-
 docs/source/en/pr_checks.md                        | 2 +-
 docs/source/en/preprocessing.md                    | 2 +-
 docs/source/en/quantization/aqlm.md                | 2 +-
 docs/source/en/quantization/bitsandbytes.md        | 2 +-
 docs/source/en/quantization/eetq.md                | 2 +-
 docs/source/en/quantization/fbgemm_fp8.md          | 2 +-
 docs/source/en/quantization/hqq.md                 | 2 +-
 docs/source/en/quantization/quanto.md              | 4 ++--
 docs/source/en/quantization/torchao.md             | 2 +-
 docs/source/en/sagemaker.md                        | 2 +-
 docs/source/en/serialization.md                    | 4 ++--
 docs/source/en/tasks/asr.md                        | 2 +-
 docs/source/en/tasks/audio_classification.md       | 2 +-
 docs/source/en/tasks/image_text_to_text.md         | 2 +-
 docs/source/en/tasks/image_to_image.md             | 1 +
 docs/source/en/tasks/monocular_depth_estimation.md | 2 +-
 docs/source/en/tasks/prompting.md                  | 2 +-
 docs/source/en/tasks/text-to-speech.md             | 2 +-
 docs/source/en/tasks/video_classification.md       | 2 +-
 docs/source/en/tasks_explained.md                  | 2 +-
 docs/source/en/testing.md                          | 2 ++
 docs/source/en/torchscript.md                      | 2 +-
 docs/source/en/trainer.md                          | 2 +-
 61 files changed, 74 insertions(+), 71 deletions(-)

diff --git a/docs/source/en/accelerate.md b/docs/source/en/accelerate.md
index b0f0e4efe64..e0a7a9c6562 100644
--- a/docs/source/en/accelerate.md
+++ b/docs/source/en/accelerate.md
@@ -46,7 +46,7 @@ The next step is to pass all the relevant training objects to the [`~accelerate.
## Backward -The last addition is to replace the typical `loss.backward()` in your training loop with πŸ€— Accelerate's [`~accelerate.Accelerator.backward`]method: +The last addition is to replace the typical `loss.backward()` in your training loop with πŸ€— Accelerate's [`~accelerate.Accelerator.backward`] method: ```py >>> for epoch in range(num_epochs): diff --git a/docs/source/en/autoclass_tutorial.md b/docs/source/en/autoclass_tutorial.md index eacfdb441c2..0f02f19ed29 100644 --- a/docs/source/en/autoclass_tutorial.md +++ b/docs/source/en/autoclass_tutorial.md @@ -110,7 +110,7 @@ Now you can access the `feature_maps` object from the first stage of the backbon ## AutoFeatureExtractor -For audio tasks, a feature extractor processes the audio signal the correct input format. +For audio tasks, a feature extractor processes the audio signal into the correct input format. Load a feature extractor with [`AutoFeatureExtractor.from_pretrained`]: diff --git a/docs/source/en/benchmarks.md b/docs/source/en/benchmarks.md index 1fd61cc8de4..c61a21bb532 100644 --- a/docs/source/en/benchmarks.md +++ b/docs/source/en/benchmarks.md @@ -35,7 +35,7 @@ The classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] allow to flexibly b -Hereby, _inference_ is defined by a single forward pass, and _training_ is defined by a single forward pass and +Here, _inference_ is defined by a single forward pass, and _training_ is defined by a single forward pass and backward pass. @@ -368,7 +368,7 @@ This section lists a couple of best practices one should be aware of when benchm memory measurement it is recommended to run each memory benchmark in a separate process by making sure `no_multi_processing` is set to `True`. - One should always state the environment information when sharing the results of a model benchmark. Results can vary - heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very + heavily between different GPU devices, library versions, etc., as a consequence, benchmark results on their own are not very useful for the community. diff --git a/docs/source/en/bertology.md b/docs/source/en/bertology.md index ba1b4bd4002..a1b92a362cd 100644 --- a/docs/source/en/bertology.md +++ b/docs/source/en/bertology.md @@ -37,5 +37,5 @@ help people access the inner representations, mainly adapted from the great work - retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650. -To help you understand and use these features, we have added a specific example script: [bertology.py](https://github.com/huggingface/transformers/tree/main/examples/research_projects/bertology/run_bertology.py) while extract information and prune a model pre-trained on +To help you understand and use these features, we have added a specific example script: [bertology.py](https://github.com/huggingface/transformers/tree/main/examples/research_projects/bertology/run_bertology.py) which extracts information and prune a model pre-trained on GLUE. 
diff --git a/docs/source/en/community.md b/docs/source/en/community.md index 3c0429e6812..1b77bee9d2d 100644 --- a/docs/source/en/community.md +++ b/docs/source/en/community.md @@ -63,7 +63,7 @@ This page regroups resources around πŸ€— Transformers developed by the community | [Evaluate LUKE on TACRED, a relation extraction dataset](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | How to evaluate *LukeForEntityPairClassification* on the TACRED dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | | [Evaluate LUKE on CoNLL-2003, an important NER benchmark](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | How to evaluate *LukeForEntitySpanClassification* on the CoNLL-2003 dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | | [Evaluate BigBird-Pegasus on PubMed dataset](https://github.com/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | How to evaluate *BigBirdPegasusForConditionalGeneration* on PubMed dataset | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | -| [Speech Emotion Classification with Wav2Vec2](https://github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | How to leverage a pretrained Wav2Vec2 model for Emotion Classification on the MEGA dataset | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | +| [Speech Emotion Classification with Wav2Vec2](https://github.com/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | How to leverage a pretrained Wav2Vec2 model for Emotion Classification on the MEGA dataset | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | | [Detect objects in an image with DETR](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | How to use a trained *DetrForObjectDetection* model to detect objects in an image and visualize attention | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | | [Fine-tune DETR on a custom object detection dataset](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | How to fine-tune *DetrForObjectDetection* on a custom object detection dataset | [Niels 
Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | | [Finetune T5 for Named Entity Recognition](https://github.com/ToluClassics/Notebooks/blob/main/T5_Ner_Finetuning.ipynb) | How to fine-tune *T5* on a Named Entity Recognition Task | [Ogundepo Odunayo](https://github.com/ToluClassics) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing) | diff --git a/docs/source/en/debugging.md b/docs/source/en/debugging.md index 0f0b1132955..b760c80a3e8 100644 --- a/docs/source/en/debugging.md +++ b/docs/source/en/debugging.md @@ -203,7 +203,7 @@ This feature can be used with any `nn.Module`-based model. -If you start getting `loss=NaN` or the model inhibits some other abnormal behavior due to `inf` or `nan` in +If you start getting `loss=NaN` or the model exhibits some other abnormal behavior due to `inf` or `nan` in activations or weights one needs to discover where the first underflow or overflow happens and what led to it. Luckily you can accomplish that easily by activating a special module that will do the detection automatically. diff --git a/docs/source/en/generation_strategies.md b/docs/source/en/generation_strategies.md index 28662bceadb..06e7e0b8ab3 100644 --- a/docs/source/en/generation_strategies.md +++ b/docs/source/en/generation_strategies.md @@ -456,7 +456,7 @@ just like in multinomial sampling. However, in assisted decoding, reducing the t ['Alice and Bob, a couple of friends of mine, who are both in the same office as'] ``` -Alternativelly, you can also set the `prompt_lookup_num_tokens` to trigger n-gram based assisted decoding, as opposed +Alternatively, you can also set the `prompt_lookup_num_tokens` to trigger n-gram based assisted decoding, as opposed to model based assisted decoding. You can read more about it [here](https://twitter.com/joao_gante/status/1747322413006643259). ### DoLa Decoding diff --git a/docs/source/en/installation.md b/docs/source/en/installation.md index 3ed4edf3d8e..f4ce768c316 100644 --- a/docs/source/en/installation.md +++ b/docs/source/en/installation.md @@ -71,7 +71,7 @@ pip install 'transformers[tf-cpu]' M1 / ARM Users -You will need to install the following before installing TensorFLow 2.0 +You will need to install the following before installing TensorFlow 2.0 ```bash brew install cmake brew install pkg-config diff --git a/docs/source/en/main_classes/logging.md b/docs/source/en/main_classes/logging.md index 6a77001608c..5cbdf9ae27e 100644 --- a/docs/source/en/main_classes/logging.md +++ b/docs/source/en/main_classes/logging.md @@ -30,7 +30,7 @@ transformers.logging.set_verbosity_info() ``` You can also use the environment variable `TRANSFORMERS_VERBOSITY` to override the default verbosity. You can set it -to one of the following: `debug`, `info`, `warning`, `error`, `critical`. For example: +to one of the following: `debug`, `info`, `warning`, `error`, `critical`, `fatal`. For example: ```bash TRANSFORMERS_VERBOSITY=error ./myprogram.py @@ -65,7 +65,7 @@ verbose to the most verbose), those levels (with their corresponding int values critical errors. - `transformers.logging.ERROR` (int value, 40): only report errors. 
- `transformers.logging.WARNING` or `transformers.logging.WARN` (int value, 30): only reports error and - warnings. This the default level used by the library. + warnings. This is the default level used by the library. - `transformers.logging.INFO` (int value, 20): reports error, warnings and basic information. - `transformers.logging.DEBUG` (int value, 10): report all information. @@ -77,10 +77,10 @@ Python has two logging systems that are often used in conjunction: `logging`, wh which allows further classification of warnings in specific buckets, e.g., `FutureWarning` for a feature or path that has already been deprecated and `DeprecationWarning` to indicate an upcoming deprecation. -We use both in the `transformers` library. We leverage and adapt `logging`'s `captureWarning` method to allow +We use both in the `transformers` library. We leverage and adapt `logging`'s `captureWarnings` method to allow management of these warning messages by the verbosity setters above. -What does that mean for developers of the library? We should respect the following heuristic: +What does that mean for developers of the library? We should respect the following heuristics: - `warnings` should be favored for developers of the library and libraries dependent on `transformers` - `logging` should be used for end-users of the library using it in every-day projects diff --git a/docs/source/en/main_classes/optimizer_schedules.md b/docs/source/en/main_classes/optimizer_schedules.md index e75306408f8..9815b430ab0 100644 --- a/docs/source/en/main_classes/optimizer_schedules.md +++ b/docs/source/en/main_classes/optimizer_schedules.md @@ -38,7 +38,7 @@ The `.optimization` module provides: ## Schedules -### Learning Rate Schedules (Pytorch) +### Learning Rate Schedules (PyTorch) [[autodoc]] SchedulerType diff --git a/docs/source/en/main_classes/output.md b/docs/source/en/main_classes/output.md index 3567cf62c44..300213d4513 100644 --- a/docs/source/en/main_classes/output.md +++ b/docs/source/en/main_classes/output.md @@ -42,7 +42,7 @@ an optional `attentions` attribute. Here we have the `loss` since we passed alon -When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_states` exactly. +When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_state` exactly. However, this is not always the case. Some models apply normalization or subsequent process to the last hidden state when it's returned. diff --git a/docs/source/en/main_classes/trainer.md b/docs/source/en/main_classes/trainer.md index 3f33ff1e505..21ba9ed935e 100644 --- a/docs/source/en/main_classes/trainer.md +++ b/docs/source/en/main_classes/trainer.md @@ -18,7 +18,7 @@ rendered properly in your Markdown viewer. The [`Trainer`] class provides an API for feature-complete training in PyTorch, and it supports distributed training on multiple GPUs/TPUs, mixed precision for [NVIDIA GPUs](https://nvidia.github.io/apex/), [AMD GPUs](https://rocm.docs.amd.com/en/latest/rocm.html), and [`torch.amp`](https://pytorch.org/docs/stable/amp.html) for PyTorch. [`Trainer`] goes hand-in-hand with the [`TrainingArguments`] class, which offers a wide range of options to customize how a model is trained. Together, these two classes provide a complete training API. 
-[`Seq2SeqTrainer`] and [`Seq2SeqTrainingArguments`] inherit from the [`Trainer`] and [`TrainingArgument`] classes and they're adapted for training models for sequence-to-sequence tasks such as summarization or translation. +[`Seq2SeqTrainer`] and [`Seq2SeqTrainingArguments`] inherit from the [`Trainer`] and [`TrainingArguments`] classes and they're adapted for training models for sequence-to-sequence tasks such as summarization or translation. diff --git a/docs/source/en/model_doc/camembert.md b/docs/source/en/model_doc/camembert.md index ab06ec100b1..fd872282d58 100644 --- a/docs/source/en/model_doc/camembert.md +++ b/docs/source/en/model_doc/camembert.md @@ -106,7 +106,7 @@ as the information relative to the inputs and outputs. [[autodoc]] TFCamembertModel -## TFCamembertForCasualLM +## TFCamembertForCausalLM [[autodoc]] TFCamembertForCausalLM diff --git a/docs/source/en/model_doc/clipseg.md b/docs/source/en/model_doc/clipseg.md index 320095bc190..005e6746d09 100644 --- a/docs/source/en/model_doc/clipseg.md +++ b/docs/source/en/model_doc/clipseg.md @@ -19,7 +19,7 @@ rendered properly in your Markdown viewer. ## Overview The CLIPSeg model was proposed in [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo LΓΌddecke -and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen [CLIP](clip) model for zero- and one-shot image segmentation. +and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen [CLIP](clip) model for zero-shot and one-shot image segmentation. The abstract from the paper is the following: diff --git a/docs/source/en/model_doc/code_llama.md b/docs/source/en/model_doc/code_llama.md index a0e7f6366bb..6eb687a728a 100644 --- a/docs/source/en/model_doc/code_llama.md +++ b/docs/source/en/model_doc/code_llama.md @@ -34,7 +34,7 @@ This model was contributed by [ArthurZucker](https://huggingface.co/ArthurZ). Th The `Llama2` family models, on which Code Llama is based, were trained using `bfloat16`, but the original inference uses `float16`. Let's look at the different precisions: -* `float32`: PyTorch convention on model initialization is to load models in `float32`, no matter with which `dtype` the model weights were stored. `transformers` also follows this convention for consistency with PyTorch. This will be picked by default. If you want the `AutoModel` API to cast the load the checkpoints with the storage weights type, you must specify `torch_dtype="auto"`, e.g. `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. +* `float32`: PyTorch convention on model initialization is to load models in `float32`, no matter with which `dtype` the model weights were stored. `transformers` also follows this convention for consistency with PyTorch. This will be picked by default. If you want the `AutoModel` API to load the checkpoints with the storage weights type, you must specify `torch_dtype="auto"`, e.g. `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. * `bfloat16`: Code Llama was trained with this precision, so we recommend using it for further training or fine-tuning. * `float16`: We recommend running inference using this precision, as it's usually faster than `bfloat16`, and evaluation metrics show no discernible degradation with respect to `bfloat16`. You can also run inference using `bfloat16`, and we recommend you check inference results with both `float16` and `bfloat16` after fine-tuning. 
diff --git a/docs/source/en/model_doc/falcon_mamba.md b/docs/source/en/model_doc/falcon_mamba.md index b6330415af1..cbec6378cc1 100644 --- a/docs/source/en/model_doc/falcon_mamba.md +++ b/docs/source/en/model_doc/falcon_mamba.md @@ -27,7 +27,7 @@ Due to its architecture, FalconMamba is significantly faster at inference and re Tips: -- FalconMamba is mostly based on Mamba architecutre, the same [tips and best practices](./mamba) would be relevant here. +- FalconMamba is mostly based on Mamba architecture, the same [tips and best practices](./mamba) would be relevant here. The model has been trained on approximtely 6T tokens consisting a mixture of many data sources such as RefineWeb, Cosmopedia and Math data. diff --git a/docs/source/en/model_doc/hiera.md b/docs/source/en/model_doc/hiera.md index 9bd2816e7a9..c63c892c7c7 100644 --- a/docs/source/en/model_doc/hiera.md +++ b/docs/source/en/model_doc/hiera.md @@ -31,7 +31,7 @@ alt="drawing" width="600"/> Hiera architecture. Taken from the original paper. -This model was a joint contibution by [EduardoPacheco](https://huggingface.co/EduardoPacheco) and [namangarg110](https://huggingface.co/namangarg110). The original code can be found [here] (https://github.com/facebookresearch/hiera). +This model was a joint contribution by [EduardoPacheco](https://huggingface.co/EduardoPacheco) and [namangarg110](https://huggingface.co/namangarg110). The original code can be found [here] (https://github.com/facebookresearch/hiera). ## Resources diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index d8de36771da..c3f66c1825f 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -33,7 +33,7 @@ alt="drawing" width="600"/> ## Usage -### Presequities +### Prerequisites Jamba requires you use `transformers` version 4.39.0 or higher: ```bash diff --git a/docs/source/en/model_doc/matcha.md b/docs/source/en/model_doc/matcha.md index d4ee3305936..d26b88b16fa 100644 --- a/docs/source/en/model_doc/matcha.md +++ b/docs/source/en/model_doc/matcha.md @@ -61,7 +61,7 @@ print(processor.decode(predictions[0], skip_special_tokens=True)) ## Fine-tuning -To fine-tune MatCha, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faste convergence: +To fine-tune MatCha, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faster convergence: ```python from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup diff --git a/docs/source/en/model_doc/mbart.md b/docs/source/en/model_doc/mbart.md index e7fc0bd53ef..ca529e957e2 100644 --- a/docs/source/en/model_doc/mbart.md +++ b/docs/source/en/model_doc/mbart.md @@ -83,7 +83,7 @@ keyword, and target text format passed with the `text_label` keyword argument. ## Overview of MBart-50 MBart-50 was introduced in the [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) paper by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav -Chaudhary, Jiatao Gu, Angela Fan. 
MBart-50 is created using the original *mbart-large-cc25* checkpoint by extendeding +Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original *mbart-large-cc25* checkpoint by extending its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50 languages. diff --git a/docs/source/en/model_doc/mixtral.md b/docs/source/en/model_doc/mixtral.md index b93acdec581..26eff8ec21a 100644 --- a/docs/source/en/model_doc/mixtral.md +++ b/docs/source/en/model_doc/mixtral.md @@ -31,7 +31,7 @@ Mixtral-8x7B is the second large language model (LLM) released by [mistral.ai](h Mixtral-8x7B is a decoder-only Transformer with the following architectural choices: - Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP, with a total of 45 billion parameters. To learn more about mixture-of-experts, refer to the [blog post](https://huggingface.co/blog/moe). -- Despite the model having 45 billion parameters,, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length. +- Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length. The following implementation details are shared with Mistral AI's first model [Mistral-7B](mistral): - Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens diff --git a/docs/source/en/model_doc/mms.md b/docs/source/en/model_doc/mms.md index dc453248eef..7102b889664 100644 --- a/docs/source/en/model_doc/mms.md +++ b/docs/source/en/model_doc/mms.md @@ -242,7 +242,7 @@ export UROMAN=$(pwd) ``` You can then pre-process the text input using the following code snippet. You can either rely on using the bash variable -`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromaize` function: +`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromanize` function: ```python import torch @@ -270,9 +270,9 @@ def uromanize(input_string, uroman_path): return stdout.decode()[:-1] text = "이봐 무슨 일이야" -uromaized_text = uromanize(text, uroman_path=os.environ["UROMAN"]) +uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"]) -inputs = tokenizer(text=uromaized_text, return_tensors="pt") +inputs = tokenizer(text=uromanized_text, return_tensors="pt") set_seed(555) # make deterministic with torch.no_grad(): diff --git a/docs/source/en/model_doc/mpt.md b/docs/source/en/model_doc/mpt.md index f7e6fcc1438..113b42573f4 100644 --- a/docs/source/en/model_doc/mpt.md +++ b/docs/source/en/model_doc/mpt.md @@ -18,7 +18,7 @@ rendered properly in your Markdown viewer. ## Overview -The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and finetuned variants. 
The MPT models is a series of open source and commercially usable LLMs pre-trained on 1T tokens. +The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and finetuned variants. The MPT models are a series of open source and commercially usable LLMs pre-trained on 1T tokens. MPT models are GPT-style decoder-only transformers with several improvements: performance-optimized layer implementations, architecture changes that provide greater training stability, and the elimination of context length limits by replacing positional embeddings with ALiBi. diff --git a/docs/source/en/model_doc/oneformer.md b/docs/source/en/model_doc/oneformer.md index 97a6aa64f54..0132a600ccc 100644 --- a/docs/source/en/model_doc/oneformer.md +++ b/docs/source/en/model_doc/oneformer.md @@ -39,7 +39,7 @@ This model was contributed by [Jitesh Jain](https://huggingface.co/praeclarumjj3 - If you want to train the model in a distributed environment across multiple nodes, then one should update the `get_num_masks` function inside in the `OneFormerLoss` class of `modeling_oneformer.py`. When training on multiple nodes, this should be set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/SHI-Labs/OneFormer/blob/33ebb56ed34f970a30ae103e786c0cb64c653d9a/oneformer/modeling/criterion.py#L287). -- One can use [`OneFormerProcessor`] to prepare input images and task inputs for the model and optional targets for the model. [`OneformerProcessor`] wraps [`OneFormerImageProcessor`] and [`CLIPTokenizer`] into a single instance to both prepare the images and encode the task inputs. +- One can use [`OneFormerProcessor`] to prepare input images and task inputs for the model and optional targets for the model. [`OneFormerProcessor`] wraps [`OneFormerImageProcessor`] and [`CLIPTokenizer`] into a single instance to both prepare the images and encode the task inputs. - To get the final segmentation, depending on the task, you can call [`~OneFormerProcessor.post_process_semantic_segmentation`] or [`~OneFormerImageProcessor.post_process_instance_segmentation`] or [`~OneFormerImageProcessor.post_process_panoptic_segmentation`]. All three tasks can be solved using [`OneFormerForUniversalSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together. ## Resources diff --git a/docs/source/en/model_doc/openai-gpt.md b/docs/source/en/model_doc/openai-gpt.md index 1fbfbbcd89e..09277858aa3 100644 --- a/docs/source/en/model_doc/openai-gpt.md +++ b/docs/source/en/model_doc/openai-gpt.md @@ -29,7 +29,7 @@ rendered properly in your Markdown viewer. OpenAI GPT model was proposed in [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer -pre-trained using language modeling on a large corpus will long range dependencies, the Toronto Book Corpus. +pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus. 
The abstract from the paper is the following: diff --git a/docs/source/en/model_doc/phobert.md b/docs/source/en/model_doc/phobert.md index 30a50275476..adf5900ebe2 100644 --- a/docs/source/en/model_doc/phobert.md +++ b/docs/source/en/model_doc/phobert.md @@ -54,7 +54,7 @@ This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The o -PhoBERT implementation is the same as BERT, except for tokenization. Refer to [EART documentation](bert) for information on +PhoBERT implementation is the same as BERT, except for tokenization. Refer to [BERT documentation](bert) for information on configuration classes and their parameters. PhoBERT-specific tokenizer is documented below. diff --git a/docs/source/en/model_doc/seggpt.md b/docs/source/en/model_doc/seggpt.md index 5a68d38fc98..b53f5d6ca15 100644 --- a/docs/source/en/model_doc/seggpt.md +++ b/docs/source/en/model_doc/seggpt.md @@ -27,7 +27,7 @@ The abstract from the paper is the following: Tips: - One can use [`SegGptImageProcessor`] to prepare image input, prompt and mask to the model. - One can either use segmentation maps or RGB images as prompt masks. If using the latter make sure to set `do_convert_rgb=False` in the `preprocess` method. -- It's highly advisable to pass `num_labels` when using `segmetantion_maps` (not considering background) during preprocessing and postprocessing with [`SegGptImageProcessor`] for your use case. +- It's highly advisable to pass `num_labels` when using `segmentation_maps` (not considering background) during preprocessing and postprocessing with [`SegGptImageProcessor`] for your use case. - When doing inference with [`SegGptForImageSegmentation`] if your `batch_size` is greater than 1 you can use feature ensemble across your images by passing `feature_ensemble=True` in the forward method. Here's how to use the model for one-shot semantic segmentation: diff --git a/docs/source/en/model_doc/swin2sr.md b/docs/source/en/model_doc/swin2sr.md index dfee144e50c..18d6635feff 100644 --- a/docs/source/en/model_doc/swin2sr.md +++ b/docs/source/en/model_doc/swin2sr.md @@ -19,7 +19,7 @@ rendered properly in your Markdown viewer. ## Overview The Swin2SR model was proposed in [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. -Swin2R improves the [SwinIR](https://github.com/JingyunLiang/SwinIR/) model by incorporating [Swin Transformer v2](swinv2) layers which mitigates issues such as training instability, resolution gaps between pre-training +Swin2SR improves the [SwinIR](https://github.com/JingyunLiang/SwinIR/) model by incorporating [Swin Transformer v2](swinv2) layers which mitigates issues such as training instability, resolution gaps between pre-training and fine-tuning, and hunger on data. The abstract from the paper is the following: diff --git a/docs/source/en/model_doc/vits.md b/docs/source/en/model_doc/vits.md index 7a29586fc0b..42997cae1e7 100644 --- a/docs/source/en/model_doc/vits.md +++ b/docs/source/en/model_doc/vits.md @@ -127,7 +127,7 @@ export UROMAN=$(pwd) ``` You can then pre-process the text input using the following code snippet. 
You can either rely on using the bash variable -`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromaize` function: +`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromanize` function: ```python import torch @@ -155,9 +155,9 @@ def uromanize(input_string, uroman_path): return stdout.decode()[:-1] text = "이봐 무슨 일이야" -uromaized_text = uromanize(text, uroman_path=os.environ["UROMAN"]) +uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"]) -inputs = tokenizer(text=uromaized_text, return_tensors="pt") +inputs = tokenizer(text=uromanized_text, return_tensors="pt") set_seed(555) # make deterministic with torch.no_grad(): diff --git a/docs/source/en/model_doc/xlm-roberta.md b/docs/source/en/model_doc/xlm-roberta.md index 58540015232..414afba1168 100644 --- a/docs/source/en/model_doc/xlm-roberta.md +++ b/docs/source/en/model_doc/xlm-roberta.md @@ -43,7 +43,7 @@ low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% fo also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing -per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We +per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.* This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr). diff --git a/docs/source/en/model_doc/xlnet.md b/docs/source/en/model_doc/xlnet.md index d2209c3d550..90b454e8af3 100644 --- a/docs/source/en/model_doc/xlnet.md +++ b/docs/source/en/model_doc/xlnet.md @@ -166,7 +166,7 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o [[autodoc]] TFXLNetForSequenceClassification - call -## TFLNetForMultipleChoice +## TFXLNetForMultipleChoice [[autodoc]] TFXLNetForMultipleChoice - call diff --git a/docs/source/en/model_sharing.md b/docs/source/en/model_sharing.md index 71ddd1e588e..ec5802cfee3 100644 --- a/docs/source/en/model_sharing.md +++ b/docs/source/en/model_sharing.md @@ -47,7 +47,7 @@ As a result, you can load a specific model version with the `revision` parameter ... ) ``` -Files are also easily edited in a repository, and you can view the commit history as well as the difference: +Files are also easily edited in a repository, and you can view the commit history as well as the differences: ![vis_diff](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vis_diff.png) diff --git a/docs/source/en/pad_truncation.md b/docs/source/en/pad_truncation.md index cc623bca48a..345f86283d1 100644 --- a/docs/source/en/pad_truncation.md +++ b/docs/source/en/pad_truncation.md @@ -18,7 +18,7 @@ rendered properly in your Markdown viewer. Batched inputs are often different lengths, so they can't be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. 
Padding adds a special **padding token** to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences. -In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well. However, the API supports more strategies if you need them. The three arguments you need to are: `padding`, `truncation` and `max_length`. +In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well. However, the API supports more strategies if you need them. The three arguments you need to know are: `padding`, `truncation` and `max_length`. The `padding` argument controls padding. It can be a boolean or a string: diff --git a/docs/source/en/peft.md b/docs/source/en/peft.md index 01fe2fad87b..e1777114dbc 100644 --- a/docs/source/en/peft.md +++ b/docs/source/en/peft.md @@ -46,7 +46,7 @@ pip install git+https://github.com/huggingface/peft.git - [IA3](https://huggingface.co/docs/peft/conceptual_guides/ia3) - [AdaLoRA](https://arxiv.org/abs/2303.10512) -If you want to use other PEFT methods, such as prompt learning or prompt tuning, or about the πŸ€— PEFT library in general, please refer to the [documentation](https://huggingface.co/docs/peft/index). +If you want to use other PEFT methods, such as prompt learning or prompt tuning, or learn about the πŸ€— PEFT library in general, please refer to the [documentation](https://huggingface.co/docs/peft/index). ## Load a PEFT adapter @@ -125,7 +125,7 @@ Now you can use [`~peft.PeftModel.set_adapter`] to set which adapter to use: ```py # use adapter_1 model.set_adapter("adapter_1") -output = model.generate(**inputs) +output_disabled = model.generate(**inputs) print(tokenizer.decode(output_disabled[0], skip_special_tokens=True)) # use adapter_2 diff --git a/docs/source/en/perf_hardware.md b/docs/source/en/perf_hardware.md index eb09ab439b4..260fe5b71cc 100644 --- a/docs/source/en/perf_hardware.md +++ b/docs/source/en/perf_hardware.md @@ -116,7 +116,7 @@ Each new generation provides a faster bandwidth, e.g. here is a quote from [Nvid So the higher `X` you get in the report of `NVX` in the output of `nvidia-smi topo -m` the better. The generation will depend on your GPU architecture. -Let's compare the execution of an openai-community/gpt2 language model training over a small sample of wikitext. +Let's compare the execution of an `openai-community/gpt2` language model training over a small sample of wikitext. The results are: diff --git a/docs/source/en/perf_train_gpu_one.md b/docs/source/en/perf_train_gpu_one.md index 5a72bba768d..c90f2ca5848 100644 --- a/docs/source/en/perf_train_gpu_one.md +++ b/docs/source/en/perf_train_gpu_one.md @@ -395,7 +395,7 @@ Choose which backend to use by specifying it via `torch_compile_backend` in the * `dynamo.optimize("aot_cudagraphs")` - cudagraphs with AotAutograd. [Read more](https://github.com/pytorch/torchdynamo/pull/757) **Inference-only backend**s: -* `dynamo.optimize("ofi")` - Uses Torchscript optimize_for_inference. [Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html) +* `dynamo.optimize("ofi")` - Uses TorchScript optimize_for_inference. 
[Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html) * `dynamo.optimize("fx2trt")` - Uses NVIDIA TensorRT for inference optimizations. [Read more](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html) * `dynamo.optimize("onnxrt")` - Uses ONNXRT for inference on CPU/GPU. [Read more](https://onnxruntime.ai/) * `dynamo.optimize("ipex")` - Uses IPEX for inference on CPU. [Read more](https://github.com/intel/intel-extension-for-pytorch) @@ -413,7 +413,7 @@ For example with a vanilla AdamW, the memory requirement for the optimizer state * Momentum: 4 bytes/param * Variance: 4 bytes/param -Suppose a model with 7B parameters and 200 millions parameters injected with [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora). +Suppose a model with 7B parameters and 200 million parameters injected with [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora). The memory requirement for the optimizer state of the plain model would be 12 * 7 = 84 GB (assuming 7B trainable parameters). diff --git a/docs/source/en/perf_train_tpu_tf.md b/docs/source/en/perf_train_tpu_tf.md index 011421b629c..1897c1ad745 100644 --- a/docs/source/en/perf_train_tpu_tf.md +++ b/docs/source/en/perf_train_tpu_tf.md @@ -158,5 +158,5 @@ There was a lot in here, so let’s summarize with a quick checklist you can fol - Create your `TPUStrategy` and make sure dataset loading and model creation are inside the `strategy.scope()` (see [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)) - Don’t forget to take `jit_compile=True` out again when you move to TPU! - πŸ™πŸ™πŸ™πŸ₯ΊπŸ₯ΊπŸ₯Ί -- Call model.fit() +- Call `model.fit()` - You did it! \ No newline at end of file diff --git a/docs/source/en/performance.md b/docs/source/en/performance.md index ccd78d326d5..94e756cf33a 100644 --- a/docs/source/en/performance.md +++ b/docs/source/en/performance.md @@ -24,7 +24,7 @@ Training large transformer models and deploying them to production present vario During training, the model may require more GPU memory than available or exhibit slow training speed. In the deployment phase, the model can struggle to handle the required throughput in a production environment. -This documentation aims to assist you in overcoming these challenges and finding the optimal setting for your use-case. +This documentation aims to assist you in overcoming these challenges and finding the optimal settings for your use-case. The guides are divided into training and inference sections, as each comes with different challenges and solutions. Within each section you'll find separate guides for different hardware configurations, such as single GPU vs. multi-GPU for training or CPU vs. GPU for inference. diff --git a/docs/source/en/pr_checks.md b/docs/source/en/pr_checks.md index 266cc1ca68d..efddf3a5b16 100644 --- a/docs/source/en/pr_checks.md +++ b/docs/source/en/pr_checks.md @@ -166,7 +166,7 @@ Note that instead of applying this to a whole class, you can apply it to the rel # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights ``` -Sometimes the copy is exactly the same except for names: for instance in `RobertaAttention`, we use `RobertaSelfAttention` insted of `BertSelfAttention` but other than that, the code is exactly the same. This is why `# Copied from` supports simple string replacements with the following syntax: `Copied from xxx with foo->bar`. 
This means the code is copied with all instances of `foo` being replaced by `bar`. You can see how it used [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86) in `RobertaAttention` with the comment: +Sometimes the copy is exactly the same except for names: for instance in `RobertaAttention`, we use `RobertaSelfAttention` instead of `BertSelfAttention` but other than that, the code is exactly the same. This is why `# Copied from` supports simple string replacements with the following syntax: `Copied from xxx with foo->bar`. This means the code is copied with all instances of `foo` being replaced by `bar`. You can see how it used [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86) in `RobertaAttention` with the comment: ```py # Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta diff --git a/docs/source/en/preprocessing.md b/docs/source/en/preprocessing.md index 1710fd5ecb8..1a6f071a335 100644 --- a/docs/source/en/preprocessing.md +++ b/docs/source/en/preprocessing.md @@ -18,7 +18,7 @@ rendered properly in your Markdown viewer. [[open-in-colab]] -Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, they need to be converted and assembled into batches of tensors. πŸ€— Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for: +Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. πŸ€— Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for: * Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors. * Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors. diff --git a/docs/source/en/quantization/aqlm.md b/docs/source/en/quantization/aqlm.md index d18f20e0c14..2e00d94cfcf 100644 --- a/docs/source/en/quantization/aqlm.md +++ b/docs/source/en/quantization/aqlm.md @@ -19,7 +19,7 @@ rendered properly in your Markdown viewer. > [!TIP] > Try AQLM on [Google Colab](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing)! -Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and take advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. +Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. Inference support for AQLM is realised in the `aqlm` library. 
Make sure to install it to run the models (note aqlm works only with python>=3.10): ```bash diff --git a/docs/source/en/quantization/bitsandbytes.md b/docs/source/en/quantization/bitsandbytes.md index 1d4b4b6013f..334b6145e53 100644 --- a/docs/source/en/quantization/bitsandbytes.md +++ b/docs/source/en/quantization/bitsandbytes.md @@ -274,7 +274,7 @@ For inference, the `bnb_4bit_quant_type` does not have a huge impact on performa ### Nested quantization -Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an addition 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps. +Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps. ```py from transformers import BitsAndBytesConfig diff --git a/docs/source/en/quantization/eetq.md b/docs/source/en/quantization/eetq.md index b12ea942654..bf2c4e0e646 100644 --- a/docs/source/en/quantization/eetq.md +++ b/docs/source/en/quantization/eetq.md @@ -18,7 +18,7 @@ rendered properly in your Markdown viewer. The [EETQ](https://github.com/NetEase-FuXi/EETQ) library supports int8 per-channel weight-only quantization for NVIDIA GPUS. The high-performance GEMM and GEMV kernels are from FasterTransformer and TensorRT-LLM. It requires no calibration dataset and does not need to pre-quantize your model. Moreover, the accuracy degradation is negligible owing to the per-channel quantization. -Make sure you have eetq installed from the [relase page](https://github.com/NetEase-FuXi/EETQ/releases) +Make sure you have eetq installed from the [release page](https://github.com/NetEase-FuXi/EETQ/releases) ``` pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl ``` diff --git a/docs/source/en/quantization/fbgemm_fp8.md b/docs/source/en/quantization/fbgemm_fp8.md index 4df194d31be..ff9e18f823c 100644 --- a/docs/source/en/quantization/fbgemm_fp8.md +++ b/docs/source/en/quantization/fbgemm_fp8.md @@ -31,7 +31,7 @@ Before you begin, make sure the following libraries are installed with their lat pip install --upgrade accelerate fbgemm-gpu torch ``` -If you are having issues with fbgemm-gpu and torch library, you might need to install the nighlty release. You can follow the instruction [here](https://pytorch.org/FBGEMM/fbgemm_gpu-development/InstallationInstructions.html#fbgemm-gpu-install-libraries:~:text=found%20here.-,Install%20the%20FBGEMM_GPU%20Package,-Install%20through%20PyTorch) +If you are having issues with fbgemm-gpu and torch library, you might need to install the nightly release. 
You can follow the instruction [here](https://pytorch.org/FBGEMM/fbgemm_gpu-development/InstallationInstructions.html#fbgemm-gpu-install-libraries:~:text=found%20here.-,Install%20the%20FBGEMM_GPU%20Package,-Install%20through%20PyTorch) ```py diff --git a/docs/source/en/quantization/hqq.md b/docs/source/en/quantization/hqq.md index 4c834209060..11489808aec 100644 --- a/docs/source/en/quantization/hqq.md +++ b/docs/source/en/quantization/hqq.md @@ -64,6 +64,6 @@ model = transformers.AutoModelForCausalLM.from_pretrained( ## Optimized Runtime -HQQ supports various backends, including pure Pytorch and custom dequantization CUDA kernels. These backends are suitable for older gpus and peft/QLoRA training. +HQQ supports various backends, including pure PyTorch and custom dequantization CUDA kernels. These backends are suitable for older gpus and peft/QLoRA training. For faster inference, HQQ supports 4-bit fused kernels (TorchAO and Marlin), reaching up to 200 tokens/sec on a single 4090. For more details on how to use the backends, please refer to https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend diff --git a/docs/source/en/quantization/quanto.md b/docs/source/en/quantization/quanto.md index d8dee26279b..18135b2ec2f 100644 --- a/docs/source/en/quantization/quanto.md +++ b/docs/source/en/quantization/quanto.md @@ -55,7 +55,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cud Note that serialization is not supported yet with transformers but it is coming soon! If you want to save the model, you can use quanto library instead. -Quanto library uses linear quantization algorithm for quantization. Even though this is a basic quantization technique, we get very good results! Have a look at the following becnhmark (llama-2-7b on perplexity metric). You can find more benchamarks [here](https://github.com/huggingface/quanto/tree/main/bench/generation) +Quanto library uses linear quantization algorithm for quantization. Even though this is a basic quantization technique, we get very good results! Have a look at the following benchmark (llama-2-7b on perplexity metric). You can find more benchmarks [here](https://github.com/huggingface/quanto/tree/main/bench/generation)
@@ -63,4 +63,4 @@ Quanto library uses linear quantization algorithm for quantization. Even though
-The library is versatible enough to be compatible with most PTQ optimization algorithms. The plan in the future is to integrate the most popular algorithms in the most seamless possible way (AWQ, Smoothquant).
\ No newline at end of file
+The library is versatile enough to be compatible with most PTQ optimization algorithms. The plan in the future is to integrate the most popular algorithms in the most seamless possible way (AWQ, Smoothquant).
\ No newline at end of file
diff --git a/docs/source/en/quantization/torchao.md b/docs/source/en/quantization/torchao.md
index c9733497aa9..99ad60a9233 100644
--- a/docs/source/en/quantization/torchao.md
+++ b/docs/source/en/quantization/torchao.md
@@ -33,7 +33,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name)
 input_text = "What are we having for dinner?"
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-# compile the quantizd model to get speedup
+# compile the quantized model to get speedup
 import torchao
 torchao.quantization.utils.recommended_inductor_config_setter()
 quantized_model = torch.compile(quantized_model, mode="max-autotune")
diff --git a/docs/source/en/sagemaker.md b/docs/source/en/sagemaker.md
index 579caa499c2..41802d9d42b 100644
--- a/docs/source/en/sagemaker.md
+++ b/docs/source/en/sagemaker.md
@@ -22,7 +22,7 @@ rendered properly in your Markdown viewer.
 The documentation has been moved to [hf.co/docs/sagemaker](https://huggingface.co/docs/sagemaker). This page will be removed in `transformers` 5.0.
-### Table of Content
+### Table of Contents
 - [Train Hugging Face models on Amazon SageMaker with the SageMaker Python SDK](https://huggingface.co/docs/sagemaker/train)
 - [Deploy Hugging Face models to Amazon SageMaker with the SageMaker Python SDK](https://huggingface.co/docs/sagemaker/inference)
diff --git a/docs/source/en/serialization.md b/docs/source/en/serialization.md
index 5995d9042de..eacda34f711 100644
--- a/docs/source/en/serialization.md
+++ b/docs/source/en/serialization.md
@@ -153,11 +153,11 @@ directly.
-`tranformers.onnx` is no longer maintained, please export models with πŸ€— Optimum as described above. This section will be removed in the future versions.
+`transformers.onnx` is no longer maintained, please export models with πŸ€— Optimum as described above. This section will be removed in the future versions.
-To export a πŸ€— Transformers model to ONNX with `tranformers.onnx`, install extra dependencies:
+To export a πŸ€— Transformers model to ONNX with `transformers.onnx`, install extra dependencies:
 ```bash
 pip install transformers[onnx]
diff --git a/docs/source/en/tasks/asr.md b/docs/source/en/tasks/asr.md
index 3222f70c4d2..2ddd972c3d2 100644
--- a/docs/source/en/tasks/asr.md
+++ b/docs/source/en/tasks/asr.md
@@ -196,7 +196,7 @@ Now instantiate your `DataCollatorForCTCWithPadding`:
 ## Evaluate
-Including a metric during training is often helpful for evaluating your model's performance. You can quickly load a evaluation method with the πŸ€— [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [word error rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) metric (see the πŸ€— Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):
+Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the πŸ€— [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [word error rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) metric (see the πŸ€— Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):
 ```py
 >>> import evaluate
diff --git a/docs/source/en/tasks/audio_classification.md b/docs/source/en/tasks/audio_classification.md
index c50107e44f1..4610e86d6a2 100644
--- a/docs/source/en/tasks/audio_classification.md
+++ b/docs/source/en/tasks/audio_classification.md
@@ -164,7 +164,7 @@ To apply the preprocessing function over the entire dataset, use πŸ€— Datasets [
 ## Evaluate
-Including a metric during training is often helpful for evaluating your model's performance. You can quickly load a evaluation method with the πŸ€— [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the πŸ€— Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):
+Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the πŸ€— [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the πŸ€— Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):
 ```py
 >>> import evaluate
diff --git a/docs/source/en/tasks/image_text_to_text.md b/docs/source/en/tasks/image_text_to_text.md
index e6670ae226c..74f6a3408bc 100644
--- a/docs/source/en/tasks/image_text_to_text.md
+++ b/docs/source/en/tasks/image_text_to_text.md
@@ -204,7 +204,7 @@ for value in generator:
 ## Fit models in smaller hardware
-VLMs are often large and need to be optimized to fit in smaller hardware. Transformers supports many model quantization libraries, and here we will only show int8 quantization with [Quanto](./quantization/quanto#quanto). int8 quantization offers memory improvements up to 75 percent (if all weights are quantized). However it is no free lunch, since 8-bit is not a CUDA-native precision, the weights are quantized back and forth on the fly, which adds up to latency.
+VLMs are often large and need to be optimized to fit on smaller hardware. Transformers supports many model quantization libraries, and here we will only show int8 quantization with [Quanto](./quantization/quanto#quanto). int8 quantization offers memory improvements up to 75 percent (if all weights are quantized). However it is no free lunch, since 8-bit is not a CUDA-native precision, the weights are quantized back and forth on the fly, which adds up to latency.
 First, install dependencies.
diff --git a/docs/source/en/tasks/image_to_image.md b/docs/source/en/tasks/image_to_image.md
index 6a11b515947..0bb74b36980 100644
--- a/docs/source/en/tasks/image_to_image.md
+++ b/docs/source/en/tasks/image_to_image.md
@@ -36,6 +36,7 @@ We can now initialize the pipeline with a [Swin2SR model](https://huggingface.co
 ```python
 from transformers import pipeline
+import torch
 device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
 pipe = pipeline(task="image-to-image", model="caidas/swin2SR-lightweight-x2-64", device=device)
diff --git a/docs/source/en/tasks/monocular_depth_estimation.md b/docs/source/en/tasks/monocular_depth_estimation.md
index 1485cdadb48..e28bc86bc5d 100644
--- a/docs/source/en/tasks/monocular_depth_estimation.md
+++ b/docs/source/en/tasks/monocular_depth_estimation.md
@@ -159,7 +159,7 @@ def colorize(value, vmin=None, vmax=None, cmap='gray_r', invalid_val=-99, invali
     """Converts a depth map to a color image.
     Args:
-        value (torch.Tensor, numpy.ndarry): Input depth map. Shape: (H, W) or (1, H, W) or (1, 1, H, W). All singular dimensions are squeezed
+        value (torch.Tensor, numpy.ndarray): Input depth map. Shape: (H, W) or (1, H, W) or (1, 1, H, W). All singular dimensions are squeezed
         vmin (float, optional): vmin-valued entries are mapped to start color of cmap. If None, value.min() is used. Defaults to None.
         vmax (float, optional): vmax-valued entries are mapped to end color of cmap. If None, value.max() is used. Defaults to None.
         cmap (str, optional): matplotlib colormap to use. Defaults to 'magma_r'.
diff --git a/docs/source/en/tasks/prompting.md b/docs/source/en/tasks/prompting.md
index 9100d48396b..4e30fb1e0ee 100644
--- a/docs/source/en/tasks/prompting.md
+++ b/docs/source/en/tasks/prompting.md
@@ -290,7 +290,7 @@ Result: Modern tools often used to make gazpacho include
 #### Reasoning
 Reasoning is one of the most difficult tasks for LLMs, and achieving good results often requires applying advanced prompting techniques, like
-[Chain-of-though](#chain-of-thought).
+[Chain-of-thought](#chain-of-thought).
 Let's try if we can make a model reason about a simple arithmetics task with a basic prompt:
diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index 35a7d6e29b6..ad8c43a28e8 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -580,7 +580,7 @@ Load the model from the πŸ€— Hub:
 >>> model = SpeechT5ForTextToSpeech.from_pretrained("YOUR_ACCOUNT/speecht5_finetuned_voxpopuli_nl")
 ```
-Pick an example from the test dataset obtain a speaker embedding.
+Pick an example from the test dataset to obtain a speaker embedding.
 ```py
 >>> example = dataset["test"][304]
diff --git a/docs/source/en/tasks/video_classification.md b/docs/source/en/tasks/video_classification.md
index f5519489640..15b3b7a969e 100644
--- a/docs/source/en/tasks/video_classification.md
+++ b/docs/source/en/tasks/video_classification.md
@@ -191,7 +191,7 @@ You should probably TRAIN this model on a down-stream task to be able to use it
 The warning is telling us we are throwing away some weights (e.g. the weights and bias of the `classifier` layer) and randomly initializing some others (the weights and bias of a new `classifier` layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.
-**Note** that [this checkpoint](https://huggingface.co/MCG-NJU/videomae-base-finetuned-kinetics) leads to better performance on this task as the checkpoint was obtained fine-tuning on a similar downstream task having considerable domain overlap. You can check out [this checkpoint](https://huggingface.co/sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset) which was obtained by fine-tuning `MCG-NJU/videomae-base-finetuned-kinetics`.
+**Note** that [this checkpoint](https://huggingface.co/MCG-NJU/videomae-base-finetuned-kinetics) leads to better performance on this task as the checkpoint was obtained by fine-tuning on a similar downstream task having considerable domain overlap. You can check out [this checkpoint](https://huggingface.co/sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset) which was obtained by fine-tuning `MCG-NJU/videomae-base-finetuned-kinetics`.
 ## Prepare the datasets for training
diff --git a/docs/source/en/tasks_explained.md b/docs/source/en/tasks_explained.md
index f860377c7c9..7c836f70cfc 100644
--- a/docs/source/en/tasks_explained.md
+++ b/docs/source/en/tasks_explained.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
 # How πŸ€— Transformers solve tasks
-In [What πŸ€— Transformers can do](task_summary), you learned about natural language processing (NLP), speech and audio, computer vision tasks, and some important applications of them. This page will look closely at how models solve these tasks and explain what's happening under the hood. There are many ways to solve a given task, some models may implement certain techniques or even approach the task from a new angle, but for Transformer models, the general idea is the same. Owing to its flexible architecture, most models are a variant of an encoder, decoder, or encoder-decoder structure. In addition to Transformer models, our library also has several convolutional neural networks (CNNs), which are still used today for computer vision tasks. We'll also explain how a modern CNN works.
+In [What πŸ€— Transformers can do](task_summary), you learned about natural language processing (NLP), speech and audio, computer vision tasks, and some important applications of them. This page will look closely at how models solve these tasks and explain what's happening under the hood. There are many ways to solve a given task, some models may implement certain techniques or even approach the task from a new angle, but for Transformer models, the general idea is the same. Owing to its flexible architecture, most models are a variant of an encoder, a decoder, or an encoder-decoder structure. In addition to Transformer models, our library also has several convolutional neural networks (CNNs), which are still used today for computer vision tasks. We'll also explain how a modern CNN works.
 To explain how tasks are solved, we'll walk through what goes on inside the model to output useful predictions.
diff --git a/docs/source/en/testing.md b/docs/source/en/testing.md
index 701d7220134..1da8a62456e 100644
--- a/docs/source/en/testing.md
+++ b/docs/source/en/testing.md
@@ -1226,6 +1226,8 @@ import numpy as np
 np.random.seed(seed)
 # tf RNG
+import tensorflow as tf
+
 tf.random.set_seed(seed)
 ```
diff --git a/docs/source/en/torchscript.md b/docs/source/en/torchscript.md
index 171e337ca7f..b62e23468f8 100644
--- a/docs/source/en/torchscript.md
+++ b/docs/source/en/torchscript.md
@@ -219,7 +219,7 @@ You only need to modify the following line:
 ```diff
 - torch.jit.trace(model, [tokens_tensor, segments_tensors])
-+ torch.neuron.trace(model, [token_tensor, segments_tensors])
++ torch.neuron.trace(model, [tokens_tensor, segments_tensors])
 ```
 This enables the Neuron SDK to trace the model and optimize it for Inf1 instances.
diff --git a/docs/source/en/trainer.md b/docs/source/en/trainer.md
index f51833bfb6d..812c5fe1a2a 100644
--- a/docs/source/en/trainer.md
+++ b/docs/source/en/trainer.md
@@ -299,7 +299,7 @@ trainer = trl.SFTTrainer(
 trainer.train()
 ```
-To pass extra arguments supports by GaLore, you should pass correctly `optim_args`, for example:
+To pass extra arguments supported by GaLore, you should pass correctly `optim_args`, for example:
 ```python
 import torch