mirror of https://github.com/huggingface/transformers.git
synced 2025-07-03 12:50:06 +06:00

[Docs] Model_doc structure/clarity improvements (#26876)

* first batch of structure improvements for model_docs
* second batch of structure improvements for model_docs
* more structure improvements for model_docs
* more structure improvements for model_docs
* structure improvements for cv model_docs
* more structural refactoring
* addressed feedback about image processors

parent ad8ff96224
commit 5964f820db
@@ -45,7 +45,10 @@ self-supervised loss that focuses on modeling inter-sentence coherence, and show
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
SQuAD benchmarks while having fewer parameters compared to BERT-large.*

Tips:

This model was contributed by [lysandre](https://huggingface.co/lysandre). This model jax version was contributed by
[kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT).

## Usage tips

- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
than the left.
@@ -56,11 +59,7 @@ Tips:
- Layers are split in groups that share parameters (to save memory).
Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have been swapped or not.
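As a quick illustration of the padding tip above (a minimal sketch using the publicly available `albert-base-v2` checkpoint; not part of the original commit):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
# Padding is applied on the right by default, which is what is recommended
# for models with absolute position embeddings such as ALBERT.
batch = tokenizer(
    ["A short sentence.", "A noticeably longer sentence that receives no padding."],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # trailing zeros mark the right-side padding
```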

This model was contributed by [lysandre](https://huggingface.co/lysandre). This model jax version was contributed by
[kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT).

## Documentation resources
## Resources

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
@@ -90,6 +89,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This

[[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput

<frameworkcontent>
<pt>

## AlbertModel

[[autodoc]] AlbertModel
@@ -124,6 +126,10 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This
[[autodoc]] AlbertForQuestionAnswering
- forward

</pt>

<tf>

## TFAlbertModel

[[autodoc]] TFAlbertModel
@@ -159,6 +165,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This
[[autodoc]] TFAlbertForQuestionAnswering
- call

</tf>
<jax>

## FlaxAlbertModel

[[autodoc]] FlaxAlbertModel
@@ -193,3 +202,8 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This

[[autodoc]] FlaxAlbertForQuestionAnswering
- __call__

</jax>
</frameworkcontent>
@@ -24,7 +24,10 @@ The abstract from the paper is the following:
*Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.*

## Usage
This model was contributed by [Alara Dirik](https://huggingface.co/adirik).
The original code is not released; this implementation is based on the Kakao Brain implementation of the original paper.

## Usage example
ALIGN uses EfficientNet to get visual features and BERT to get the text features. Both the text and visual features are then projected to a latent space with identical dimension. The dot product between the projected image and text features is then used as a similarity score.

@@ -56,9 +59,6 @@ probs = logits_per_image.softmax(dim=1)
print(probs)
```

This model was contributed by [Alara Dirik](https://huggingface.co/adirik).
The original code is not released; this implementation is based on the Kakao Brain implementation of the original paper.

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ALIGN.
@@ -69,7 +69,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.

## AlignConfig

[[autodoc]] AlignConfig
@@ -31,7 +31,9 @@ teacher learning and contrastive learning. We validate our method through evalua
performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with
CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*

## Usage
This model was contributed by [jongjyh](https://huggingface.co/jongjyh).

## Usage tips and example

The usage of AltCLIP is very similar to that of CLIP; the difference from CLIP is the text encoder. Note that we use bidirectional attention instead of causal attention
and we take the [CLS] token in XLM-R to represent text embedding.
@@ -50,7 +52,6 @@ The [`AltCLIPProcessor`] wraps a [`CLIPImageProcessor`] and a [`XLMRobertaTokeni
encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
[`AltCLIPProcessor`] and [`AltCLIPModel`].

```python
>>> from PIL import Image
>>> import requests
@@ -70,11 +71,11 @@ encode the text and prepare the images. The following example shows how to get t
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```

Tips:
<Tip>

This model is build on `CLIPModel`, so use it like a original CLIP.
This model is based on `CLIPModel`, use it like you would use the original [CLIP](clip).

This model was contributed by [jongjyh](https://huggingface.co/jongjyh).
</Tip>

## AltCLIPConfig
@@ -26,15 +26,6 @@ The abstract from the paper is the following:
*In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.*

Tips:

- When fine-tuning the Audio Spectrogram Transformer (AST) on your own dataset, it's recommended to take care of the input normalization (to make
sure the input has mean of 0 and std of 0.5). [`ASTFeatureExtractor`] takes care of this. Note that it uses the AudioSet
mean and std by default. You can check [`ast/src/get_norm_stats.py`](https://github.com/YuanGongND/ast/blob/master/src/get_norm_stats.py) to see how
the authors compute the stats for a downstream dataset.
- Note that the AST needs a low learning rate (the authors use a 10 times smaller learning rate compared to their CNN model proposed in the
[PSLA paper](https://arxiv.org/abs/2102.01243)) and converges quickly, so please search for a suitable learning rate and learning rate scheduler for your task.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/audio_spectogram_transformer_architecture.png"
alt="drawing" width="600"/>

@@ -43,6 +34,15 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/YuanGongND/ast).

## Usage tips

- When fine-tuning the Audio Spectrogram Transformer (AST) on your own dataset, it's recommended to take care of the input normalization (to make
sure the input has mean of 0 and std of 0.5). [`ASTFeatureExtractor`] takes care of this. Note that it uses the AudioSet
mean and std by default. You can check [`ast/src/get_norm_stats.py`](https://github.com/YuanGongND/ast/blob/master/src/get_norm_stats.py) to see how
the authors compute the stats for a downstream dataset.
- Note that the AST needs a low learning rate (the authors use a 10 times smaller learning rate compared to their CNN model proposed in the
[PSLA paper](https://arxiv.org/abs/2102.01243)) and converges quickly, so please search for a suitable learning rate and learning rate scheduler for your task.
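As an illustration of the normalization tip above, here is a minimal sketch (not part of the original commit); the `mean`/`std` values are placeholders, not real dataset statistics:

```python
import numpy as np
from transformers import ASTFeatureExtractor

# The default extractor normalizes with the AudioSet statistics; override the
# spectrogram mean/std when fine-tuning on your own dataset (placeholder values).
feature_extractor = ASTFeatureExtractor(mean=-6.0, std=3.5)

dummy_audio = np.random.randn(16000)  # one second of fake 16 kHz audio
inputs = feature_extractor(dummy_audio, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)  # (1, max_length, num_mel_bins)
```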

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with the Audio Spectrogram Transformer.
@@ -39,13 +39,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

[[autodoc]] AutoformerConfig

## AutoformerModel

[[autodoc]] AutoformerModel
- forward

## AutoformerForPrediction

[[autodoc]] AutoformerForPrediction
@@ -14,8 +14,7 @@ specific language governing permissions and limitations under the License.

## Overview

Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).

Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).

Bark is made of 4 main models:

@@ -26,6 +25,9 @@ Bark is made of 4 main models:

It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.

This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) and [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi).
The original code can be found [here](https://github.com/suno-ai/bark).

### Optimizing Bark

Bark can be optimized with just a few extra lines of code, which **significantly reduces its memory footprint** and **accelerates inference**.
@@ -86,7 +88,7 @@ model.enable_cpu_offload()

Find out more on inference optimization techniques [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one).
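For instance, the optimizations above can be combined in a few lines (a sketch, assuming a CUDA device, the `suno/bark-small` checkpoint, and `accelerate` installed for CPU offload; not part of the original commit):

```python
import torch
from transformers import BarkModel

# Load in half precision and offload idle sub-models to the CPU between
# generation steps (requires the accelerate library and a CUDA device).
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to("cuda")
model.enable_cpu_offload()
```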

### Tips
### Usage tips
Suno offers a library of voice presets in a number of languages [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c).
These presets are also uploaded in the hub [here](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) or [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings).
@@ -142,11 +144,6 @@ To save the audio, simply take the sample rate from the model config and some sc
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
```

This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) and [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi).
The original code can be found [here](https://github.com/suno-ai/bark).

## BarkConfig

[[autodoc]] BarkConfig
@@ -25,9 +25,6 @@ rendered properly in your Markdown viewer.
</a>
</div>
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten

## Overview

The Bart model was proposed in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
@@ -45,7 +42,9 @@ According to the abstract,
state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
of up to 6 ROUGE.

Tips:
This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/bart).

## Usage tips:

- BART is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
@@ -57,18 +56,6 @@ Tips:
* permute sentences
* rotate the document to make it start at a specific token

This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The Authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/bart).

### Examples

- Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
[examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md).
- An example of how to train [`BartForConditionalGeneration`] with a Hugging Face `datasets`
object can be found in this [forum discussion](https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904).
- [Distilled checkpoints](https://huggingface.co/models?search=distilbart) are described in this [paper](https://arxiv.org/abs/2010.13002).

## Implementation Notes

- Bart doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or
@@ -112,6 +99,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`BartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb).
- [`TFBartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb).
- [`FlaxBartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/summarization).
- An example of how to train [`BartForConditionalGeneration`] with a Hugging Face `datasets` object can be found in this [forum discussion](https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904)
- [Summarization](https://huggingface.co/course/chapter7/5?fw=pt#summarization) chapter of the 🤗 Hugging Face course.
- [Summarization task guide](../tasks/summarization)

@@ -134,6 +122,7 @@ See also:
- [Text classification task guide](../tasks/sequence_classification)
- [Question answering task guide](../tasks/question_answering)
- [Causal language modeling task guide](../tasks/language_modeling)
- [Distilled checkpoints](https://huggingface.co/models?search=distilbart) are described in this [paper](https://arxiv.org/abs/2010.13002).

## BartConfig

@@ -150,6 +139,10 @@ See also:
[[autodoc]] BartTokenizerFast
- all

<frameworkcontent>
<pt>

## BartModel

[[autodoc]] BartModel
@@ -175,6 +168,9 @@ See also:
[[autodoc]] BartForCausalLM
- forward

</pt>
<tf>

## TFBartModel

[[autodoc]] TFBartModel
@@ -190,6 +186,9 @@ See also:
[[autodoc]] TFBartForSequenceClassification
- call

</tf>
<jax>

## FlaxBartModel

[[autodoc]] FlaxBartModel
@@ -222,3 +221,8 @@ See also:

[[autodoc]] FlaxBartForCausalLM
- __call__
</jax>
</frameworkcontent>
@@ -38,8 +38,14 @@ provides a significant boost over vanilla BARThez, and is on par with or outperf

This model was contributed by [moussakam](https://huggingface.co/moussakam). The Authors' code can be found [here](https://github.com/moussaKam/BARThez).

<Tip>

### Examples
BARThez implementation is the same as BART, except for tokenization. Refer to [BART documentation](bart) for information on
configuration classes and their parameters. BARThez-specific tokenizers are documented below.

</Tip>

## Resources

- BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check:
[examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md).
@@ -29,7 +29,9 @@ on a downstream task of Vietnamese text summarization show that in both automati
outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future
research and applications of generative Vietnamese NLP tasks.*

Example of use:
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).

## Usage example

```python
>>> import torch
@@ -54,7 +56,7 @@ Example of use:
>>> features = bartpho(**input_ids)
```

Tips:
## Usage tips

- Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of
both the encoder and decoder. Thus, usage examples in the [documentation of BART](bart), when adapting to use
@@ -79,8 +81,6 @@ Tips:
Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".

This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).

## BartphoTokenizer

[[autodoc]] BartphoTokenizer
@@ -39,7 +39,10 @@ with previous pre-training methods. For example, base-size BEiT achieves 83.2% t
significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains
86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).*

Tips:
This model was contributed by [nielsr](https://huggingface.co/nielsr). The JAX/FLAX version of this model was
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit).

## Usage tips

- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
outperform both the [original model (ViT)](vit) as well as [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
@@ -68,9 +71,6 @@ alt="drawing" width="600"/>

<small> BEiT pre-training. Taken from the <a href="https://arxiv.org/abs/2106.08254">original paper.</a> </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The JAX/FLAX version of this model was
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit).

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BEiT.
@@ -107,6 +107,9 @@ If you're interested in submitting a resource to be included here, please feel f
- preprocess
- post_process_semantic_segmentation

<frameworkcontent>
<pt>

## BeitModel

[[autodoc]] BeitModel
@@ -127,6 +130,9 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] BeitForSemanticSegmentation
- forward

</pt>
<jax>

## FlaxBeitModel

[[autodoc]] FlaxBeitModel
@@ -141,3 +147,6 @@ If you're interested in submitting a resource to be included here, please feel f

[[autodoc]] FlaxBeitForImageClassification
- __call__

</jax>
</frameworkcontent>
@@ -33,10 +33,13 @@ GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the
encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation,
Text Summarization, Sentence Splitting, and Sentence Fusion.*

Usage:
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder).

- The model can be used in combination with the [`EncoderDecoderModel`] to leverage two pretrained
BERT checkpoints for subsequent fine-tuning.
## Usage examples and tips

The model can be used in combination with the [`EncoderDecoderModel`] to leverage two pretrained BERT checkpoints for
subsequent fine-tuning:

```python
>>> # leverage checkpoints for Bert2Bert model...
@@ -61,8 +64,7 @@ Usage:
>>> loss.backward()
```

- Pretrained [`EncoderDecoderModel`] are also directly available in the model hub, e.g.,

Pretrained [`EncoderDecoderModel`] are also directly available in the model hub, e.g.:

```python
>>> # instantiate sentence fusion model
@@ -85,9 +87,6 @@ Tips:
- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
Therefore, no EOS token should be added to the end of the input.

This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder).

## BertGenerationConfig

[[autodoc]] BertGenerationConfig
@@ -67,12 +67,16 @@ Example of using a model with Character tokenization:
>>> outputs = bertjapanese(**inputs)
```

Tips:

- This implementation is the same as BERT, except for tokenization method. Refer to the [documentation of BERT](bert) for more usage examples.

This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku).

<Tip>

This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for
API reference information.

</Tip>

## BertJapaneseTokenizer

[[autodoc]] BertJapaneseTokenizer
@@ -45,7 +45,9 @@ language processing tasks, including pushing the GLUE score to 80.5% (7.7% point
accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute
improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).*

Tips:
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/bert).

## Usage tips

- BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
@@ -59,10 +61,6 @@ Tips:
- The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50% they are not related. The model has to predict if the sentences are consecutive or not.
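The following is a small illustrative sketch of how sentence pairs are packed for this second objective (not part of the original commit, using the `bert-base-uncased` checkpoint):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Sentence pairs are packed as [CLS] A [SEP] B [SEP]; token_type_ids indicate
# which segment each token belongs to.
encoding = tokenizer("The man went to the store.", "He bought a gallon of milk.")
print(tokenizer.decode(encoding["input_ids"]))
print(encoding["token_type_ids"])
```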

This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/bert).

## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
@@ -137,14 +135,23 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- create_token_type_ids_from_sequences
- save_vocabulary

<frameworkcontent>
<pt>

## BertTokenizerFast

[[autodoc]] BertTokenizerFast

</pt>
<tf>

## TFBertTokenizer

[[autodoc]] TFBertTokenizer

</tf>
</frameworkcontent>

## Bert specific outputs

[[autodoc]] models.bert.modeling_bert.BertForPreTrainingOutput
@@ -153,6 +160,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

[[autodoc]] models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput

<frameworkcontent>
<pt>

## BertModel

[[autodoc]] BertModel
@@ -198,6 +209,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] BertForQuestionAnswering
- forward

</pt>
<tf>

## TFBertModel

[[autodoc]] TFBertModel
@@ -243,6 +257,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] TFBertForQuestionAnswering
- call

</tf>
<jax>

## FlaxBertModel

[[autodoc]] FlaxBertModel
@@ -287,3 +304,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

[[autodoc]] FlaxBertForQuestionAnswering
- __call__

</jax>
</frameworkcontent>
@@ -28,7 +28,9 @@ al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-
2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
Part-of-speech tagging, Named-entity recognition and text classification.*

Example of use:
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet).

## Usage example

```python
>>> import torch
@@ -55,7 +57,12 @@ Example of use:
>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
```

This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet).
<Tip>

This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for
API reference information.

</Tip>

## BertweetTokenizer
@@ -41,7 +41,10 @@ sequence as part of the sparse attention mechanism. The proposed sparse attentio
BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
propose novel applications to genomics data.*

Tips:
This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta). The original code can be found
[here](https://github.com/google-research/bigbird).

## Usage tips

- For an in-detail explanation on how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird).
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For the sequence length < 1024, using
@@ -53,10 +56,8 @@ Tips:
- BigBird is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
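A minimal sketch of switching between the two implementations via `attention_type` (not part of the original commit; the `block_size`/`num_random_blocks` values are shown only for illustration):

```python
from transformers import BigBirdModel

# Dense attention is usually preferable for short inputs (< 1024 tokens).
model = BigBirdModel.from_pretrained("google/bigbird-roberta-base", attention_type="original_full")

# Sparse attention pays off on long inputs.
sparse_model = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-base",
    attention_type="block_sparse",
    block_size=64,
    num_random_blocks=3,
)
```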

This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta). The original code can be found
[here](https://github.com/google-research/bigbird).

## Documentation resources
## Resources

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
@@ -85,6 +86,9 @@ This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta

[[autodoc]] models.big_bird.modeling_big_bird.BigBirdForPreTrainingOutput

<frameworkcontent>
<pt>

## BigBirdModel

[[autodoc]] BigBirdModel
@@ -125,6 +129,9 @@ This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta
[[autodoc]] BigBirdForQuestionAnswering
- forward

</pt>
<jax>

## FlaxBigBirdModel

[[autodoc]] FlaxBigBirdModel
@@ -164,3 +171,8 @@ This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta

[[autodoc]] FlaxBigBirdForQuestionAnswering
- __call__

</jax>
</frameworkcontent>
@@ -41,7 +41,9 @@ sequence as part of the sparse attention mechanism. The proposed sparse attentio
BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
propose novel applications to genomics data.*

Tips:
The original code can be found [here](https://github.com/google-research/bigbird).

## Usage tips

- For an in-detail explanation on how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird).
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For the sequence length < 1024, using
@@ -54,9 +56,7 @@ Tips:
- BigBird is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.

The original code can be found [here](https://github.com/google-research/bigbird).

## Documentation resources
## Resources

- [Text classification task guide](../tasks/sequence_classification)
- [Question answering task guide](../tasks/question_answering)
@@ -25,15 +25,15 @@ The abstract from the paper is the following:
*Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.*

Tips:
This model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/BioGPT).

- BioGPT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.
## Usage tips

- BioGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left.
- BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows BioGPT to generate syntactically coherent text as it can be observed in the run_generation.py example script.
- The model can take the `past_key_values` (for PyTorch) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage.
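A minimal generation sketch illustrating the tips above (not part of the original commit; `generate` reuses `past_key_values` internally so previous tokens are not re-encoded at every step):

```python
from transformers import BioGptForCausalLM, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("COVID-19 is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```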

This model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/BioGPT).

## Documentation resources
## Resources

- [Causal language modeling task guide](../tasks/language_modeling)
@@ -25,15 +25,15 @@ The abstract from the paper is the following:
*Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.*

Tips:
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/google-research/big_transfer).

## Usage tips

- BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://arxiv.org/abs/1803.08494),
2) [weight standardization](https://arxiv.org/abs/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant
impact on transfer learning.

This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/google-research/big_transfer).

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BiT.
@@ -62,5 +62,4 @@ If you're interested in submitting a resource to be included here, please feel f
## BitForImageClassification

[[autodoc]] BitForImageClassification
- forward

- forward
@@ -40,15 +40,16 @@ and code publicly available. Human evaluations show our best models are superior
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*

Tips:

- Blenderbot Small is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.

This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The authors' code can be
found [here](https://github.com/facebookresearch/ParlAI).

## Documentation resources
## Usage tips

Blenderbot Small is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.

## Resources

- [Causal language modeling task guide](../tasks/language_modeling)
- [Translation task guide](../tasks/translation)
@@ -70,6 +71,9 @@ found [here](https://github.com/facebookresearch/ParlAI).

[[autodoc]] BlenderbotSmallTokenizerFast

<frameworkcontent>
<pt>

## BlenderbotSmallModel

[[autodoc]] BlenderbotSmallModel
@@ -85,6 +89,9 @@ found [here](https://github.com/facebookresearch/ParlAI).
[[autodoc]] BlenderbotSmallForCausalLM
- forward

</pt>
<tf>

## TFBlenderbotSmallModel

[[autodoc]] TFBlenderbotSmallModel
@@ -95,6 +102,9 @@ found [here](https://github.com/facebookresearch/ParlAI).
[[autodoc]] TFBlenderbotSmallForConditionalGeneration
- call

</tf>
<jax>

## FlaxBlenderbotSmallModel

[[autodoc]] FlaxBlenderbotSmallModel
@@ -108,3 +118,6 @@ found [here](https://github.com/facebookresearch/ParlAI).
- __call__
- encode
- decode

</jax>
</frameworkcontent>
@@ -16,8 +16,6 @@ rendered properly in your Markdown viewer.

# Blenderbot
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) .

## Overview

The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
@@ -36,26 +34,14 @@ and code publicly available. Human evaluations show our best models are superior
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*

Tips:

- Blenderbot is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.

This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/facebookresearch/ParlAI) .

## Usage tips and example

## Implementation Notes
Blenderbot is a model with absolute position embeddings so it's usually advised to pad the inputs on the right
rather than the left.

- Blenderbot uses a standard [seq2seq model transformer](https://arxiv.org/pdf/1706.03762.pdf) based architecture.
- Available checkpoints can be found in the [model hub](https://huggingface.co/models?search=blenderbot).
- This is the *default* Blenderbot model class. However, some smaller checkpoints, such as
`facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with
[BlenderbotSmall](blenderbot-small).

## Usage

Here is an example of model usage:
An example:

```python
>>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
@@ -70,7 +56,16 @@ Here is an example of model usage:
["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
```

## Documentation resources
## Implementation Notes

- Blenderbot uses a standard [seq2seq model transformer](https://arxiv.org/pdf/1706.03762.pdf) based architecture.
- Available checkpoints can be found in the [model hub](https://huggingface.co/models?search=blenderbot).
- This is the *default* Blenderbot model class. However, some smaller checkpoints, such as
`facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with
[BlenderbotSmall](blenderbot-small).

## Resources

- [Causal language modeling task guide](../tasks/language_modeling)
- [Translation task guide](../tasks/translation)
@@ -90,9 +85,13 @@ Here is an example of model usage:
[[autodoc]] BlenderbotTokenizerFast
- build_inputs_with_special_tokens

<frameworkcontent>
<pt>

## BlenderbotModel

See `transformers.BartModel` for arguments to *forward* and *generate*
See [`~transformers.BartModel`] for arguments to *forward* and *generate*

[[autodoc]] BlenderbotModel
- forward
@@ -109,6 +108,9 @@ See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* an
[[autodoc]] BlenderbotForCausalLM
- forward

</pt>
<tf>

## TFBlenderbotModel

[[autodoc]] TFBlenderbotModel
@@ -119,6 +121,9 @@ See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* an
[[autodoc]] TFBlenderbotForConditionalGeneration
- call

</tf>
<jax>

## FlaxBlenderbotModel

[[autodoc]] FlaxBlenderbotModel
@@ -132,3 +137,8 @@ See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* an
- __call__
- encode
- decode

</jax>
</frameworkcontent>
@@ -27,11 +27,6 @@ The abstract from the paper is the following:
*The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.*

Tips:

- BLIP-2 can be used for conditional text generation given an image and an optional text prompt. At inference time, it's recommended to use the [`generate`] method.
- One can use [`Blip2Processor`] to prepare images for the model, and decode the predicted token IDs back to text.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/blip2_architecture.jpg"
alt="drawing" width="600"/>

@@ -40,6 +35,11 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/salesforce/LAVIS/tree/5ee63d688ba4cebff63acee04adaef2dee9af207).

## Usage tips

- BLIP-2 can be used for conditional text generation given an image and an optional text prompt. At inference time, it's recommended to use the [`generate`] method.
- One can use [`Blip2Processor`] to prepare images for the model, and decode the predicted token IDs back to text.
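A minimal sketch of prompted generation combining the two tips above (not part of the original commit, using the publicly available `Salesforce/blip2-opt-2.7b` checkpoint):

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the image and an optional text prompt, then generate and decode.
inputs = processor(images=image, text="Question: how many cats are there? Answer:", return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```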

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLIP-2.
@@ -20,7 +20,7 @@ rendered properly in your Markdown viewer.
The BLIP model was proposed in [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.

BLIP is a model that is able to perform various multi-modal tasks including
BLIP is a model that is able to perform various multi-modal tasks including:
- Visual Question Answering
- Image-Text retrieval (Image-text matching)
- Image Captioning
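For example, image captioning can be run in a few lines (a minimal sketch, not part of the original commit, assuming the `Salesforce/blip-image-captioning-base` checkpoint):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Generate an unconditional caption for the image.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```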
@@ -39,7 +39,6 @@ The original code can be found [here](https://github.com/salesforce/BLIP).

- [Jupyter notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) on how to fine-tune BLIP for image captioning on a custom dataset

## BlipConfig

[[autodoc]] BlipConfig
@@ -57,12 +56,14 @@ The original code can be found [here](https://github.com/salesforce/BLIP).

[[autodoc]] BlipProcessor

## BlipImageProcessor

[[autodoc]] BlipImageProcessor
- preprocess

<frameworkcontent>
<pt>

## BlipModel

[[autodoc]] BlipModel
@@ -75,30 +76,29 @@ The original code can be found [here](https://github.com/salesforce/BLIP).
[[autodoc]] BlipTextModel
- forward

## BlipVisionModel

[[autodoc]] BlipVisionModel
- forward

## BlipForConditionalGeneration

[[autodoc]] BlipForConditionalGeneration
- forward

## BlipForImageTextRetrieval

[[autodoc]] BlipForImageTextRetrieval
- forward

## BlipForQuestionAnswering

[[autodoc]] BlipForQuestionAnswering
- forward

</pt>
<tf>

## TFBlipModel

[[autodoc]] TFBlipModel
@@ -111,26 +111,24 @@ The original code can be found [here](https://github.com/salesforce/BLIP).
[[autodoc]] TFBlipTextModel
- call

## TFBlipVisionModel

[[autodoc]] TFBlipVisionModel
- call

## TFBlipForConditionalGeneration

[[autodoc]] TFBlipForConditionalGeneration
- call

## TFBlipForImageTextRetrieval

[[autodoc]] TFBlipForImageTextRetrieval
- call

## TFBlipForQuestionAnswering

[[autodoc]] TFBlipForQuestionAnswering
- call
- call
</tf>
</frameworkcontent>
@@ -56,16 +56,20 @@ See also:
[[autodoc]] BloomConfig
- all

## BloomModel

[[autodoc]] BloomModel
- forward

## BloomTokenizerFast

[[autodoc]] BloomTokenizerFast
- all

<frameworkcontent>
<pt>

## BloomModel

[[autodoc]] BloomModel
- forward

## BloomForCausalLM

[[autodoc]] BloomForCausalLM
@@ -86,6 +90,9 @@ See also:
[[autodoc]] BloomForQuestionAnswering
- forward

</pt>
<jax>

## FlaxBloomModel

[[autodoc]] FlaxBloomModel
@@ -95,3 +102,8 @@ See also:

[[autodoc]] FlaxBloomForCausalLM
- __call__

</jax>
</frameworkcontent>
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.

<Tip warning={true}>

This model is in maintenance mode only, so we won't accept any new PRs changing its code.
This model is in maintenance mode only, we do not accept any new PRs changing its code.

If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0.
You can do so by running the following command: `pip install -U transformers==4.30.0`.
@@ -43,13 +43,15 @@ hardware. It is also 7.9x faster on a CPU, as well as being better performing th
architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%,
absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.*

Tips:
This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).

- BORT's model architecture is based on BERT, so one can refer to [BERT's documentation page](bert) for the
model's API as well as usage examples.
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to [RoBERTa's documentation page](roberta) for the tokenizer's API as well as usage examples.
## Usage tips

- BORT's model architecture is based on BERT, refer to [BERT's documentation page](bert) for the
model's API reference as well as usage examples.
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, refer to [RoBERTa's documentation page](roberta) for the tokenizer's API reference as well as usage examples.
- BORT requires a specific fine-tuning algorithm, called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology) ,
that is sadly not open-sourced yet. It would be very useful for the community, if someone tries to implement the
algorithm to make BORT fine-tuning work.

This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).
@@ -37,7 +37,9 @@ alt="drawing" width="600"/>

<small> BridgeTower architecture. Taken from the <a href="https://arxiv.org/abs/2206.08657">original paper.</a> </small>

## Usage
This model was contributed by [Anahita Bhiwandiwalla](https://huggingface.co/anahita-b), [Tiep Le](https://huggingface.co/Tile) and [Shaoyen Tseng](https://huggingface.co/shaoyent). The original code can be found [here](https://github.com/microsoft/BridgeTower).

## Usage tips and examples

BridgeTower consists of a visual encoder, a textual encoder and cross-modal encoder with multiple lightweight bridge layers.
The goal of this approach was to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder.
@@ -116,9 +118,6 @@ The following example shows how to run masked language modeling using [`BridgeTo
.a cat looking out of the window.
```
This model was contributed by [Anahita Bhiwandiwalla](https://huggingface.co/anahita-b), [Tiep Le](https://huggingface.co/Tile) and [Shaoyen Tseng](https://huggingface.co/shaoyent). The original code can be found [here](https://github.com/microsoft/BridgeTower).

Tips:

- This implementation of BridgeTower uses [`RobertaTokenizer`] to generate text embeddings and OpenAI's CLIP/ViT model to compute visual embeddings.
@@ -31,12 +31,13 @@ AMLM is a 2D version of TMLM. It randomly masks text tokens and predicts with th

BROS achieves comparable or better results on Key Information Extraction (KIE) benchmarks such as FUNSD, SROIE, CORD and SciTSR, without relying on explicit visual features.

The abstract from the paper is the following:
*Key information extraction (KIE) from document images requires understanding the contextual and spatial semantics of texts in two-dimensional (2D) space. Many recent studies try to solve the task by developing pre-trained language models focusing on combining visual features from document images with texts and their layout. On the other hand, this paper tackles the problem by going back to the basic: effective combination of text and layout. Specifically, we propose a pre-trained language model, named BROS (BERT Relying On Spatiality), that encodes relative positions of texts in 2D space and learns from unlabeled documents with area-masking strategy. With this optimized training scheme for understanding texts in 2D space, BROS shows comparable or better performance compared to previous methods on four KIE benchmarks (FUNSD, SROIE*, CORD, and SciTSR) without relying on visual features. This paper also reveals two real-world challenges in KIE tasks-(1) minimizing the error from incorrect text ordering and (2) efficient learning from fewer downstream examples-and demonstrates the superiority of BROS over previous methods.*

Tips:
This model was contributed by [jinho8345](https://huggingface.co/jinho8345). The original code can be found [here](https://github.com/clovaai/bros).

## Usage tips and examples

- [`~transformers.BrosModel.forward`] requires `input_ids` and `bbox` (bounding box). Each bounding box should be in (x0, y0, x1, y1) format (top-left corner, bottom-right corner). Obtaining the bounding boxes depends on an external OCR system. The `x` coordinate should be normalized by document image width, and the `y` coordinate should be normalized by document image height.
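A small sketch of the coordinate normalization described above (not part of the original commit; the pixel values are placeholders):

```python
# Normalize (x0, y0, x1, y1) pixel boxes by the page size before passing them
# to the model as `bbox`.
def normalize_bbox(bbox, width, height):
    x0, y0, x1, y1 = bbox
    return [x0 / width, y0 / height, x1 / width, y1 / height]

print(normalize_bbox([100, 50, 400, 80], width=800, height=1000))
# [0.125, 0.05, 0.5, 0.08]
```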
|
||||
|
||||
@ -78,9 +79,9 @@ def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512):
|
||||
|
||||
```
|
||||
|
||||
- Demo scripts can be found [here](https://github.com/clovaai/bros).
|
||||
## Resources
|
||||
|
||||
This model was contributed by [jinho8345](https://huggingface.co/jinho8345). The original code can be found [here](https://github.com/clovaai/bros).
|
||||
- Demo scripts can be found [here](https://github.com/clovaai/bros).
|
||||
|
||||
## BrosConfig
|
||||
|
||||
@ -102,13 +103,11 @@ This model was contributed by [jinho8345](https://huggingface.co/jinho8345). The
|
||||
[[autodoc]] BrosForTokenClassification
|
||||
- forward
|
||||
|
||||
|
||||
## BrosSpadeEEForTokenClassification
|
||||
|
||||
[[autodoc]] BrosSpadeEEForTokenClassification
|
||||
- forward
|
||||
|
||||
|
||||
## BrosSpadeELForTokenClassification
|
||||
|
||||
[[autodoc]] BrosSpadeELForTokenClassification
|
||||
|
@ -40,14 +40,18 @@ experiments.*
|
||||
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
|
||||
found [here](https://github.com/google-research/byt5).
|
||||
|
||||
ByT5's architecture is based on the T5v1.1 model, so one can refer to [T5v1.1's documentation page](t5v1.1). They
|
||||
<Tip>
|
||||
|
||||
ByT5's architecture is based on the T5v1.1 model; refer to [T5v1.1's documentation page](t5v1.1) for the API reference. They
|
||||
only differ in how inputs should be prepared for the model; see the code examples below.
|
||||
|
||||
</Tip>
|
||||
|
||||
Since ByT5 was pre-trained in an unsupervised fashion, there's no real advantage to using a task prefix during single-task
|
||||
fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
|
||||
|
||||
|
||||
### Example
|
||||
## Usage example
|
||||
|
||||
ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
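
A minimal sketch: the input ids are just UTF-8 bytes shifted by the number of reserved special tokens, and `google/byt5-small` is one of the released checkpoints.

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

num_special_tokens = 3  # ByT5 reserves ids 0-2 for pad/eos/unk, so raw bytes are shifted by 3
input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens

loss = model(input_ids, labels=labels).loss
```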
|
||||
|
||||
|
@ -34,14 +34,16 @@ dependency parsing, named-entity recognition, and natural language inference. Ca
|
||||
for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
|
||||
downstream applications for French NLP.*
|
||||
|
||||
Tips:
|
||||
|
||||
- This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples
|
||||
as well as the information relative to the inputs and outputs.
|
||||
|
||||
This model was contributed by [camembert](https://huggingface.co/camembert). The original code can be found [here](https://camembert-model.fr/).
|
||||
|
||||
## Documentation resources
|
||||
<Tip>
|
||||
|
||||
This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples as well
|
||||
as the information relative to the inputs and outputs.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
@ -66,6 +68,9 @@ This model was contributed by [camembert](https://huggingface.co/camembert). The
|
||||
|
||||
[[autodoc]] CamembertTokenizerFast
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## CamembertModel
|
||||
|
||||
[[autodoc]] CamembertModel
|
||||
@ -94,6 +99,9 @@ This model was contributed by [camembert](https://huggingface.co/camembert). The
|
||||
|
||||
[[autodoc]] CamembertForQuestionAnswering
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFCamembertModel
|
||||
|
||||
[[autodoc]] TFCamembertModel
|
||||
@ -121,3 +129,7 @@ This model was contributed by [camembert](https://huggingface.co/camembert). The
|
||||
## TFCamembertForQuestionAnswering
|
||||
|
||||
[[autodoc]] TFCamembertForQuestionAnswering
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
|
@ -37,7 +37,9 @@ To use its finer-grained input effectively and efficiently, CANINE combines down
|
||||
sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by
|
||||
2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.*
|
||||
|
||||
Tips:
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/google-research/language/tree/master/language/canine).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- CANINE uses no less than 3 Transformer encoders internally: 2 "shallow" encoders (which only consist of a single
|
||||
layer) and 1 "deep" encoder (which is a regular BERT encoder). First, a "shallow" encoder is used to contextualize
|
||||
@ -50,19 +52,18 @@ Tips:
|
||||
(which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
|
||||
tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
|
||||
details for this can be found in the paper.
|
||||
- Models:
|
||||
|
||||
Model checkpoints:
|
||||
|
||||
- [google/canine-c](https://huggingface.co/google/canine-c): Pre-trained with autoregressive character loss,
|
||||
12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
|
||||
- [google/canine-s](https://huggingface.co/google/canine-s): Pre-trained with subword loss, 12-layer,
|
||||
768-hidden, 12-heads, 121M parameters (size ~500 MB).
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/google-research/language/tree/master/language/canine).
|
||||
|
||||
## Usage example
|
||||
|
||||
### Example
|
||||
|
||||
CANINE works on raw characters, so it can be used without a tokenizer:
|
||||
CANINE works on raw characters, so it can be used **without a tokenizer**:
|
||||
|
||||
```python
|
||||
>>> from transformers import CanineModel
|
||||
@ -96,17 +97,13 @@ sequences to the same length):
|
||||
>>> sequence_output = outputs.last_hidden_state
|
||||
```
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
- [Question answering task guide](../tasks/question_answering)
|
||||
- [Multiple choice task guide](../tasks/multiple_choice)
|
||||
|
||||
## CANINE specific outputs
|
||||
|
||||
[[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling
|
||||
|
||||
## CanineConfig
|
||||
|
||||
[[autodoc]] CanineConfig
|
||||
@ -118,6 +115,10 @@ sequences to the same length):
|
||||
- get_special_tokens_mask
|
||||
- create_token_type_ids_from_sequences
|
||||
|
||||
## CANINE specific outputs
|
||||
|
||||
[[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling
|
||||
|
||||
## CanineModel
|
||||
|
||||
[[autodoc]] CanineModel
|
||||
|
@ -25,7 +25,9 @@ The abstract from the paper is the following:
|
||||
|
||||
*The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). Our codes, pretrained models, and demos have been released.*
|
||||
|
||||
## Usage
|
||||
The Chinese-CLIP model was contributed by [OFA-Sys](https://huggingface.co/OFA-Sys).
|
||||
|
||||
## Usage example
|
||||
|
||||
The code snippet below shows how to compute image & text features and similarities:
|
||||
|
||||
@ -59,15 +61,13 @@ The code snippet below shows how to compute image & text features and similariti
|
||||
>>> probs = logits_per_image.softmax(dim=1) # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]]
|
||||
```
|
||||
|
||||
Currently, we release the following scales of pretrained Chinese-CLIP models at HF Model Hub:
|
||||
Currently, the following scales of pretrained Chinese-CLIP models are available on the 🤗 Hub:
|
||||
|
||||
- [OFA-Sys/chinese-clip-vit-base-patch16](https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16)
|
||||
- [OFA-Sys/chinese-clip-vit-large-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14)
|
||||
- [OFA-Sys/chinese-clip-vit-large-patch14-336px](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14-336px)
|
||||
- [OFA-Sys/chinese-clip-vit-huge-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14)
|
||||
|
||||
The Chinese-CLIP model was contributed by [OFA-Sys](https://huggingface.co/OFA-Sys).
|
||||
|
||||
## ChineseCLIPConfig
|
||||
|
||||
[[autodoc]] ChineseCLIPConfig
|
||||
|
@ -30,7 +30,6 @@ The abstract from the paper is the following:
|
||||
This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker).
|
||||
The original code can be found [here](https://github.com/LAION-AI/Clap).
|
||||
|
||||
|
||||
## ClapConfig
|
||||
|
||||
[[autodoc]] ClapConfig
|
||||
@ -78,4 +77,3 @@ The original code can be found [here](https://github.com/LAION-AI/Clap).
|
||||
|
||||
[[autodoc]] ClapAudioModelWithProjection
|
||||
- forward
|
||||
|
||||
|
@ -40,7 +40,9 @@ for any dataset specific training. For instance, we match the accuracy of the or
|
||||
without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained
|
||||
model weights at this https URL.*
|
||||
|
||||
## Usage
|
||||
This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP).
|
||||
|
||||
## Usage tips and example
|
||||
|
||||
CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
|
||||
classification. CLIP uses a ViT-like transformer to get visual features and a causal language model to get the text
|
||||
@ -77,8 +79,6 @@ encode the text and prepare the images. The following example shows how to get t
|
||||
>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
|
||||
```
|
||||
|
||||
This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP).
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP.
|
||||
@ -142,6 +142,9 @@ The resource should ideally demonstrate something new instead of duplicating an
|
||||
|
||||
[[autodoc]] CLIPProcessor
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## CLIPModel
|
||||
|
||||
[[autodoc]] CLIPModel
|
||||
@ -164,12 +167,14 @@ The resource should ideally demonstrate something new instead of duplicating an
|
||||
[[autodoc]] CLIPVisionModelWithProjection
|
||||
- forward
|
||||
|
||||
|
||||
## CLIPVisionModel
|
||||
|
||||
[[autodoc]] CLIPVisionModel
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFCLIPModel
|
||||
|
||||
[[autodoc]] TFCLIPModel
|
||||
@ -187,6 +192,9 @@ The resource should ideally demonstrate something new instead of duplicating an
|
||||
[[autodoc]] TFCLIPVisionModel
|
||||
- call
|
||||
|
||||
</tf>
|
||||
<jax>
|
||||
|
||||
## FlaxCLIPModel
|
||||
|
||||
[[autodoc]] FlaxCLIPModel
|
||||
@ -208,3 +216,6 @@ The resource should ideally demonstrate something new instead of duplicating an
|
||||
|
||||
[[autodoc]] FlaxCLIPVisionModel
|
||||
- __call__
|
||||
|
||||
</jax>
|
||||
</frameworkcontent>
|
||||
|
@ -41,13 +41,6 @@ to any binary segmentation task where a text or image query
|
||||
can be formulated. Finally, we find our system to adapt well
|
||||
to generalized queries involving affordances or properties*
|
||||
|
||||
Tips:
|
||||
|
||||
- [`CLIPSegForImageSegmentation`] adds a decoder on top of [`CLIPSegModel`]. The latter is identical to [`CLIPModel`].
|
||||
- [`CLIPSegForImageSegmentation`] can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text
|
||||
(provided to the model as `input_ids`) or an image (provided to the model as `conditional_pixel_values`). One can also provide custom
|
||||
conditional embeddings (provided to the model as `conditional_embeddings`).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/clipseg_architecture.png"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
@ -56,6 +49,13 @@ alt="drawing" width="600"/>
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/timojl/clipseg).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- [`CLIPSegForImageSegmentation`] adds a decoder on top of [`CLIPSegModel`]. The latter is identical to [`CLIPModel`].
|
||||
- [`CLIPSegForImageSegmentation`] can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text
|
||||
(provided to the model as `input_ids`) or an image (provided to the model as `conditional_pixel_values`). One can also provide custom
|
||||
conditional embeddings (provided to the model as `conditional_embeddings`).
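
For instance, text-prompted segmentation can be sketched as follows (a minimal sketch, assuming the publicly available `CIDAS/clipseg-rd64-refined` checkpoint and an example COCO image):

```python
import torch
import requests
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompts = ["a cat", "a remote control"]

# one (image, text) pair per prompt
inputs = processor(text=prompts, images=[image] * len(prompts), padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
masks = outputs.logits  # one low-resolution mask per prompt; apply a sigmoid and resize as needed
```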
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIPSeg. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
@ -24,7 +24,11 @@ The abstract from the paper is the following:
|
||||
|
||||
*We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.*
|
||||
|
||||
Check out all Code Llama models [here](https://huggingface.co/models?search=code_llama) and the officially released ones in the [codellama org](https://huggingface.co/codellama).
|
||||
Check out all Code Llama model checkpoints [here](https://huggingface.co/models?search=code_llama) and the officially released ones in the [codellama org](https://huggingface.co/codellama).
|
||||
|
||||
This model was contributed by [ArthurZucker](https://huggingface.co/ArthurZ). The original code of the authors can be found [here](https://github.com/facebookresearch/llama).
|
||||
|
||||
## Usage tips and examples
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
@ -38,21 +42,22 @@ As mentioned above, the `dtype` of the storage weights is mostly irrelevant unle
|
||||
|
||||
</Tip>
|
||||
|
||||
Tips:
|
||||
|
||||
- These models have the same architecture as the `Llama2` models
|
||||
Tips:
|
||||
- The infilling task is supported out of the box. Use `tokenizer.fill_token` where you want your input to be filled.
|
||||
- The model conversion script is the same as for the `Llama2` family:
|
||||
|
||||
Here is a sample usage
|
||||
Here is a sample usage:
|
||||
|
||||
```bash
|
||||
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
|
||||
--input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
|
||||
```
|
||||
|
||||
Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even if the biggest versions
|
||||
come in several checkpoints, each of them containing a part of each weight of the model, so we need to load them all in RAM).
|
||||
|
||||
- After conversion, the model and tokenizer can be loaded via:
|
||||
After conversion, the model and tokenizer can be loaded via:
|
||||
|
||||
```python
|
||||
>>> from transformers import LlamaForCausalLM, CodeLlamaTokenizer
|
||||
@ -95,9 +100,13 @@ If you only want the infilled part:
|
||||
|
||||
Under the hood, the tokenizer [automatically splits by `<FILL_ME>`](https://huggingface.co/docs/transformers/main/model_doc/code_llama#transformers.CodeLlamaTokenizer.fill_token) to create a formatted input string that follows [the original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself: it avoids pitfalls, such as token glueing, that are very hard to debug. To see how much CPU and GPU memory you need for this model or others, try [this calculator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) which can help determine that value.
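
A minimal infilling sketch, assuming the `codellama/CodeLlama-7b-hf` checkpoint (per the paper, the 7B and 13B base and Instruct variants support infilling):

```python
from transformers import LlamaForCausalLM, CodeLlamaTokenizer

tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
model = LlamaForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

prompt = 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_new_tokens=128)
# decode only the newly generated tokens, i.e. the infilled part
filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
print(prompt.replace("<FILL_ME>", filling))
```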
|
||||
|
||||
- The LLaMA tokenizer is a BPE model based on [sentencepiece](https://github.com/google/sentencepiece). One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. "Banana"), the tokenizer does not prepend the prefix space to the string.
|
||||
The LLaMA tokenizer is a BPE model based on [sentencepiece](https://github.com/google/sentencepiece). One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. "Banana"), the tokenizer does not prepend the prefix space to the string.
|
||||
|
||||
This model was contributed by [ArthurZucker](https://huggingface.co/ArthurZ). The original code of the authors can be found [here](https://github.com/facebookresearch/llama).
|
||||
<Tip>
|
||||
|
||||
Code Llama has the same architecture as the `Llama2` models; refer to [Llama2's documentation page](llama2) for the API reference.
|
||||
The Code Llama tokenizer reference can be found below.
|
||||
</Tip>
|
||||
|
||||
|
||||
## CodeLlamaTokenizer
|
||||
|
@ -40,7 +40,7 @@ The original code can be found [here](https://github.com/salesforce/codegen).
|
||||
* `mono`: Initialized with `multi`, then further pre-trained on Python data
|
||||
* For example, `Salesforce/codegen-350M-mono` offers a 350 million-parameter checkpoint pre-trained sequentially on the Pile, multiple programming languages, and Python.
|
||||
|
||||
## How to use
|
||||
## Usage example
|
||||
|
||||
```python
|
||||
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
@ -60,7 +60,7 @@ def hello_world():
|
||||
hello_world()
|
||||
```
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
|
||||
|
@ -31,7 +31,7 @@ alt="drawing" width="600"/>
|
||||
|
||||
This model was contributed by [DepuMeng](https://huggingface.co/DepuMeng). The original code can be found [here](https://github.com/Atten4Vis/ConditionalDETR).
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Object detection task guide](../tasks/object_detection)
|
||||
|
||||
|
@ -44,12 +44,14 @@ ConvBERT significantly outperforms BERT and its variants in various downstream t
|
||||
fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
|
||||
using less than 1/4 training cost. Code and pre-trained models will be released.*
|
||||
|
||||
ConvBERT training tips are similar to those of BERT.
|
||||
|
||||
This model was contributed by [abhishek](https://huggingface.co/abhishek). The original implementation can be found
|
||||
[here](https://github.com/yitu-opensource/ConvBert).
|
||||
|
||||
## Documentation resources
|
||||
## Usage tips
|
||||
|
||||
ConvBERT training tips are similar to those of BERT. For usage tips, refer to the [BERT documentation](bert).
|
||||
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
@ -73,6 +75,9 @@ here: https://github.com/yitu-opensource/ConvBert
|
||||
|
||||
[[autodoc]] ConvBertTokenizerFast
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## ConvBertModel
|
||||
|
||||
[[autodoc]] ConvBertModel
|
||||
@ -103,6 +108,9 @@ here: https://github.com/yitu-opensource/ConvBert
|
||||
[[autodoc]] ConvBertForQuestionAnswering
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFConvBertModel
|
||||
|
||||
[[autodoc]] TFConvBertModel
|
||||
@ -132,3 +140,6 @@ here: https://github.com/yitu-opensource/ConvBert
|
||||
|
||||
[[autodoc]] TFConvBertForQuestionAnswering
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
@ -32,10 +32,6 @@ of a vision Transformer, and discover several key components that contribute to
|
||||
dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy
|
||||
and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.*
|
||||
|
||||
Tips:
|
||||
|
||||
- See the code examples below each model regarding usage.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/convnext_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
@ -68,6 +64,9 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
[[autodoc]] ConvNextImageProcessor
|
||||
- preprocess
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## ConvNextModel
|
||||
|
||||
[[autodoc]] ConvNextModel
|
||||
@ -78,14 +77,18 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
[[autodoc]] ConvNextForImageClassification
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFConvNextModel
|
||||
|
||||
[[autodoc]] TFConvNextModel
|
||||
- call
|
||||
|
||||
|
||||
## TFConvNextForImageClassification
|
||||
|
||||
[[autodoc]] TFConvNextForImageClassification
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
@ -25,10 +25,6 @@ The abstract from the paper is the following:
|
||||
|
||||
*Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.*
|
||||
|
||||
Tips:
|
||||
|
||||
- See the code examples below each model regarding usage.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/convnextv2_architecture.png"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
|
@ -37,7 +37,14 @@ NLP tasks in the settings of few-shot (even zero-shot) learning.*
|
||||
This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
|
||||
here: https://github.com/TsinghuaAI/CPM-Generate
|
||||
|
||||
Note: We only have a tokenizer here, since the model architecture is the same as GPT-2.
|
||||
|
||||
<Tip>
|
||||
|
||||
CPM's architecture is the same as GPT-2, except for the tokenization method. Refer to the [GPT-2 documentation](gpt2) for
|
||||
API reference information.
|
||||
|
||||
</Tip>
|
||||
|
||||
|
||||
## CpmTokenizer
|
||||
|
||||
|
@ -20,11 +20,10 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
CPM-Ant is an open-source Chinese pre-trained language model (PLM) with 10B parameters. It is also the first milestone of the live training process of CPM-Live. The training process is cost-effective and environment-friendly. CPM-Ant also achieves promising results with delta tuning on the CUGE benchmark. Besides the full model, we also provide various compressed versions to meet the requirements of different hardware configurations. [See more](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live)
|
||||
|
||||
Tips:
|
||||
|
||||
This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The original code can be found [here](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live).
|
||||
|
||||
⚙️ Training & Inference
|
||||
## Resources
|
||||
|
||||
- A tutorial on [CPM-Live](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live).
|
||||
|
||||
## CpmAntConfig
|
||||
|
@ -41,7 +41,10 @@ providing more explicit control over text generation. These codes also allow CTR
|
||||
training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
|
||||
via model-based source attribution.*
|
||||
|
||||
Tips:
|
||||
This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found
|
||||
[here](https://github.com/salesforce/ctrl).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
|
||||
or links to generate coherent text. Refer to the [original implementation](https://github.com/salesforce/ctrl) for
|
||||
@ -56,10 +59,8 @@ Tips:
|
||||
pre-computed values in the context of text generation. See the [`forward`](model_doc/ctrl#transformers.CTRLModel.forward)
|
||||
method for more information on the usage of this argument.
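
Putting these tips together, here is a minimal generation sketch. It assumes the `Salesforce/ctrl` checkpoint (a large model) and starts the prompt with a control code such as `Links`:

```python
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

# the leading "Links" control code steers the style of the generation
inputs = tokenizer("Links Hello, my dog is cute", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, repetition_penalty=1.2)
print(tokenizer.decode(outputs[0]))
```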
|
||||
|
||||
This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found
|
||||
[here](https://github.com/salesforce/ctrl).
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
@ -73,6 +74,9 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis
|
||||
[[autodoc]] CTRLTokenizer
|
||||
- save_vocabulary
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## CTRLModel
|
||||
|
||||
[[autodoc]] CTRLModel
|
||||
@ -88,6 +92,9 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis
|
||||
[[autodoc]] CTRLForSequenceClassification
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFCTRLModel
|
||||
|
||||
[[autodoc]] TFCTRLModel
|
||||
@ -102,3 +109,6 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis
|
||||
|
||||
[[autodoc]] TFCTRLForSequenceClassification
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
@ -33,15 +33,15 @@ performance gains are maintained when pretrained on larger datasets (\eg ImageNe
|
||||
ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7\% on the ImageNet-1k val set. Finally, our results show that the positional encoding,
|
||||
a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks.*
|
||||
|
||||
Tips:
|
||||
This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/microsoft/CvT).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- CvT models are regular Vision Transformers, but trained with convolutions. They outperform the [original model (ViT)](vit) when fine-tuned on ImageNet-1K and CIFAR-100.
|
||||
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoImageProcessor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]); a short classification sketch is shown after these tips.
|
||||
- The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of 14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
|
||||
images and 1,000 classes).
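
A minimal classification sketch, assuming the `microsoft/cvt-13` checkpoint and an example COCO image:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, CvtForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtForImageClassification.from_pretrained("microsoft/cvt-13")

inputs = image_processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```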
|
||||
|
||||
This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/microsoft/CvT).
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CvT.
|
||||
@ -57,6 +57,9 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] CvtConfig
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## CvtModel
|
||||
|
||||
[[autodoc]] CvtModel
|
||||
@ -67,6 +70,9 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
[[autodoc]] CvtForImageClassification
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFCvtModel
|
||||
|
||||
[[autodoc]] TFCvtModel
|
||||
@ -77,3 +83,5 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
[[autodoc]] TFCvtForImageClassification
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
@ -35,19 +35,18 @@ the entire input. Experiments on the major benchmarks of speech recognition, ima
|
||||
natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
|
||||
Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/data2vec.*
|
||||
|
||||
Tips:
|
||||
|
||||
- Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method.
|
||||
- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
|
||||
- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
|
||||
- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
|
||||
|
||||
This model was contributed by [edugp](https://huggingface.co/edugp) and [patrickvonplaten](https://huggingface.co/patrickvonplaten).
|
||||
[sayakpaul](https://github.com/sayakpaul) and [Rocketknight1](https://github.com/Rocketknight1) contributed Data2Vec for vision in TensorFlow.
|
||||
|
||||
The original code (for NLP and Speech) can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/data2vec).
|
||||
The original code for vision can be found [here](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method.
|
||||
- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction.
|
||||
- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
|
||||
- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
|
||||
|
||||
## Resources
|
||||
|
||||
@ -88,6 +87,8 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] Data2VecVisionConfig
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## Data2VecAudioModel
|
||||
|
||||
@ -164,6 +165,9 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
[[autodoc]] Data2VecVisionForSemanticSegmentation
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFData2VecVisionModel
|
||||
|
||||
[[autodoc]] TFData2VecVisionModel
|
||||
@ -178,3 +182,6 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] TFData2VecVisionForSemanticSegmentation
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
@ -62,7 +62,7 @@ New in v2:
|
||||
This model was contributed by [DeBERTa](https://huggingface.co/DeBERTa). This model TF 2.0 implementation was
|
||||
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/DeBERTa).
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
@ -88,6 +88,9 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code
|
||||
- build_inputs_with_special_tokens
|
||||
- create_token_type_ids_from_sequences
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## DebertaV2Model
|
||||
|
||||
[[autodoc]] DebertaV2Model
|
||||
@ -123,6 +126,9 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code
|
||||
[[autodoc]] DebertaV2ForMultipleChoice
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFDebertaV2Model
|
||||
|
||||
[[autodoc]] TFDebertaV2Model
|
||||
@ -157,3 +163,6 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code
|
||||
|
||||
[[autodoc]] TFDebertaV2ForMultipleChoice
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
@ -94,6 +94,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
- build_inputs_with_special_tokens
|
||||
- create_token_type_ids_from_sequences
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## DebertaModel
|
||||
|
||||
[[autodoc]] DebertaModel
|
||||
@ -123,6 +126,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] DebertaForQuestionAnswering
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFDebertaModel
|
||||
|
||||
[[autodoc]] TFDebertaModel
|
||||
@ -152,3 +158,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] TFDebertaForQuestionAnswering
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
|
@ -33,9 +33,7 @@ This allows us to draw upon the simplicity and scalability of the Transformer ar
|
||||
Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on
|
||||
Atari, OpenAI Gym, and Key-to-Door tasks.*
|
||||
|
||||
Tips:
|
||||
|
||||
This version of the model is for tasks where the state is a vector, image-based states will come soon.
|
||||
This version of the model is for tasks where the state is a vector.
|
||||
|
||||
This model was contributed by [edbeeching](https://huggingface.co/edbeeching). The original code can be found [here](https://github.com/kzl/decision-transformer).
|
||||
|
||||
|
@ -25,11 +25,6 @@ The abstract from the paper is the following:
|
||||
|
||||
*DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach.*
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use [`DeformableDetrImageProcessor`] to prepare images (and optional targets) for the model.
|
||||
- Training Deformable DETR is equivalent to training the original [DETR](detr) model. See the [resources](#resources) section below for demo notebooks.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/deformable_detr_architecture.png"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
@ -37,6 +32,10 @@ alt="drawing" width="600"/>
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/fundamentalvision/Deformable-DETR).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Training Deformable DETR is equivalent to training the original [DETR](detr) model. See the [resources](#resources) section below for demo notebooks.
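
For inference, a minimal object-detection sketch (assuming the `SenseTime/deformable-detr` checkpoint and an example COCO image):

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DeformableDetrForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
model = DeformableDetrForObjectDetection.from_pretrained("SenseTime/deformable-detr")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# convert raw outputs to (score, label, box) triples in absolute image coordinates
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), [round(c, 1) for c in box.tolist()])
```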
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Deformable DETR.
|
||||
|
@ -16,13 +16,6 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# DeiT
|
||||
|
||||
<Tip>
|
||||
|
||||
This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
|
||||
breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
|
||||
|
||||
</Tip>
|
||||
|
||||
## Overview
|
||||
|
||||
The DeiT model was proposed in [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
|
||||
@ -45,7 +38,9 @@ distillation, especially when using a convnet as a teacher. This leads us to rep
|
||||
for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and
|
||||
models.*
|
||||
|
||||
Tips:
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Compared to ViT, DeiT models use a so-called distillation token to effectively learn from a teacher (which, in the
|
||||
DeiT paper, is a ResNet-like model). The distillation token is learned through backpropagation, by interacting with
|
||||
@ -73,8 +68,6 @@ Tips:
|
||||
*facebook/deit-base-patch16-384*. Note that one should use [`DeiTImageProcessor`] in order to
|
||||
prepare images for the model.
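
A minimal classification sketch with a distilled checkpoint, assuming `facebook/deit-base-distilled-patch16-224` and an example COCO image:

```python
import torch
import requests
from PIL import Image
from transformers import DeiTImageProcessor, DeiTForImageClassificationWithTeacher

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
model = DeiTForImageClassificationWithTeacher.from_pretrained("facebook/deit-base-distilled-patch16-224")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # averages the class and distillation token predictions
print(model.config.id2label[logits.argmax(-1).item()])
```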
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts).
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DeiT.
|
||||
@ -104,6 +97,9 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
[[autodoc]] DeiTImageProcessor
|
||||
- preprocess
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## DeiTModel
|
||||
|
||||
[[autodoc]] DeiTModel
|
||||
@ -124,6 +120,9 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
[[autodoc]] DeiTForImageClassificationWithTeacher
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFDeiTModel
|
||||
|
||||
[[autodoc]] TFDeiTModel
|
||||
@ -143,3 +142,6 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] TFDeiTForImageClassificationWithTeacher
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
@ -24,12 +24,10 @@ The abstract of the paper states the following:
|
||||
|
||||
*Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still much limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key in this method is a modality conversion module, named as DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than >28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over finetuned SOTA on human-written queries from the task of chart QA.*
|
||||
|
||||
## Model description
|
||||
|
||||
DePlot is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct).
|
||||
DePlot is a Visual Question Answering subset of `Pix2Struct` architecture. It renders the input question on the image and predicts the answer.
|
||||
|
||||
## Usage
|
||||
## Usage example
|
||||
|
||||
Currently one checkpoint is available for DePlot:
|
||||
|
||||
@ -59,4 +57,10 @@ from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup
|
||||
|
||||
optimizer = Adafactor(self.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05)
|
||||
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000)
|
||||
```
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
DePlot is a model trained using the `Pix2Struct` architecture. For API reference, see the [`Pix2Struct` documentation](pix2struct).
|
||||
|
||||
</Tip>
|
@ -26,10 +26,6 @@ The abstract from the paper is the following:
|
||||
|
||||
*Detection Transformer (DETR) directly transforms queries to unique objects by using one-to-one bipartite matching during training and enables end-to-end object detection. Recently, these models have surpassed traditional detectors on COCO with undeniable elegance. However, they differ from traditional detectors in multiple designs, including model architecture and training schedules, and thus the effectiveness of one-to-one matching is not fully understood. In this work, we conduct a strict comparison between the one-to-one Hungarian matching in DETRs and the one-to-many label assignments in traditional detectors with non-maximum supervision (NMS). Surprisingly, we observe one-to-many assignments with NMS consistently outperform standard one-to-one matching under the same setting, with a significant gain of up to 2.5 mAP. Our detector that trains Deformable-DETR with traditional IoU-based label assignment achieved 50.2 COCO mAP within 12 epochs (1x schedule) with ResNet50 backbone, outperforming all existing traditional or transformer-based detectors in this setting. On multiple datasets, schedules, and architectures, we consistently show bipartite matching is unnecessary for performant detection transformers. Furthermore, we attribute the success of detection transformers to their expressive transformer architecture.*
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use [`DetaImageProcessor`] to prepare images and optional targets for the model.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/deta_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
@ -51,20 +47,17 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] DetaConfig
|
||||
|
||||
|
||||
## DetaImageProcessor
|
||||
|
||||
[[autodoc]] DetaImageProcessor
|
||||
- preprocess
|
||||
- post_process_object_detection
|
||||
|
||||
|
||||
## DetaModel
|
||||
|
||||
[[autodoc]] DetaModel
|
||||
- forward
|
||||
|
||||
|
||||
## DetaForObjectDetection
|
||||
|
||||
[[autodoc]] DetaForObjectDetection
|
||||
|
@ -41,6 +41,8 @@ baselines.*
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/facebookresearch/detr).
|
||||
|
||||
## How DETR works
|
||||
|
||||
Here's a TLDR explaining how [`~transformers.DetrForObjectDetection`] works:
|
||||
|
||||
First, an image is sent through a pre-trained convolutional backbone (in the paper, the authors use
|
||||
@ -79,7 +81,7 @@ where one first trains a [`~transformers.DetrForObjectDetection`] model to detec
|
||||
the mask head for 25 epochs. Experimentally, these two approaches give similar results. Note that predicting boxes is
|
||||
required for the training to be possible, since the Hungarian matching is computed using distances between boxes.
|
||||
|
||||
Tips:
|
||||
## Usage tips
|
||||
|
||||
- DETR uses so-called **object queries** to detect objects in an image. The number of queries determines the maximum
|
||||
number of objects that can be detected in a single image, and is set to 100 by default (see parameter
|
||||
@ -165,14 +167,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
## DETR specific outputs
|
||||
|
||||
[[autodoc]] models.detr.modeling_detr.DetrModelOutput
|
||||
|
||||
[[autodoc]] models.detr.modeling_detr.DetrObjectDetectionOutput
|
||||
|
||||
[[autodoc]] models.detr.modeling_detr.DetrSegmentationOutput
|
||||
|
||||
## DetrConfig
|
||||
|
||||
[[autodoc]] DetrConfig
|
||||
@ -195,6 +189,14 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
- post_process_instance_segmentation
|
||||
- post_process_panoptic_segmentation
|
||||
|
||||
## DETR specific outputs
|
||||
|
||||
[[autodoc]] models.detr.modeling_detr.DetrModelOutput
|
||||
|
||||
[[autodoc]] models.detr.modeling_detr.DetrObjectDetectionOutput
|
||||
|
||||
[[autodoc]] models.detr.modeling_detr.DetrSegmentationOutput
|
||||
|
||||
## DetrModel
|
||||
|
||||
[[autodoc]] DetrModel
|
||||
|
@ -32,7 +32,9 @@ that leverage DialoGPT generate more relevant, contentful and context-consistent
|
||||
systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response
|
||||
generation and the development of more intelligent open-domain dialogue systems.*
|
||||
|
||||
Tips:
|
||||
The original code can be found [here](https://github.com/microsoft/DialoGPT).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- DialoGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
|
||||
than the left.
|
||||
@ -47,7 +49,8 @@ follow the OpenAI GPT-2 to model a multiturn dialogue session as a long text and
|
||||
modeling. We first concatenate all dialog turns within a dialogue session into a long text x_1,..., x_N (N is the
|
||||
sequence length), ended by the end-of-text token.* For more information please refer to the original paper.
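
A minimal multi-turn chat sketch following that scheme, assuming the `microsoft/DialoGPT-medium` checkpoint (each turn is appended to the history and terminated by the end-of-text token):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

chat_history_ids = None
for user_input in ["Does money buy happiness?", "What is the best way to buy happiness?"]:
    new_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")
    bot_input_ids = new_ids if chat_history_ids is None else torch.cat([chat_history_ids, new_ids], dim=-1)
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
    reply = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(reply)
```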
|
||||
|
||||
<Tip>
|
||||
|
||||
DialoGPT's architecture is based on the GPT2 model, so one can refer to [GPT2's documentation page](gpt2).
|
||||
DialoGPT's architecture is based on the GPT2 model; refer to [GPT2's documentation page](gpt2) for API reference and examples.
|
||||
|
||||
The original code can be found [here](https://github.com/microsoft/DialoGPT).
|
||||
</Tip>
|
||||
|
@ -44,17 +44,6 @@ and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) an
|
||||
It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU),
|
||||
and ranks second on Cityscapes (84.5 mIoU) (no extra data).*
|
||||
|
||||
Tips:
|
||||
- One can use the [`AutoImageProcessor`] API to prepare images for the model.
|
||||
- DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
|
||||
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.
|
||||
|
||||
Notes:
|
||||
- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention.
|
||||
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`.
|
||||
Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.
|
||||
- Patch size of 4 is only supported at the moment.
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dilated-neighborhood-attention-pattern.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
@ -65,6 +54,17 @@ Taken from the <a href="https://arxiv.org/abs/2209.15001">original paper</a>.</s
|
||||
This model was contributed by [Ali Hassani](https://huggingface.co/alihassanijr).
|
||||
The original code can be found [here](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer).
|
||||
|
||||
## Usage tips
|
||||
|
||||
DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
|
||||
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.
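
For example, a sketch assuming the `shi-labs/dinat-mini-in1k-224` checkpoint (NATTEN must be installed, as described in the notes below):

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DinatModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("shi-labs/dinat-mini-in1k-224")
model = DinatModel.from_pretrained("shi-labs/dinat-mini-in1k-224")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# feature maps in (batch, num_channels, height, width) layout, one per stage
for feature_map in outputs.reshaped_hidden_states:
    print(feature_map.shape)
```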
|
||||
|
||||
Notes:
|
||||
- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention.
|
||||
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`.
|
||||
Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.
|
||||
- Patch size of 4 is only supported at the moment.
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DiNAT.
|
||||
|
@ -22,14 +22,9 @@ The abstract from the paper is the following:
|
||||
|
||||
*The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.*
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use [`AutoImageProcessor`] class to prepare images for the model.
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/facebookresearch/dinov2).
|
||||
|
||||
|
||||
## Dinov2Config
|
||||
|
||||
[[autodoc]] Dinov2Config
|
||||
|
@ -51,7 +51,10 @@ distillation and cosine-distance losses. Our smaller, faster and lighter model i
|
||||
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
|
||||
study.*
|
||||
|
||||
Tips:
|
||||
This model was contributed by [victorsanh](https://huggingface.co/victorsanh). This model jax version was
|
||||
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- DistilBERT doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
|
||||
separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`).
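
For example, a minimal sketch with the `distilbert-base-uncased` checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# passing the two segments as a text pair inserts the [SEP] token automatically
encoded = tokenizer("Paris is the capital of France.", "It is famous for the Eiffel Tower.")
print(encoded.keys())  # input_ids and attention_mask only, no token_type_ids
```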
|
||||
@ -63,8 +66,6 @@ Tips:
|
||||
* predicting the masked tokens correctly (but no next-sentence objective)
|
||||
* a cosine similarity between the hidden states of the student and the teacher model
|
||||
|
||||
This model was contributed by [victorsanh](https://huggingface.co/victorsanh). This model jax version was
|
||||
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
|
||||
|
||||
## Resources
|
||||
|
||||
@ -144,6 +145,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] DistilBertTokenizerFast
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## DistilBertModel
|
||||
|
||||
[[autodoc]] DistilBertModel
|
||||
@ -174,6 +178,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] DistilBertForQuestionAnswering
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFDistilBertModel
|
||||
|
||||
[[autodoc]] TFDistilBertModel
|
||||
@ -204,6 +211,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] TFDistilBertForQuestionAnswering
|
||||
- call
|
||||
|
||||
</tf>
|
||||
<jax>
|
||||
|
||||
## FlaxDistilBertModel
|
||||
|
||||
[[autodoc]] FlaxDistilBertModel
|
||||
@ -233,3 +243,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] FlaxDistilBertForQuestionAnswering
|
||||
- __call__
|
||||
|
||||
</jax>
|
||||
</frameworkcontent>
|
||||
|
||||
|
||||
|
||||
|
||||
|
@ -37,6 +37,10 @@ alt="drawing" width="600"/>
|
||||
|
||||
<small> Summary of the approach. Taken from the [original paper](https://arxiv.org/abs/2203.02378). </small>
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/dit).
|
||||
|
||||
## Usage tips
|
||||
|
||||
One can directly use the weights of DiT with the AutoModel API:
|
||||
|
||||
```python
|
||||
@ -66,10 +70,6 @@ model = AutoModelForImageClassification.from_pretrained("microsoft/dit-base-fine
|
||||
This particular checkpoint was fine-tuned on [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/), an important benchmark for document image classification.
|
||||
A notebook that illustrates inference for document image classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DiT/Inference_with_DiT_(Document_Image_Transformer)_for_document_image_classification.ipynb).
|
||||
|
||||
As DiT's architecture is equivalent to that of BEiT, one can refer to [BEiT's documentation page](beit) for all tips, code examples and notebooks.
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/dit).
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DiT.
|
||||
@ -78,4 +78,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
- [`BeitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
<Tip>
|
||||
|
||||
As DiT's architecture is equivalent to that of BEiT, one can refer to [BEiT's documentation page](beit) for all tips, code examples and notebooks.
|
||||
</Tip>
|
||||
|
@ -34,14 +34,14 @@ alt="drawing" width="600"/>
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
|
||||
[here](https://github.com/clovaai/donut).
|
||||
|
||||
Tips:
|
||||
## Usage tips
|
||||
|
||||
- The quickest way to get started with Donut is by checking the [tutorial
|
||||
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut), which show how to use the model
|
||||
at inference time as well as fine-tuning on custom data.
|
||||
- Donut is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework.
|
||||
|
||||
## Inference
|
||||
## Inference examples
|
||||
|
||||
Donut's [`VisionEncoderDecoder`] model accepts images as input and makes use of
|
||||
[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
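
As a minimal sketch of that flow (the checkpoint name, task prompt and local image path below are assumptions picked for illustration; any Donut checkpoint with its matching prompt works the same way):

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# checkpoint and task prompt are assumptions for illustration
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# any document image; the path is a placeholder
image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# generation is conditioned on a task-specific prompt fed as decoder input ids
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
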
@ -43,7 +43,8 @@ benchmarks.*
|
||||
|
||||
This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The original code can be found [here](https://github.com/facebookresearch/DPR).
|
||||
|
||||
Tips:
|
||||
## Usage tips
|
||||
|
||||
- DPR consists of three models:
|
||||
|
||||
* Question encoder: encode questions as vectors
|
||||
@ -86,6 +87,9 @@ Tips:
|
||||
|
||||
[[autodoc]] models.dpr.modeling_dpr.DPRReaderOutput
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## DPRContextEncoder
|
||||
|
||||
[[autodoc]] DPRContextEncoder
|
||||
@ -101,6 +105,9 @@ Tips:
|
||||
[[autodoc]] DPRReader
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFDPRContextEncoder
|
||||
|
||||
[[autodoc]] TFDPRContextEncoder
|
||||
@ -115,3 +122,7 @@ Tips:
|
||||
|
||||
[[autodoc]] TFDPRReader
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
@ -56,6 +56,9 @@ The original code can be found [here](https://github.com/snap-research/Efficient
|
||||
[[autodoc]] EfficientFormerImageProcessor
|
||||
- preprocess
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## EfficientFormerModel
|
||||
|
||||
[[autodoc]] EfficientFormerModel
|
||||
@ -71,6 +74,9 @@ The original code can be found [here](https://github.com/snap-research/Efficient
|
||||
[[autodoc]] EfficientFormerForImageClassificationWithTeacher
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFEfficientFormerModel
|
||||
|
||||
[[autodoc]] TFEfficientFormerModel
|
||||
@ -85,3 +91,6 @@ The original code can be found [here](https://github.com/snap-research/Efficient
|
||||
|
||||
[[autodoc]] TFEfficientFormerForImageClassificationWithTeacher
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
@ -50,7 +50,9 @@ using 30x more compute) on the GLUE natural language understanding benchmark. Ou
|
||||
where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when
|
||||
using the same amount of compute.*
|
||||
|
||||
Tips:
|
||||
This model was contributed by [lysandre](https://huggingface.co/lysandre). The original code can be found [here](https://github.com/google-research/electra).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- ELECTRA is the pretraining approach, therefore there is nearly no changes done to the underlying model: BERT. The
|
||||
only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
|
||||
@ -66,9 +68,7 @@ Tips:
|
||||
[`ElectraForPreTraining`] model (the classification head will be randomly initialized as it
|
||||
doesn't exist in the generator).
|
||||
|
||||
This model was contributed by [lysandre](https://huggingface.co/lysandre). The original code can be found [here](https://github.com/google-research/electra).
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
@ -95,6 +95,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). The o
|
||||
|
||||
[[autodoc]] models.electra.modeling_tf_electra.TFElectraForPreTrainingOutput
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## ElectraModel
|
||||
|
||||
[[autodoc]] ElectraModel
|
||||
@ -135,6 +138,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). The o
|
||||
[[autodoc]] ElectraForQuestionAnswering
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFElectraModel
|
||||
|
||||
[[autodoc]] TFElectraModel
|
||||
@ -170,6 +176,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). The o
|
||||
[[autodoc]] TFElectraForQuestionAnswering
|
||||
- call
|
||||
|
||||
</tf>
|
||||
<jax>
|
||||
|
||||
## FlaxElectraModel
|
||||
|
||||
[[autodoc]] FlaxElectraModel
|
||||
@ -209,3 +218,6 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). The o
|
||||
|
||||
[[autodoc]] FlaxElectraForQuestionAnswering
|
||||
- __call__
|
||||
|
||||
</jax>
|
||||
</frameworkcontent>
@ -26,6 +26,9 @@ The abstract from the paper is the following:
|
||||
|
||||
This model was contributed by [Matthijs](https://huggingface.co/Matthijs), [Patrick Von Platen](https://huggingface.co/patrickvonplaten) and [Arthur Zucker](https://huggingface.co/ArthurZ).
|
||||
The original code can be found [here](https://github.com/facebookresearch/encodec).
|
||||
|
||||
## Usage example
|
||||
|
||||
Here is a quick example of how to encode and decode an audio sample using this model:
|
||||
|
||||
```python
|
||||
@ -45,7 +48,6 @@ Here is a quick example of how to encode and decode an audio using this model:
|
||||
>>> audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
|
||||
```
## EncodecConfig
|
||||
|
||||
[[autodoc]] EncodecConfig
|
||||
|
@ -149,20 +149,32 @@ were contributed by [ydshieh](https://github.com/ydshieh).
|
||||
|
||||
[[autodoc]] EncoderDecoderConfig
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## EncoderDecoderModel
|
||||
|
||||
[[autodoc]] EncoderDecoderModel
|
||||
- forward
|
||||
- from_encoder_decoder_pretrained
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFEncoderDecoderModel
|
||||
|
||||
[[autodoc]] TFEncoderDecoderModel
|
||||
- call
|
||||
- from_encoder_decoder_pretrained
|
||||
|
||||
</tf>
|
||||
<jax>
|
||||
|
||||
## FlaxEncoderDecoderModel
|
||||
|
||||
[[autodoc]] FlaxEncoderDecoderModel
|
||||
- __call__
|
||||
- from_encoder_decoder_pretrained
|
||||
|
||||
</jax>
|
||||
</frameworkcontent>
|
||||
|
@ -23,7 +23,7 @@ including [ERNIE1.0](https://arxiv.org/abs/1904.09223), [ERNIE2.0](https://ojs.a
|
||||
|
||||
These models are contributed by [nghuyong](https://huggingface.co/nghuyong) and the official code can be found in [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) (in PaddlePaddle).
|
||||
|
||||
### How to use
|
||||
### Usage example
|
||||
Take `ernie-1.0-base-zh` as an example:
|
||||
|
||||
```Python
|
||||
@ -32,7 +32,7 @@ tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-1.0-base-zh")
|
||||
model = AutoModel.from_pretrained("nghuyong/ernie-1.0-base-zh")
|
||||
```
|
||||
|
||||
### Supported Models
|
||||
### Model checkpoints
|
||||
|
||||
| Model Name | Language | Description |
|
||||
|:-------------------:|:--------:|:-------------------------------:|
|
||||
@ -51,7 +51,7 @@ You can find all the supported models from huggingface's model hub: [huggingface
|
||||
repo: [PaddleNLP](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers/ERNIE/contents.html)
|
||||
and [ERNIE](https://github.com/PaddlePaddle/ERNIE/blob/repro).
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
|
@ -25,18 +25,17 @@ Hao Tian, Hua Wu, Haifeng Wang.
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for lowresource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks.*
|
||||
|
||||
Tips:
|
||||
|
||||
1. Ernie-M is a BERT-like model so it is a stacked Transformer Encoder.
|
||||
2. Instead of using MaskedLM for pretraining (like BERT) the authors used two novel techniques: `Cross-attention Masked Language Modeling` and `Back-translation Masked Language Modeling`. For now these two LMHead objectives are not implemented here.
|
||||
3. It is a multilingual language model.
|
||||
4. Next Sentence Prediction was not used in pretraining process.
|
||||
|
||||
|
||||
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/ernie_m).
|
||||
|
||||
## Documentation resources
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Ernie-M is a BERT-like model so it is a stacked Transformer Encoder.
|
||||
- Instead of using MaskedLM for pretraining (like BERT) the authors used two novel techniques: `Cross-attention Masked Language Modeling` and `Back-translation Masked Language Modeling`. For now these two LMHead objectives are not implemented here.
|
||||
- It is a multilingual language model.
|
||||
- Next Sentence Prediction was not used in the pretraining process.
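
As a minimal sketch of using the encoder described above (the checkpoint name is an assumption for illustration):

```python
from transformers import AutoTokenizer, ErnieMModel

# checkpoint name is an assumption for illustration
tokenizer = AutoTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")
model = ErnieMModel.from_pretrained("susnato/ernie-m-base_pytorch")

# the same multilingual encoder handles text in different languages
inputs = tokenizer("ERNIE-M aligns multilingual representations.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```
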
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
|
@ -17,6 +17,7 @@ rendered properly in your Markdown viewer.
|
||||
# ESM
|
||||
|
||||
## Overview
|
||||
|
||||
This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental
|
||||
AI Research Team, providing the state-of-the-art ESMFold and ESM-2, and the previously released ESM-1b and ESM-1v.
|
||||
Transformer protein language models were introduced in the paper [Biological structure and function emerge from scaling
|
||||
@ -73,11 +74,6 @@ sequences with low perplexity that are well understood by the language model. ES
|
||||
order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic
|
||||
proteins in practical timescales.*
|
||||
|
||||
|
||||
Tips:
|
||||
|
||||
- ESM models are trained with a masked language modeling (MLM) objective.
|
||||
|
||||
The original code can be found [here](https://github.com/facebookresearch/esm) and
|
||||
was developed by the Fundamental AI Research team at Meta AI.
|
||||
ESM-1b, ESM-1v and ESM-2 were contributed to huggingface by [jasonliu](https://huggingface.co/jasonliu)
|
||||
@ -87,10 +83,12 @@ ESMFold was contributed to huggingface by [Matt](https://huggingface.co/Rocketkn
|
||||
[Sylvain](https://huggingface.co/sgugger), with a big thank you to Nikita Smetanin, Roshan Rao and Tom Sercu for their
|
||||
help throughout the process!
|
||||
|
||||
The HuggingFace port of ESMFold uses portions of the [openfold](https://github.com/aqlaboratory/openfold) library.
|
||||
The `openfold` library is licensed under the Apache License 2.0.
|
||||
## Usage tips
|
||||
|
||||
## Documentation resources
|
||||
- ESM models are trained with a masked language modeling (MLM) objective.
|
||||
- The HuggingFace port of ESMFold uses portions of the [openfold](https://github.com/aqlaboratory/openfold) library. The `openfold` library is licensed under the Apache License 2.0.
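
As a minimal sketch of the MLM objective mentioned above, one could mask a single residue and ask the model to recover it; the checkpoint name below is an assumption for illustration:

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

# checkpoint name is an assumption for illustration
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")

# mask a single residue in a protein sequence and predict it back
sequence = f"MKT{tokenizer.mask_token}IALSYIFCLVFA"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_positions = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```
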
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
@ -109,6 +107,8 @@ The `openfold` library is licensed under the Apache License 2.0.
|
||||
- create_token_type_ids_from_sequences
|
||||
- save_vocabulary
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## EsmModel
|
||||
|
||||
@ -135,6 +135,9 @@ The `openfold` library is licensed under the Apache License 2.0.
|
||||
[[autodoc]] EsmForProteinFolding
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFEsmModel
|
||||
|
||||
[[autodoc]] TFEsmModel
|
||||
@ -154,3 +157,6 @@ The `openfold` library is licensed under the Apache License 2.0.
|
||||
|
||||
[[autodoc]] TFEsmForTokenClassification
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
@ -48,6 +48,10 @@ Google has released the following variants:
|
||||
|
||||
- [google/flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl).
|
||||
|
||||
One can refer to [T5's documentation page](t5) for all tips, code examples and notebooks. As well as the FLAN-T5 model card for more details regarding training and evaluation of the model.
|
||||
|
||||
The original checkpoints can be found [here](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints).
|
||||
|
||||
<Tip>
|
||||
|
||||
Refer to [T5's documentation page](t5) for all API reference, code examples and notebooks. For more details regarding training and evaluation of the FLAN-T5, refer to the model card.
|
||||
|
||||
</Tip>
|
@ -21,7 +21,6 @@ rendered properly in your Markdown viewer.
|
||||
Flan-UL2 is an encoder decoder model based on the T5 architecture. It uses the same configuration as the [UL2](ul2) model released earlier last year.
|
||||
It was fine tuned using the "Flan" prompt tuning and dataset collection. Similar to `Flan-T5`, one can directly use FLAN-UL2 weights without finetuning the model:
|
||||
|
||||
|
||||
According to the original blog here are the notable improvements:
|
||||
|
||||
- The original UL2 model was only trained with receptive field of 512, which made it non-ideal for N-shot prompting where N is large.
|
||||
@ -29,9 +28,6 @@ According to the original blog here are the notable improvements:
|
||||
- The original UL2 model also had mode switch tokens that were rather mandatory to get good performance. However, they were a little cumbersome as this often requires some changes during inference or finetuning. In this update/change, we continue training UL2 20B for an additional 100k steps (with small batch) to forget “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.
|
||||
Google has released the following variants:
|
||||
|
||||
|
||||
One can refer to [T5's documentation page](t5) for all tips, code examples and notebooks. As well as the FLAN-T5 model card for more details regarding training and evaluation of the model.
|
||||
|
||||
The original checkpoints can be found [here](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints).
|
||||
|
||||
|
||||
@ -51,6 +47,8 @@ The model is pretty heavy (~40GB in half precision) so if you just want to run t
|
||||
['In a large skillet, brown the ground beef and onion over medium heat. Add the garlic']
|
||||
```
|
||||
|
||||
## Inference
|
||||
<Tip>
|
||||
|
||||
The inference protocol is exactly the same as any `T5` model, please have a look at the [T5's documentation page](t5) for more details.
|
||||
Refer to [T5's documentation page](t5) for API reference, tips, code examples and notebooks.
|
||||
|
||||
</Tip>
|
||||
|
@ -50,7 +50,7 @@ This model was contributed by [formiel](https://huggingface.co/formiel). The ori
|
||||
Tips:
|
||||
- Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
@ -66,6 +66,9 @@ Tips:
|
||||
|
||||
[[autodoc]] FlaubertTokenizer
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## FlaubertModel
|
||||
|
||||
[[autodoc]] FlaubertModel
|
||||
@ -101,6 +104,9 @@ Tips:
|
||||
[[autodoc]] FlaubertForQuestionAnswering
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFFlaubertModel
|
||||
|
||||
[[autodoc]] TFFlaubertModel
|
||||
@ -130,3 +136,9 @@ Tips:
|
||||
|
||||
[[autodoc]] TFFlaubertForQuestionAnsweringSimple
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
@ -33,10 +33,8 @@ at once -- a true vision and language foundation model should be good at vision
|
||||
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
|
||||
impressive performance on a wide range of 35 tasks spanning these target modalities.*
|
||||
|
||||
|
||||
This model was contributed by [aps](https://huggingface.co/aps). The original code can be found [here](https://github.com/facebookresearch/multimodal/tree/main/examples/flava).
|
||||
|
||||
|
||||
## FlavaConfig
|
||||
|
||||
[[autodoc]] FlavaConfig
|
||||
|
@ -37,15 +37,15 @@ sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finall
|
||||
and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models
|
||||
outperform Transformer counterparts.*
|
||||
|
||||
Tips on usage:
|
||||
|
||||
- The model was trained without an attention mask as it is based on Fourier Transform. The model was trained with
|
||||
maximum sequence length 512 which includes pad tokens. Hence, it is highly recommended to use the same maximum
|
||||
sequence length for fine-tuning and inference.
|
||||
|
||||
This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/google-research/google-research/tree/master/f_net).
|
||||
|
||||
## Documentation resources
|
||||
## Usage tips
|
||||
|
||||
The model was trained without an attention mask as it is based on Fourier Transform. The model was trained with
|
||||
maximum sequence length 512 which includes pad tokens. Hence, it is highly recommended to use the same maximum
|
||||
sequence length for fine-tuning and inference.
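
A minimal sketch of the recommendation above, padding to the 512-token pretraining length and passing no attention mask (the checkpoint name is an assumption for illustration):

```python
from transformers import AutoTokenizer, FNetModel

# checkpoint name is an assumption for illustration
tokenizer = AutoTokenizer.from_pretrained("google/fnet-base")
model = FNetModel.from_pretrained("google/fnet-base")

# pad to the 512-token pretraining length; FNet takes no attention mask
inputs = tokenizer("Hello, FNet!", padding="max_length", max_length=512, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], token_type_ids=inputs["token_type_ids"])
print(outputs.last_hidden_state.shape)  # (1, 512, hidden_size)
```
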
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
|
@ -27,14 +27,9 @@ The abstract from the paper is the following:
|
||||
*We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its
|
||||
content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretrained on ImageNet-22K in 224 resolution, it attains 86.5% and 87.3% top-1 accuracy when finetuned with resolution 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with 1\times outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with 3\times schedule (49.0 v.s. 48.5). For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 v.s. 49.7). Using large FocalNet and Mask2former, we achieve 58.5 mIoU for ADE20K semantic segmentation, and 57.9 PQ for COCO Panoptic Segmentation. Using huge FocalNet and DINO, we achieved 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing new SoTA on top of much larger attention-based models like Swinv2-G and BEIT-3.*
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use the [`AutoImageProcessor`] class to prepare images for the model.
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/microsoft/FocalNet).
|
||||
|
||||
|
||||
## FocalNetConfig
|
||||
|
||||
[[autodoc]] FocalNetConfig
|
||||
|
@ -16,9 +16,6 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# FSMT
|
||||
|
||||
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
|
||||
@stas00.
|
||||
|
||||
## Overview
|
||||
|
||||
FSMT (FairSeq MachineTranslation) models were introduced in [Facebook FAIR's WMT19 News Translation Task Submission](https://arxiv.org/abs/1907.06616) by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.
|
||||
|
@ -47,7 +47,9 @@ via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer o
|
||||
a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading
|
||||
comprehension.*
|
||||
|
||||
Tips:
|
||||
This model was contributed by [sgugger](https://huggingface.co/sgugger). The original code can be found [here](https://github.com/laiguokun/Funnel-Transformer).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Since Funnel Transformer uses pooling, the sequence length of the hidden states changes after each block of layers. This way, their length is divided by 2, which speeds up the computation of the next hidden states.
|
||||
The base model therefore has a final sequence length that is a quarter of the original one. This model can be used
|
||||
@ -62,9 +64,7 @@ Tips:
|
||||
[`FunnelBaseModel`], [`FunnelForSequenceClassification`] and
|
||||
[`FunnelForMultipleChoice`].
|
||||
|
||||
This model was contributed by [sgugger](https://huggingface.co/sgugger). The original code can be found [here](https://github.com/laiguokun/Funnel-Transformer).
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
@ -95,6 +95,9 @@ This model was contributed by [sgugger](https://huggingface.co/sgugger). The ori
|
||||
|
||||
[[autodoc]] models.funnel.modeling_tf_funnel.TFFunnelForPreTrainingOutput
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## FunnelBaseModel
|
||||
|
||||
[[autodoc]] FunnelBaseModel
|
||||
@ -135,6 +138,9 @@ This model was contributed by [sgugger](https://huggingface.co/sgugger). The ori
|
||||
[[autodoc]] FunnelForQuestionAnswering
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFFunnelBaseModel
|
||||
|
||||
[[autodoc]] TFFunnelBaseModel
|
||||
@ -174,3 +180,6 @@ This model was contributed by [sgugger](https://huggingface.co/sgugger). The ori
|
||||
|
||||
[[autodoc]] TFFunnelForQuestionAnswering
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
@ -27,11 +27,6 @@ The abstract from the paper is the following:
|
||||
|
||||
*In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.*
|
||||
|
||||
Tips:
|
||||
|
||||
- GIT is implemented in a very similar way to GPT-2, the only difference being that the model is also conditioned on `pixel_values`.
|
||||
- One can use [`GitProcessor`] to prepare images for the model, and the `generate` method for autoregressive generation.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/git_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
@ -40,6 +35,10 @@ alt="drawing" width="600"/>
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/microsoft/GenerativeImage2Text).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- GIT is implemented in a very similar way to GPT-2, the only difference being that the model is also conditioned on `pixel_values`.
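
As a minimal sketch of conditioning generation on `pixel_values` (the checkpoint name and test image URL are assumptions for illustration):

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModelForCausalLM

# checkpoint name is an assumption for illustration
processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

# any RGB image works; this URL is only a convenient test image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# the language model head is conditioned on pixel_values and decodes a caption
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
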
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GIT.
|
||||
|
@ -33,10 +33,6 @@ The abstract from the paper is the following:
|
||||
|
||||
*Depth estimation from a single image is an important task that can be applied to various fields in computer vision, and has grown rapidly with the development of convolutional neural networks. In this paper, we propose a novel structure and training strategy for monocular depth estimation to further improve the prediction accuracy of the network. We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder to generate an estimated depth map while considering local connectivity. By constructing connected paths between multi-scale local features and the global decoding stream with our proposed selective feature fusion module, the network can integrate both representations and recover fine details. In addition, the proposed decoder shows better performance than the previously proposed decoders, with considerably less computational complexity. Furthermore, we improve the depth-specific augmentation method by utilizing an important observation in depth estimation to enhance the model. Our network achieves state-of-the-art performance over the challenging depth dataset NYU Depth V2. Extensive experiments have been conducted to validate and show the effectiveness of the proposed approach. Finally, our model shows better generalisation ability and robustness than other comparative models.*
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use [`GLPNImageProcessor`] to prepare images for the model.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/glpn_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
|
@ -32,12 +32,8 @@ causal language modeling (CLM) objective utilizing the NeMo Megatron GPT impleme
|
||||
|
||||
This model was contributed by [AI Sweden](https://huggingface.co/AI-Sweden).
|
||||
|
||||
The implementation uses the [GPT2Model](https://huggingface.co/docs/transformers/model_doc/gpt2) coupled
|
||||
with our `GPTSw3Tokenizer`. This means that `AutoTokenizer` and `AutoModelForCausalLM` map to our tokenizer
|
||||
implementation and the corresponding GPT2 model implementation respectively.
|
||||
*Note that sentencepiece is required to use our tokenizer and can be installed with:* `pip install transformers[sentencepiece]` or `pip install sentencepiece`
|
||||
## Usage example
|
||||
|
||||
Example usage:
|
||||
```python
|
||||
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
@ -52,12 +48,21 @@ Example usage:
|
||||
Träd är fina för att de är färgstarka. Men ibland är det fint
|
||||
```
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
|
||||
<Tip>
|
||||
|
||||
The implementation uses the `GPT2Model` coupled with our `GPTSw3Tokenizer`. Refer to [GPT2Model documentation](gpt2)
|
||||
for API reference and examples.
|
||||
|
||||
Note that sentencepiece is required to use our tokenizer and can be installed with `pip install transformers[sentencepiece]` or `pip install sentencepiece`
|
||||
|
||||
</Tip>
|
||||
|
||||
## GPTSw3Tokenizer
|
||||
|
||||
[[autodoc]] GPTSw3Tokenizer
|
||||
|
@ -39,7 +39,13 @@ text. The diversity of the dataset causes this simple goal to contain naturally
|
||||
across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than
|
||||
10X the amount of data.*
|
||||
|
||||
Tips:
|
||||
[Write With Transformer](https://transformer.huggingface.co/doc/gpt2-large) is a webapp created and hosted by
|
||||
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
|
||||
different sizes: small, medium, large, xl and a distilled version of the small checkpoint: *distilgpt-2*.
|
||||
|
||||
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://openai.com/blog/better-language-models/).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
|
||||
the left.
|
||||
@ -54,12 +60,6 @@ Tips:
|
||||
- Enabling the *scale_attn_by_inverse_layer_idx* and *reorder_and_upcast_attn* flags will apply the training stability
|
||||
improvements from [Mistral](https://github.com/stanford-crfm/mistral/) (for PyTorch only).
|
||||
|
||||
[Write With Transformer](https://transformer.huggingface.co/doc/gpt2-large) is a webapp created and hosted by
|
||||
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
|
||||
different sizes: small, medium, large, xl and a distilled version of the small checkpoint: *distilgpt-2*.
|
||||
|
||||
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://openai.com/blog/better-language-models/).
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GPT2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
@ -100,6 +100,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## GPT2Model
|
||||
|
||||
[[autodoc]] GPT2Model
|
||||
@ -130,6 +133,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] GPT2ForTokenClassification
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFGPT2Model
|
||||
|
||||
[[autodoc]] TFGPT2Model
|
||||
@ -158,6 +164,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] TFGPT2Tokenizer
|
||||
|
||||
</tf>
|
||||
<jax>
|
||||
|
||||
## FlaxGPT2Model
|
||||
|
||||
[[autodoc]] FlaxGPT2Model
|
||||
@ -167,3 +176,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] FlaxGPT2LMHeadModel
|
||||
- __call__
|
||||
|
||||
</jax>
|
||||
</frameworkcontent>
@ -20,13 +20,13 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
The GPTBigCode model was proposed in [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by BigCode. The listed authors are: Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
|
||||
|
||||
The abstract from the paper is the following:uery
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at [this https URL.](https://huggingface.co/bigcode)*
|
||||
|
||||
The model is a an optimized [GPT2 model](https://huggingface.co/docs/transformers/model_doc/gpt2) with support for Multi-Query Attention.
|
||||
The model is an optimized [GPT2 model](https://huggingface.co/docs/transformers/model_doc/gpt2) with support for Multi-Query Attention.
|
||||
|
||||
## Technical details
|
||||
## Implementation details
|
||||
|
||||
The main differences compared to GPT2:
|
||||
- Added support for Multi-Query Attention.
|
||||
@ -85,7 +85,6 @@ Below is a expected speedup diagram that compares pure inference time between th
|
||||
|
||||
[[autodoc]] GPTBigCodeConfig
|
||||
|
||||
|
||||
## GPTBigCodeModel
|
||||
|
||||
[[autodoc]] GPTBigCodeModel
|
||||
@ -96,7 +95,6 @@ Below is a expected speedup diagram that compares pure inference time between th
|
||||
[[autodoc]] GPTBigCodeForCausalLM
|
||||
- forward
|
||||
|
||||
|
||||
## GPTBigCodeForSequenceClassification
|
||||
|
||||
[[autodoc]] GPTBigCodeForSequenceClassification
|
||||
|
@ -27,7 +27,7 @@ The architecture is similar to GPT2 except that GPT Neo uses local attention in
|
||||
|
||||
This model was contributed by [valhalla](https://huggingface.co/valhalla).
|
||||
|
||||
### Generation
|
||||
## Usage example
|
||||
|
||||
The `generate()` method can be used to generate text using the GPT Neo model.
|
||||
|
||||
@ -54,7 +54,7 @@ The `generate()` method can be used to generate text using GPT Neo model.
|
||||
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
|
||||
```
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
@ -63,6 +63,10 @@ The `generate()` method can be used to generate text using GPT Neo model.
|
||||
|
||||
[[autodoc]] GPTNeoConfig
|
||||
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## GPTNeoModel
|
||||
|
||||
[[autodoc]] GPTNeoModel
|
||||
@ -88,6 +92,9 @@ The `generate()` method can be used to generate text using GPT Neo model.
|
||||
[[autodoc]] GPTNeoForTokenClassification
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<jax>
|
||||
|
||||
## FlaxGPTNeoModel
|
||||
|
||||
[[autodoc]] FlaxGPTNeoModel
|
||||
@ -97,3 +104,8 @@ The `generate()` method can be used to generate text using GPT Neo model.
|
||||
|
||||
[[autodoc]] FlaxGPTNeoForCausalLM
|
||||
- __call__
|
||||
|
||||
</jax>
|
||||
</frameworkcontent>
@ -38,7 +38,7 @@ model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b").half().cud
|
||||
GPT-NeoX-20B also has a different tokenizer from the one used in GPT-J-6B and GPT-Neo. The new tokenizer allocates
|
||||
additional tokens to whitespace characters, making the model more suitable for certain tasks like code generation.
|
||||
|
||||
### Generation
|
||||
## Usage example
|
||||
|
||||
The `generate()` method can be used to generate text using the GPT-NeoX model.
|
||||
|
||||
@ -61,7 +61,7 @@ The `generate()` method can be used to generate text using GPT Neo model.
|
||||
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
|
||||
```
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
|
||||
|
@ -25,7 +25,7 @@ Following the recommendations from Google's research on [PaLM](https://ai.google
|
||||
|
||||
Development of the model was led by [Shinya Otani](https://github.com/SO0529), [Takayoshi Makabe](https://github.com/spider-man-tm), [Anuj Arora](https://github.com/Anuj040), and [Kyo Hattori](https://github.com/go5paopao) from [ABEJA, Inc.](https://www.abejainc.com/). For more information on this model-building activity, please refer [here (ja)](https://tech-blog.abeja.asia/entry/abeja-gpt-project-202207).
|
||||
|
||||
### Generation
|
||||
### Usage example
|
||||
|
||||
The `generate()` method can be used to generate text using the GPT NeoX Japanese model.
|
||||
|
||||
@ -51,7 +51,7 @@ The `generate()` method can be used to generate text using GPT NeoX Japanese mod
|
||||
人とAIが協調するためには、AIと人が共存し、AIを正しく理解する必要があります。
|
||||
```
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
|
||||
|
@ -23,7 +23,7 @@ causal language model trained on [the Pile](https://pile.eleuther.ai/) dataset.
|
||||
|
||||
This model was contributed by [Stella Biderman](https://huggingface.co/stellaathena).
|
||||
|
||||
Tips:
|
||||
## Usage tips
|
||||
|
||||
- To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size
|
||||
RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB
|
||||
@ -56,7 +56,7 @@ Tips:
|
||||
size, the tokenizer for [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) contains 143 extra tokens
|
||||
`<|extratoken_1|>... <|extratoken_143|>`, so the `vocab_size` of tokenizer also becomes 50400.
|
||||
|
||||
### Generation
|
||||
## Usage examples
|
||||
|
||||
The [`~generation.GenerationMixin.generate`] method can be used to generate text using GPT-J
|
||||
model.
|
||||
@ -138,6 +138,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] GPTJConfig
|
||||
- all
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## GPTJModel
|
||||
|
||||
[[autodoc]] GPTJModel
|
||||
@ -158,6 +161,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] GPTJForQuestionAnswering
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFGPTJModel
|
||||
|
||||
[[autodoc]] TFGPTJModel
|
||||
@ -178,6 +184,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] TFGPTJForQuestionAnswering
|
||||
- call
|
||||
|
||||
</tf>
|
||||
<jax>
|
||||
|
||||
## FlaxGPTJModel
|
||||
|
||||
[[autodoc]] FlaxGPTJModel
|
||||
@ -187,3 +196,5 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] FlaxGPTJForCausalLM
|
||||
- __call__
|
||||
</jax>
|
||||
</frameworkcontent>
|
||||
|
@ -24,7 +24,7 @@ GPTSAN is a Japanese language model using Switch Transformer. It has the same st
|
||||
in the T5 paper, and supports both Text Generation and Masked Language Modeling tasks. These basic tasks can similarly
be fine-tuned for translation or summarization.
|
||||
|
||||
### Generation
|
||||
### Usage example
|
||||
|
||||
The `generate()` method can be used to generate text using the GPTSAN-Japanese model.
|
||||
|
||||
@ -56,7 +56,7 @@ This length applies to the text entered in `prefix_text` for the tokenizer.
|
||||
The tokenizer returns the mask of the `Prefix` part of Prefix-LM as `token_type_ids`.
|
||||
The model treats the part where `token_type_ids` is 1 as a `Prefix` part, that is, the input can refer to both tokens before and after.
|
||||
|
||||
Tips:
|
||||
## Usage tips
|
||||
|
||||
Specifying the Prefix part is done with a mask passed to self-attention.
|
||||
When token_type_ids=None or all zero, it is equivalent to regular causal mask
|
||||
|
@ -23,26 +23,24 @@ The abstract from the paper is the following:
|
||||
|
||||
*The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture, and could attain excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and exhibit that with our ways of encoding the structural information of graphs, many popular GNN variants could be covered as the special cases of Graphormer.*
|
||||
|
||||
Tips:
|
||||
This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer).
|
||||
|
||||
## Usage tips
|
||||
|
||||
This model will not work well on large graphs (more than 100 nodes/edges), as it will make the memory explode.
|
||||
You can reduce the batch size, increase your RAM, or decrease the `UNREACHABLE_NODE_DISTANCE` parameter in algos_graphormer.pyx, but it will be hard to go above 700 nodes/edges.
|
||||
|
||||
This model does not use a tokenizer, but instead a special collator during training.
|
||||
|
||||
This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer).
|
||||
|
||||
## GraphormerConfig
|
||||
|
||||
[[autodoc]] GraphormerConfig
|
||||
|
||||
|
||||
## GraphormerModel
|
||||
|
||||
[[autodoc]] GraphormerModel
|
||||
- forward
|
||||
|
||||
|
||||
## GraphormerForGraphClassification
|
||||
|
||||
[[autodoc]] GraphormerForGraphClassification
|
||||
|
@ -25,13 +25,13 @@ The abstract from the paper is the following:
|
||||
|
||||
*Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision.*
|
||||
|
||||
Tips:
|
||||
|
||||
- You may specify `output_segmentation=True` in the forward of `GroupViTModel` to get the segmentation logits of input texts.
|
||||
|
||||
This model was contributed by [xvjiarui](https://huggingface.co/xvjiarui). The TensorFlow version was contributed by [ariG23498](https://huggingface.co/ariG23498) with the help of [Yih-Dar SHIEH](https://huggingface.co/ydshieh), [Amy Roberts](https://huggingface.co/amyeroberts), and [Joao Gante](https://huggingface.co/joaogante).
|
||||
The original code can be found [here](https://github.com/NVlabs/GroupViT).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- You may specify `output_segmentation=True` in the forward of `GroupViTModel` to get the segmentation logits of input texts.
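
A minimal sketch of requesting the segmentation logits described above (the checkpoint name and test image URL are assumptions for illustration):

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, GroupViTModel

# checkpoint name is an assumption for illustration
processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")
model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs, output_segmentation=True)

# one segmentation map per input text
print(outputs.segmentation_logits.shape)
```
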
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GroupViT.
|
||||
@ -52,6 +52,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] GroupViTVisionConfig
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## GroupViTModel
|
||||
|
||||
[[autodoc]] GroupViTModel
|
||||
@ -69,6 +72,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] GroupViTVisionModel
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFGroupViTModel
|
||||
|
||||
[[autodoc]] TFGroupViTModel
|
||||
@ -84,4 +90,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
## TFGroupViTVisionModel
|
||||
|
||||
[[autodoc]] TFGroupViTVisionModel
|
||||
- call
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
@ -37,7 +37,11 @@ which has the best average performance and obtains the best results for three ou
|
||||
extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based
|
||||
models.*
|
||||
|
||||
Examples of use:
|
||||
This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found
|
||||
[here](https://github.com/allegro/HerBERT).
|
||||
|
||||
|
||||
## Usage example
|
||||
|
||||
```python
|
||||
>>> from transformers import HerbertTokenizer, RobertaModel
|
||||
@ -56,9 +60,12 @@ Examples of use:
|
||||
>>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
|
||||
```
|
||||
|
||||
This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found
|
||||
[here](https://github.com/allegro/HerBERT).
|
||||
<Tip>
|
||||
|
||||
The HerBERT implementation is the same as `BERT` except for the tokenization method. Refer to the [BERT documentation](bert)
|
||||
for API reference and examples.
|
||||
|
||||
</Tip>
|
||||
|
||||
## HerbertTokenizer
|
||||
|
||||
|
@ -36,15 +36,15 @@ state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-lig
|
||||
10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER
|
||||
reduction on the more challenging dev-other and test-other evaluation subsets.*
|
||||
|
||||
Tips:
|
||||
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
|
||||
- Hubert model was fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
|
||||
using [`Wav2Vec2CTCTokenizer`].
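
A minimal sketch of the two tips above, feeding a raw waveform and greedy-decoding the CTC output (the checkpoint and dataset names are assumptions for illustration):

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, HubertForCTC

# checkpoint and dataset names are assumptions for illustration
processor = AutoProcessor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

# raw waveform in, CTC logits out, greedy-decoded with the Wav2Vec2 CTC tokenizer
ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```
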
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Audio classification task guide](../tasks/audio_classification)
|
||||
- [Automatic speech recognition task guide](../tasks/asr)
|
||||
@ -53,6 +53,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
|
||||
|
||||
[[autodoc]] HubertConfig
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
## HubertModel
|
||||
|
||||
[[autodoc]] HubertModel
|
||||
@ -68,6 +71,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
|
||||
[[autodoc]] HubertForSequenceClassification
|
||||
- forward
|
||||
|
||||
</pt>
|
||||
<tf>
|
||||
|
||||
## TFHubertModel
|
||||
|
||||
[[autodoc]] TFHubertModel
|
||||
@ -77,3 +83,6 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
|
||||
|
||||
[[autodoc]] TFHubertForCTC
|
||||
- call
|
||||
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
@ -40,7 +40,7 @@ been open-sourced.*
|
||||
|
||||
This model was contributed by [kssteven](https://huggingface.co/kssteven). The original code can be found [here](https://github.com/kssteven418/I-BERT).
|
||||
|
||||
## Documentation resources
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
|
@ -31,9 +31,9 @@ This model was contributed by [HuggingFaceM4](https://huggingface.co/HuggingFace
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
Idefics modeling code in Transformers is for finetuning and inferencing the pre-trained Idefics models.
|
||||
IDEFICS modeling code in Transformers is for finetuning and inferencing the pre-trained IDEFICS models.
|
||||
|
||||
To train a new Idefics model from scratch use the m4 codebase (a link will be provided once it's made public)
|
||||
To train a new IDEFICS model from scratch use the m4 codebase (a link will be provided once it's made public)
|
||||
|
||||
</Tip>
|
||||
|
||||
|
@ -40,7 +40,7 @@ alt="drawing" width="600"/>
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr), based on [this issue](https://github.com/openai/image-gpt/issues/7). The original code can be found
|
||||
[here](https://github.com/openai/image-gpt).
|
||||
|
||||
Tips:
|
||||
## Usage tips
|
||||
|
||||
- ImageGPT is almost exactly the same as [GPT-2](gpt2), with the exception that a different activation
|
||||
function is used (namely "quick gelu"), and the layer normalization layers don't mean center the inputs. ImageGPT
|
||||
@ -92,7 +92,6 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
## ImageGPTFeatureExtractor
|
||||
|
||||
[[autodoc]] ImageGPTFeatureExtractor
|
||||
|
||||
- __call__
|
||||
|
||||
## ImageGPTImageProcessor
|
||||
@ -103,17 +102,14 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
## ImageGPTModel
|
||||
|
||||
[[autodoc]] ImageGPTModel
|
||||
|
||||
- forward
|
||||
|
||||
## ImageGPTForCausalImageModeling
|
||||
|
||||
[[autodoc]] ImageGPTForCausalImageModeling
|
||||
|
||||
- forward
|
||||
|
||||
## ImageGPTForImageClassification
|
||||
|
||||
[[autodoc]] ImageGPTForImageClassification
|
||||
|
||||
- forward
|
||||
|
@ -39,13 +39,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] InformerConfig
|
||||
|
||||
|
||||
## InformerModel
|
||||
|
||||
[[autodoc]] InformerModel
|
||||
- forward
|
||||
|
||||
|
||||
## InformerForPrediction
|
||||
|
||||
[[autodoc]] InformerForPrediction
|
||||
|
@ -21,10 +21,6 @@ The abstract from the paper is the following:

*General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models.*

Tips:

- InstructBLIP uses the same architecture as [BLIP-2](blip2) with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/instructblip_architecture.jpg"
alt="drawing" width="600"/>

@ -33,6 +29,9 @@ alt="drawing" width="600"/>

This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip).

## Usage tips

InstructBLIP uses the same architecture as [BLIP-2](blip2) with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.
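A minimal generation sketch of that instruction-aware flow, assuming the Vicuna-7B checkpoint and a local image path (both are placeholders you would swap for your own):

```python
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")

image = Image.open("example.jpg").convert("RGB")  # placeholder image
prompt = "What is unusual about this image?"

# The processor prepares inputs for both the Q-Former (which now also sees the
# instruction) and the language model.
inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```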
## InstructBlipConfig

@ -32,7 +32,11 @@ The metadata such as *artist, genre and timing* are passed to each prior, in the



Tips:
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/openai/jukebox).

## Usage tips

- This model only supports inference. This is for a few reasons, mostly because it requires a crazy amount of memory to train. Feel free to open a PR and add what's missing to have a full integration with the Hugging Face Trainer!
- This model is very slow, and takes 8h to generate a minute-long audio using the 5b top prior on a V100 GPU. In order to automatically handle the device on which the model should execute, use `accelerate` (see the sketch below).
- Contrary to the paper, the order of the priors goes from `0` to `1` as it felt more intuitive: we sample starting from `0`.
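A minimal loading sketch for the `accelerate` tip above; the 1b-lyrics checkpoint is used because it is the smallest released prior stack, and whether `device_map="auto"` is supported for this particular architecture is an assumption here:

```python
import torch
from transformers import JukeboxModel, JukeboxTokenizer

# device_map="auto" relies on `accelerate` to place the (very large) priors
# across the available GPUs/CPU; support for this model is assumed, not verified.
model = JukeboxModel.from_pretrained(
    "openai/jukebox-1b-lyrics", torch_dtype=torch.float16, device_map="auto"
).eval()
tokenizer = JukeboxTokenizer.from_pretrained("openai/jukebox-1b-lyrics")
```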
@ -67,14 +71,12 @@ The original code can be found [here](https://github.com/openai/jukebox).
- upsample
- _sample

## JukeboxPrior

[[autodoc]] JukeboxPrior
- sample
- forward

## JukeboxVQVAE

[[autodoc]] JukeboxVQVAE

@ -46,7 +46,7 @@ document-level pretraining. It achieves new state-of-the-art results in several
understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
(from 93.07 to 94.42).*

Tips:
## Usage tips

- In addition to *input_ids*, [`~transformers.LayoutLMModel.forward`] also expects the input `bbox`, which are
the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
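Each bounding box is expected in `(x0, y0, x1, y1)` format and normalized to a 0-1000 scale; a minimal normalization sketch, assuming pixel-coordinate boxes coming out of the OCR step (the helper name is illustrative):

```python
def normalize_bbox(bbox, width, height):
    # bbox = (x0, y0, x1, y1) in pixels; width/height are the page dimensions.
    return [
        int(1000 * bbox[0] / width),
        int(1000 * bbox[1] / height),
        int(1000 * bbox[2] / width),
        int(1000 * bbox[3] / height),
    ]
```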
@ -123,6 +123,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

[[autodoc]] LayoutLMTokenizerFast

<frameworkcontent>
<pt>

## LayoutLMModel

[[autodoc]] LayoutLMModel

@ -143,6 +146,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

[[autodoc]] LayoutLMForQuestionAnswering

</pt>
<tf>

## TFLayoutLMModel

[[autodoc]] TFLayoutLMModel

@ -162,3 +168,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

## TFLayoutLMForQuestionAnswering

[[autodoc]] TFLayoutLMForQuestionAnswering

</tf>
</frameworkcontent>

@ -56,7 +56,7 @@ python -m pip install torchvision tesseract
```
(If you are developing for LayoutLMv2, note that passing the doctests also requires the installation of these packages.)

Tips:
## Usage tips

- The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during
pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
@ -26,16 +26,6 @@ The abstract from the paper is the following:

*Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*

Tips:

- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
- images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
- text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/layoutlmv3_architecture.png"
alt="drawing" width="600"/>

@ -43,6 +33,14 @@ alt="drawing" width="600"/>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [chriskoo](https://huggingface.co/chriskoo), [tokec](https://huggingface.co/tokec), and [lre](https://huggingface.co/lre). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/layoutlmv3).

## Usage tips

- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
- images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
- text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor (a minimal sketch follows below).
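A minimal sketch of that processor-centric flow; the base checkpoint and the image path are placeholders, and the built-in OCR step assumes Tesseract/pytesseract is installed:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")
model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")

image = Image.open("document.png").convert("RGB")  # placeholder document scan
# The processor resizes/normalizes the image (RGB), runs OCR by default,
# applies BPE and builds the `bbox` tensor, so its output feeds the model directly.
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
```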
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

@ -53,6 +51,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2

</Tip>

- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).

<PipelineTag pipeline="text-classification"/>

- [`LayoutLMv2ForSequenceClassification`] is supported by this [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb).

@ -103,6 +104,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2

[[autodoc]] LayoutLMv3Processor
- __call__

<frameworkcontent>
<pt>

## LayoutLMv3Model

[[autodoc]] LayoutLMv3Model

@ -123,6 +127,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2

[[autodoc]] LayoutLMv3ForQuestionAnswering
- forward

</pt>
<tf>

## TFLayoutLMv3Model

[[autodoc]] TFLayoutLMv3Model

@ -142,3 +149,6 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2

[[autodoc]] TFLayoutLMv3ForQuestionAnswering
- call

</tf>
</frameworkcontent>
@ -33,6 +33,10 @@ introduce a multilingual form understanding benchmark dataset named XFUN, which
for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA
cross-lingual pre-trained models on the XFUN dataset.*

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).

## Usage tips and examples

One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:

```python
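# A minimal sketch: LayoutXLM weights load directly into the LayoutLMv2 model
# class; the checkpoint name below is assumed to be the released base model.
from transformers import LayoutLMv2Model

model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base")
```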
@ -56,10 +60,10 @@ Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally appl
[`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all
data for the model.

<Tip>

As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks.

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).

</Tip>

## LayoutXLMTokenizer

@ -35,7 +35,7 @@ WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED),
long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization
dataset.*

Tips:
## Usage tips

- [`LEDForConditionalGeneration`] is an extension of
[`BartForConditionalGeneration`] exchanging the traditional *self-attention* layer with
@ -52,15 +52,15 @@ Tips:
errors. This can be done by executing `model.gradient_checkpointing_enable()`.
Moreover, the `use_cache=False`
flag can be used to disable the caching mechanism to save memory (see the sketch after these tips).
- A notebook showing how to evaluate LED, can be accessed [here](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing).
- A notebook showing how to fine-tune LED, can be accessed [here](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing).
- LED is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
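A minimal sketch of the memory-saving switches from the tips above; the base LED checkpoint is used, and the toy input stands in for a real long document:

```python
from transformers import AutoTokenizer, LEDForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

model.gradient_checkpointing_enable()  # trade compute for memory during training

batch = tokenizer("a very long document ...", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"], use_cache=False)  # no cache -> less memory
```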
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).

## Documentation resources
## Resources

- [A notebook showing how to evaluate LED](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing).
- [A notebook showing how to fine-tune LED](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing).
- [Text classification task guide](../tasks/sequence_classification)
- [Question answering task guide](../tasks/question_answering)
- [Translation task guide](../tasks/translation)

@ -100,6 +100,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv

[[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput

<frameworkcontent>
<pt>

## LEDModel

[[autodoc]] LEDModel

@ -120,6 +123,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv

[[autodoc]] LEDForQuestionAnswering
- forward

</pt>
<tf>

## TFLEDModel

[[autodoc]] TFLEDModel

@ -129,3 +135,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv

[[autodoc]] TFLEDForConditionalGeneration
- call

</tf>
</frameworkcontent>
@ -38,7 +38,9 @@ alt="drawing" width="600"/>

<small> LeViT Architecture. Taken from the <a href="https://arxiv.org/abs/2104.01136">original paper</a>.</small>

Tips:
This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).

## Usage tips

- Compared to ViT, LeViT models use an additional distillation head to effectively learn from a teacher (which, in the LeViT paper, is a ResNet-like model). The distillation head is learned through backpropagation under supervision of a ResNet-like model. They also draw inspiration from convolutional neural networks to use activation maps with decreasing resolutions to increase efficiency.
- There are 2 ways to fine-tune distilled models, either (1) in a classic way, by only placing a prediction head on top
@ -63,8 +65,6 @@ Tips:
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer)
(you can just replace [`ViTFeatureExtractor`] by [`LevitImageProcessor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`], as in the sketch below).
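A minimal inference sketch of that drop-in replacement; the 128S checkpoint and the image path are placeholders:

```python
from PIL import Image
from transformers import LevitImageProcessor, LevitForImageClassificationWithTeacher

processor = LevitImageProcessor.from_pretrained("facebook/levit-128S")
model = LevitForImageClassificationWithTeacher.from_pretrained("facebook/levit-128S")

image = Image.open("cat.jpg").convert("RGB")  # placeholder image
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits  # average of the class and distillation heads
print(model.config.id2label[logits.argmax(-1).item()])
```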
This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LeViT.

@ -90,7 +90,6 @@ If you're interested in submitting a resource to be included here, please feel f

[[autodoc]] LevitImageProcessor
- preprocess

## LevitModel

[[autodoc]] LevitModel