Updated BERTweet model card. (#37981)

* Updated BERTweet model card.
* Update docs/source/en/model_doc/bertweet.md
* updated toctree (EN).

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-07-03 21:00:08 +06:00 · 2025-05-28 00:21:22 +05:30 · 2025-05-28 00:21:22 +05:30 · 587c1b0ed1
commit 587c1b0ed1
parent b73faef52f
2 changed files with 58 additions and 36 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -386,7 +386,7 @@
      - local: model_doc/bert-japanese
        title: BertJapanese
      - local: model_doc/bertweet
-        title: Bertweet
+        title: BERTweet
      - local: model_doc/big_bird
        title: BigBird
      - local: model_doc/bigbird_pegasus
--- a/docs/source/en/model_doc/bertweet.md
+++ b/docs/source/en/model_doc/bertweet.md
@@ -16,6 +16,7 @@ rendered properly in your Markdown viewer.
 # BERTweet
 <div style="float: right;">
    <div class="flex flex-wrap space-x-1">
    <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    <img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
@@ -23,53 +24,74 @@ rendered properly in your Markdown viewer.
    ">
 </div>
-## Overview
-The BERTweet model was proposed in [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen.
-The abstract from the paper is the following:
-*We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having
-the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
-al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
-2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
-Part-of-speech tagging, Named-entity recognition and text classification.*
-This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet).
-## Usage example
-```python
->>> import torch
->>> from transformers import AutoModel, AutoTokenizer
->>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
->>> # For transformers v4.x+:
->>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
->>> # For transformers v3.x:
->>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
->>> # INPUT TWEET IS ALREADY NORMALIZED!
->>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
->>> input_ids = torch.tensor([tokenizer.encode(line)])
->>> with torch.no_grad():
-...     features = bertweet(input_ids)  # Models outputs are now tuples
->>> # With TensorFlow 2.0+:
->>> # from transformers import TFAutoModel
->>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
-```
-<Tip>
-This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for
-API reference information.
-</Tip>
+## BERTweet
+[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it’s pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.
+You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.
+> [!TIP]
+> Refer to the [BERT](./bert) docs for more examples of how to apply BERTweet to different language tasks.
+The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+```py
+import torch
+from transformers import pipeline
+pipeline = pipeline(
+    task="fill-mask",
+    model="vinai/bertweet-base",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline("Plants create <mask> through a process known as photosynthesis.")
+```
+</hfoption>
+<hfoption id="AutoModel">
+```py
+import torch
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(
+   "vinai/bertweet-base",
+)
+model = AutoModelForMaskedLM.from_pretrained(
+    "vinai/bertweet-base",
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
+masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+print(f"The predicted token is: {predicted_token}")
+```
+</hfoption>
+<hfoption id="transformers CLI">
+```bash
+echo -e "Plants create <mask> through a process known as photosynthesis." | transformers-cli run --task fill-mask --model vinai/bertweet-base --device 0
+```
+</hfoption>
+</hfoptions>
+## Notes
+- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it’s preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
+- Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.
 ## BertweetTokenizer
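
The two notes added under `## Notes` can be illustrated without downloading a checkpoint. The following is a minimal pure-Python sketch, not the `BertweetTokenizer` implementation: `normalize_tweet`, `pad_right`, and the `PAD_ID` value are hypothetical names and values chosen for illustration. It assumes that normalization means replacing URLs and user mentions with the `HTTPURL` and `@USER` placeholders seen in the removed usage example, and that right padding means appending pad ids after the real tokens.

```python
# Illustrative sketch only; names and values here are assumptions,
# not part of the transformers API.
URL_TOKEN = "HTTPURL"   # placeholder seen in the normalized example Tweet
USER_TOKEN = "@USER"    # placeholder seen in the normalized example Tweet
PAD_ID = 1              # assumed padding id, for illustration only


def normalize_tweet(text: str) -> str:
    """Replace URLs and user mentions, one whitespace-separated token at a time."""
    tokens = []
    for token in text.split():
        lowered = token.lower()
        if lowered.startswith(("http://", "https://", "www.")):
            tokens.append(URL_TOKEN)
        elif token.startswith("@") and len(token) > 1:
            tokens.append(USER_TOKEN)
        else:
            tokens.append(token)
    return " ".join(tokens)


def pad_right(batch, max_length, pad_id=PAD_ID):
    """Truncate to max_length, then append pad ids so real tokens keep positions 0..n-1."""
    padded, masks = [], []
    for ids in batch:
        ids = ids[:max_length]
        n_pad = max_length - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks


if __name__ == "__main__":
    print(normalize_tweet("DHEC confirms https://example.com via @scdhec"))
    # prints: DHEC confirms HTTPURL via @USER
    ids, mask = pad_right([[0, 10, 11, 2], [0, 12, 2]], max_length=5)
    print(ids)   # prints: [[0, 10, 11, 2, 1], [0, 12, 2, 1, 1]]
    print(mask)  # prints: [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
```

In practice the tokenizer handles both steps. The sketch only shows why padding goes on the right: absolute position embeddings index real tokens from position 0, so appending pad ids leaves those positions unchanged.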