Updated BERTweet model card. (#37981)

* Updated BERTweet model card.

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* updated toctree (EN).

* Updated BERTweet model card.

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* updated toctree (EN).

* Updated BERTweet model card.

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bertweet.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* updated toctree (EN).

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
This commit is contained in:
RogerSinghChugh 2025-05-28 00:21:22 +05:30 committed by GitHub
parent b73faef52f
commit 587c1b0ed1
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
2 changed files with 58 additions and 36 deletions

View File

@ -386,7 +386,7 @@
- local: model_doc/bert-japanese - local: model_doc/bert-japanese
title: BertJapanese title: BertJapanese
- local: model_doc/bertweet - local: model_doc/bertweet
title: Bertweet title: BERTweet
- local: model_doc/big_bird - local: model_doc/big_bird
title: BigBird title: BigBird
- local: model_doc/bigbird_pegasus - local: model_doc/bigbird_pegasus

View File

@ -16,6 +16,7 @@ rendered properly in your Markdown viewer.
# BERTweet # BERTweet
<div style="float: right;">
<div class="flex flex-wrap space-x-1"> <div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white"> <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white"> <img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
@ -23,53 +24,74 @@ rendered properly in your Markdown viewer.
"> ">
</div> </div>
## Overview ## BERTweet
The BERTweet model was proposed in [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen. [BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but its pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.
The abstract from the paper is the following:
*We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.
the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
Part-of-speech tagging, Named-entity recognition and text classification.*
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet). > [!TIP]
> Refer to the [BERT](./bert) docs for more examples of how to apply BERTweet to different language tasks.
## Usage example The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
```python <hfoptions id="usage">
>>> import torch <hfoption id="Pipeline">
>>> from transformers import AutoModel, AutoTokenizer
>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base") ```py
import torch
from transformers import pipeline
>>> # For transformers v4.x+: pipeline = pipeline(
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False) task="fill-mask",
model="vinai/bertweet-base",
torch_dtype=torch.float16,
device=0
)
pipeline("Plants create <mask> through a process known as photosynthesis.")
```
</hfoption>
<hfoption id="AutoModel">
>>> # For transformers v3.x: ```py
>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base") import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> # INPUT TWEET IS ALREADY NORMALIZED! tokenizer = AutoTokenizer.from_pretrained(
>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:" "vinai/bertweet-base",
)
model = AutoModelForMaskedLM.from_pretrained(
"vinai/bertweet-base",
torch_dtype=torch.float16,
device_map="auto"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")
>>> input_ids = torch.tensor([tokenizer.encode(line)]) with torch.no_grad():
outputs = model(**inputs)
predictions = outputs.logits
>>> with torch.no_grad(): masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
... features = bertweet(input_ids) # Models outputs are now tuples predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
>>> # With TensorFlow 2.0+: print(f"The predicted token is: {predicted_token}")
>>> # from transformers import TFAutoModel
>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
``` ```
<Tip> </hfoption>
<hfoption id="transformers CLI">
This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for ```bash
API reference information. echo -e "Plants create <mask> through a process known as photosynthesis." | transformers-cli run --task fill-mask --model vinai/bertweet-base --device 0
```
</Tip> </hfoption>
</hfoptions>
## Notes
- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because its preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
- Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.
## BertweetTokenizer ## BertweetTokenizer