mirror of https://github.com/huggingface/transformers.git (synced 2025-07-03 21:00:08 +06:00)
Updated BERTweet model card. (#37981)
* Updated BERTweet model card.
* Update docs/source/en/model_doc/bertweet.md
* updated toctree (EN).

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent b73faef52f
commit 587c1b0ed1
@@ -386,7 +386,7 @@
 - local: model_doc/bert-japanese
   title: BertJapanese
 - local: model_doc/bertweet
-  title: Bertweet
+  title: BERTweet
 - local: model_doc/big_bird
   title: BigBird
 - local: model_doc/bigbird_pegasus
@@ -16,6 +16,7 @@ rendered properly in your Markdown viewer.

 # BERTweet

+<div style="float: right;">
 <div class="flex flex-wrap space-x-1">
 <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
 <img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
@@ -23,53 +24,74 @@ rendered properly in your Markdown viewer.
 ">
 </div>

-## Overview
+## BERTweet

-The BERTweet model was proposed in [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen.
+[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it’s pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.

-The abstract from the paper is the following:

-*We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having
-the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
-al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
-2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
-Part-of-speech tagging, Named-entity recognition and text classification.*
+You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.

-This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet).
+> [!TIP]
+> Refer to the [BERT](./bert) docs for more examples of how to apply BERTweet to different language tasks.

-## Usage example
+The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.

-```python
->>> import torch
->>> from transformers import AutoModel, AutoTokenizer
+<hfoptions id="usage">
+<hfoption id="Pipeline">

->>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
+```py
+import torch
+from transformers import pipeline

->>> # For transformers v4.x+:
->>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
+pipeline = pipeline(
+    task="fill-mask",
+    model="vinai/bertweet-base",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline("Plants create <mask> through a process known as photosynthesis.")
+```
+</hfoption>
+<hfoption id="AutoModel">

->>> # For transformers v3.x:
->>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
+```py
+import torch
+from transformers import AutoModelForMaskedLM, AutoTokenizer

->>> # INPUT TWEET IS ALREADY NORMALIZED!
->>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
+tokenizer = AutoTokenizer.from_pretrained(
+    "vinai/bertweet-base",
+)
+model = AutoModelForMaskedLM.from_pretrained(
+    "vinai/bertweet-base",
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")

->>> input_ids = torch.tensor([tokenizer.encode(line)])
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits

->>> with torch.no_grad():
-...     features = bertweet(input_ids)  # Models outputs are now tuples
+masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)

->>> # With TensorFlow 2.0+:
->>> # from transformers import TFAutoModel
->>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
+print(f"The predicted token is: {predicted_token}")
 ```

-<Tip>
+</hfoption>
+<hfoption id="transformers CLI">

-This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for
-API reference information.
+```bash
+echo -e "Plants create <mask> through a process known as photosynthesis." | transformers-cli run --task fill-mask --model vinai/bertweet-base --device 0
+```

-</Tip>
+</hfoption>
+</hfoptions>

+## Notes
+- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it’s preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
+- Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.

 ## BertweetTokenizer
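The right-padding note added to the model card in this commit can be illustrated without loading the model. The sketch below is a simplified, pure-Python stand-in for what right-side padding to a fixed length produces. The helper `pad_right` and the token ids are hypothetical, and the default `pad_id=1` is an assumption for illustration (the real value should be read from `tokenizer.pad_token_id`).

```python
def pad_right(token_ids, max_length, pad_id=1):
    """Pad a list of token ids on the right and build the matching attention mask.

    pad_id=1 is an assumption for illustration; read the real value from
    tokenizer.pad_token_id.
    """
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(token_ids)
    # Real tokens keep positions 0..n-1; padding only occupies the tail.
    input_ids = list(token_ids) + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids standing in for an encoded tweet.
ids, mask = pad_right([0, 713, 16, 2], max_length=8)
print(ids)   # [0, 713, 16, 2, 1, 1, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Because the real tokens stay at the start of the sequence, each one receives the same absolute position index it would have had without padding. With the actual tokenizer, the equivalent call would pass `padding="max_length"` together with a `max_length` value.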
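The usage example removed by this commit fed the model an already normalized tweet, with `HTTPURL` and `@USER` standing in for links and mentions. The sketch below approximates that step with two regular expressions. `normalize_tweet` and both patterns are hypothetical simplifications, not the library's own routine, and emoji handling (which relies on the `emoji` package) is left out. `BertweetTokenizer` has a `normalization` argument intended for this purpose; check its documentation before relying on it.

```python
import re

# Simplified patterns, assumed for illustration only.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def normalize_tweet(text):
    """Replace URLs and user mentions with BERTweet-style placeholders."""
    # Replace URLs first so an "@" inside a link is not treated as a mention.
    text = URL_PATTERN.sub("HTTPURL", text)
    text = MENTION_PATTERN.sub("@USER", text)
    # Collapse any runs of whitespace in the result.
    return " ".join(text.split())


raw = "DHEC confirms two cases https://example.com/news via @some_account"
print(normalize_tweet(raw))
# DHEC confirms two cases HTTPURL via @USER
```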