mirror of
https://github.com/huggingface/transformers.git
synced 2025-07-31 18:22:34 +06:00
[model_cards] sarahlintang/IndoBERT (#7748)
* Create README.md * Update model_cards/sarahlintang/IndoBERT/README.md Co-authored-by: Julien Chaumond <chaumond@gmail.com>
parent
ba654270b3
commit
3fdbeba83c
43
model_cards/sarahlintang/IndoBERT/README.md
Normal file
@ -0,0 +1,43 @@
---
language: id
datasets:
- oscar
---

# IndoBERT (Indonesian BERT Model)

## Model description

IndoBERT is a pre-trained language model for the Indonesian language, based on the BERT architecture.

This is the base, uncased version, which uses the bert-base config.

## Intended uses & limitations

#### How to use

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the pre-trained model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")

# Encode a sentence into token IDs
tokenizer.encode("hai aku mau makan.")
# [2, 8078, 1785, 2318, 1946, 18, 4]
```

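The ID list returned by `encode` in the example can be read as follows. This is a minimal sketch in plain Python that only inspects the output shown in this card. It assumes, as is standard for BERT tokenizers, that the first and last IDs (2 and 4 here) are the special `[CLS]` and `[SEP]` tokens added by default; the card does not state this explicitly. The helper name `strip_special_ids` is hypothetical.

```python
# Token IDs copied from the example output in this card.
encoded = [2, 8078, 1785, 2318, 1946, 18, 4]


def strip_special_ids(ids):
    """Drop the first and last IDs, assumed to be [CLS] and [SEP]."""
    return ids[1:-1]


content_ids = strip_special_ids(encoded)
print(content_ids)       # IDs for the sentence itself, without special tokens
print(len(content_ids))  # number of content tokens
```

With the real tokenizer, passing `add_special_tokens=False` to `tokenizer.encode` should return the content IDs directly, without any slicing.
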
## Training data

This model was pre-trained on 16 GB of raw text (~2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).

The model uses the same configuration as bert-base, with a vocabulary size of 32,000.

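To give a sense of what the 32,000-entry vocabulary means for model size, the sketch below computes the size of the token-embedding matrix. It assumes the standard bert-base hidden size of 768, which the card implies through "bert-base config" but does not state. The figure of 30,522 is the vocabulary size of Google's original English bert-base-uncased, included here for reference only.

```python
HIDDEN_SIZE = 768            # assumed: standard bert-base hidden size
INDOBERT_VOCAB = 32_000      # vocabulary size stated in this card
ENGLISH_BERT_VOCAB = 30_522  # reference only: original bert-base-uncased


def token_embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in the token-embedding matrix."""
    return vocab_size * hidden_size


indo = token_embedding_params(INDOBERT_VOCAB)
english = token_embedding_params(ENGLISH_BERT_VOCAB)
print(indo)            # 24576000
print(indo - english)  # 1135104
```

Under these assumptions, the larger vocabulary adds roughly 1.1 million embedding weights compared with the English bert-base-uncased checkpoint.
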
## Training procedure

The model was trained using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.

## Eval results

We evaluated this model on three Indonesian NLP downstream tasks:
- extractive summarization
- sentiment analysis
- part-of-speech tagging

In these evaluations, the model outperformed multilingual BERT on all three tasks.