[model_cards] sarahlintang/IndoBERT (#7748)

* Create README.md
* Update model_cards/sarahlintang/IndoBERT/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2025-07-31 18:22:34 +06:00 · 2020-10-15 00:10:31 +07:00 · 2020-10-15 00:10:31 +07:00 · 3fdbeba83c
commit 3fdbeba83c
parent ba654270b3
1 changed files with 43 additions and 0 deletions
--- a/model_cards/sarahlintang/IndoBERT/README.md
+++ b/model_cards/sarahlintang/IndoBERT/README.md
@@ -0,0 +1,43 @@
 ---
 language: id
 datasets:
 - oscar
 ---
 # IndoBERT (Indonesian BERT Model)
 ## Model description
 IndoBERT is a pre-trained language model for the Indonesian language, based on the BERT architecture.
 This is the base-uncased version, which uses the bert-base config.
 ## Intended uses & limitations
 #### How to use
 ```python
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
 model = AutoModel.from_pretrained("sarahlintang/IndoBERT")
 # Encode a sample sentence into token ids
 print(tokenizer.encode("hai aku mau makan."))
 # [2, 8078, 1785, 2318, 1946, 18, 4]
 ```
 ## Training data
 This model was pre-trained on 16 GB of raw text (~2 B words) from the OSCAR corpus (https://oscar-corpus.com/).
 It uses the bert-base architecture with a vocabulary size of 32,000.
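As a rough illustration of what the 32,000-entry vocabulary means for model size, the sketch below computes the number of weights in the token-embedding matrix. The hidden size of 768 is an assumption taken from the standard bert-base configuration; this card only states the vocabulary size. The figure of 30,522 is the vocabulary size of the original English bert-base-uncased and is included only for comparison. The helper name `token_embedding_params` is hypothetical.

```python
# Assumption: hidden size of the standard bert-base configuration.
# The model card itself only states the vocabulary size (32,000).
HIDDEN_SIZE = 768


def token_embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Return the number of weights in a (vocab_size x hidden_size) embedding matrix."""
    return vocab_size * hidden_size


# Vocabulary size stated in this model card.
indobert_params = token_embedding_params(32_000)
# Vocabulary size of the original English bert-base-uncased, for comparison.
bert_base_uncased_params = token_embedding_params(30_522)

print(indobert_params)           # 24576000
print(bert_base_uncased_params)  # 23440896
```

Under these assumptions the token-embedding table is only slightly larger than in the original bert-base-uncased, so the overall parameter count should be close to that of bert-base.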
 ## Training procedure
 The model was trained using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
 We used a Google Cloud Storage bucket for persistent storage of training data and models.
 ## Eval results
 We evaluated this model on three Indonesian NLP downstream tasks:
 - extractive summarization
 - sentiment analysis
 - part-of-speech tagging
 In our evaluation, this model outperformed multilingual BERT on all three tasks.