husein zolkepli 7f23af1684 added electra model (cherry picked from commit `b5f2dc5d62`)		2020-04-20 17:17:58 -04:00
language: malay

Bahasa ELECTRA Model

Pretrained ELECTRA small language model for Malay and Indonesian.

Pretraining Corpus

The electra-small-generator-bahasa-cased model was pretrained on ~1.8 billion words, covering both standard and social media language structures.

The preprocessing steps can be reproduced from Malaya/pretrained-model/preprocess.

Pretraining details

This model was trained using Google's ELECTRA GitHub repository on a single Tesla V100 GPU with 32 GB of VRAM.
All steps can be reproduced from Malaya/pretrained-model/electra.

Load Pretrained Model

To use this model, install torch or tensorflow together with the Hugging Face transformers library. You can then load it directly like this:

from transformers import ElectraTokenizer, ElectraModel

# Load the pretrained generator weights from the model hub.
model = ElectraModel.from_pretrained('huseinzol05/electra-small-generator-bahasa-cased')
# The model is cased, so the tokenizer must not lowercase its input.
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-generator-bahasa-cased',
    do_lower_case=False,
)

Example using AutoModelWithLMHead

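from transformers import ElectraTokenizer, AutoModelWithLMHead, pipeline

# Load the model with its language-modeling head so it can predict masked tokens.
model = AutoModelWithLMHead.from_pretrained('huseinzol05/electra-small-generator-bahasa-cased')
# The model is cased, so the tokenizer must not lowercase its input.
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-generator-bahasa-cased',
    do_lower_case=False,
)
# Build a fill-mask pipeline and predict the token hidden behind [MASK].
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))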

Output:

[{'sequence': '[CLS] makan ayam dengan ayam [SEP]',
  'score': 0.08424834907054901,
  'token': 3255},
 {'sequence': '[CLS] makan ayam dengan rendang [SEP]',
  'score': 0.064150370657444,
  'token': 6288},
 {'sequence': '[CLS] makan ayam dengan nasi [SEP]',
  'score': 0.033446669578552246,
  'token': 2533},
 {'sequence': '[CLS] makan ayam dengan kucing [SEP]',
  'score': 0.02803465723991394,
  'token': 3577},
 {'sequence': '[CLS] makan ayam dengan telur [SEP]',
  'score': 0.026627106592059135,
  'token': 6350}]
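
The pipeline returns a list of dictionaries, one per candidate, each with a sequence, a score and a token id. As a minimal sketch of how such a result could be post-processed, the snippet below strips the [CLS] and [SEP] markers and pulls out the predicted word. The helper name summarize_predictions is hypothetical and not part of transformers, the sample data is copied from the first three entries of the output above, and taking the last word as the prediction assumes the [MASK] sits at the end of the input, as it does in this example.

```python
# Sample predictions copied from the fill-mask output above (first three entries).
predictions = [
    {'sequence': '[CLS] makan ayam dengan ayam [SEP]',
     'score': 0.08424834907054901, 'token': 3255},
    {'sequence': '[CLS] makan ayam dengan rendang [SEP]',
     'score': 0.064150370657444, 'token': 6288},
    {'sequence': '[CLS] makan ayam dengan nasi [SEP]',
     'score': 0.033446669578552246, 'token': 2533},
]


def summarize_predictions(predictions, special_tokens=('[CLS]', '[SEP]')):
    """Hypothetical helper: return (clean sentence, last word, score), best score first."""
    rows = []
    for pred in sorted(predictions, key=lambda p: p['score'], reverse=True):
        # Drop the special tokens that wrap the decoded sequence.
        words = [w for w in pred['sequence'].split() if w not in special_tokens]
        # Assumption: the mask was the final token, so the last word is the prediction.
        rows.append((' '.join(words), words[-1], pred['score']))
    return rows


for sentence, word, score in summarize_predictions(predictions):
    print(f'{score:.4f}  {word:<8}  {sentence}')
```

If the installed transformers version already omits the special tokens from sequence, the filter simply leaves the text unchanged.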

Results

For further details on model performance, see the accuracy page in the Malaya documentation, https://malaya.readthedocs.io/en/latest/Accuracy.html, where the model is compared with traditional models.

Acknowledgement

Thanks to Im Big, LigBlou, Mesolitica and KeyReply for sponsoring AWS, Google and GPU clouds to train ELECTRA for Bahasa.