---
language: ms
---
# Bahasa T5 Model
Pretrained T5 base language model for Malay and Indonesian.
## Pretraining Corpus
The `t5-base-bahasa-cased` model was pretrained on multiple tasks. Below is the list of tasks we trained on (an illustrative text-to-text sketch follows the list):
- Unsupervised on local Wikipedia.
- Unsupervised on local news.
- Unsupervised on local parliament text.
- Unsupervised on IIUM Confession.
- Unsupervised on Wattpad.
- Unsupervised on Academia PDF.
- Next sentence prediction on local Wikipedia.
- Next sentence prediction on local news.
- Next sentence prediction on local parliament text.
- Next sentence prediction on IIUM Confession.
- Next sentence prediction on Wattpad.
- Next sentence prediction on Academia PDF.
- Bahasa SNLI.
- Bahasa Question Quora.
- Bahasa Natural Questions.
- News title summarization.
- Stemming to original Wikipedia.
- Synonym to original Wikipedia.
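Since T5 casts every task into a text-to-text format, the sketch below illustrates what such multi-task training pairs could look like. This is an assumption for illustration only: the `soalan:` prefix is taken from the generation example further down this card, while the `ringkasan:` prefix and the `make_example` helper are hypothetical and may differ from the actual Malaya preprocessing.

```python
# Illustrative sketch only, not the exact pretraining preprocessing.
# 'soalan:' appears in the generation example below; 'ringkasan:' is assumed.
def make_example(prefix, source, target):
    # Every task is reduced to an (input text, target text) pair for T5.
    return {'inputs': f'{prefix}: {source}', 'targets': target}

examples = [
    make_example('soalan', 'siapakah perdana menteri malaysia?', 'Mahathir Mohamad'),
    make_example('ringkasan', 'teks berita penuh yang panjang ...', 'tajuk berita yang ringkas'),
]

print(examples[0]['inputs'])   # soalan: siapakah perdana menteri malaysia?
print(examples[0]['targets'])  # Mahathir Mohamad
```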
Preprocessing steps can be reproduced from here: Malaya/pretrained-model/preprocess.
## Pretraining details
- This model was trained using Google T5's GitHub repository on a v3-8 TPU.
- All steps can be reproduced from here: Malaya/pretrained-model/t5.
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can use it directly by initializing it like this:
```python
from transformers import T5Tokenizer, T5Model

model = T5Model.from_pretrained('huseinzol05/t5-base-bahasa-cased')
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
```
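The bare `T5Model` returns hidden states rather than generated text. A minimal sketch of a forward pass follows; the Malay input sentence is hypothetical and not part of the original card, and since `T5Model` also requires `decoder_input_ids`, the encoded input is reused here just to obtain hidden states.

```python
import torch

# Hypothetical input sentence, used only to illustrate a forward pass.
input_ids = tokenizer.encode('saya suka makan nasi ayam', return_tensors = 'pt')
with torch.no_grad():
    outputs = model(input_ids = input_ids, decoder_input_ids = input_ids)

# Decoder last hidden states: (batch, sequence_length, 768) for a base-sized model.
print(outputs[0].shape)
```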
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-cased')
input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
Output is,

```text
'Mahathir Mohamad'
```
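The example above uses `generate` with its default greedy settings. A hedged variation (an assumption, not part of the original card) adds beam search and strips special tokens when decoding:

```python
# Assumed decoding variation: beam search with special tokens removed.
outputs = model.generate(input_ids, max_length = 32, num_beams = 4, early_stopping = True)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
```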
## Results
For further details on the model's performance, simply check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compared it with traditional models.
## Acknowledgement
Thanks to Im Big, LigBlou, Mesolitica and KeyReply for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.