diff --git a/model_cards/zanelim/singbert-large-sg/README.md b/model_cards/zanelim/singbert-large-sg/README.md
index a38f50db80e..e3be8882d13 100644
--- a/model_cards/zanelim/singbert-large-sg/README.md
+++ b/model_cards/zanelim/singbert-large-sg/README.md
@@ -13,17 +13,17 @@ datasets:
 - reddit singapore, malaysia
 - hardwarezone
 widget:
-- text: "die [MASK] must try"
 - text: "kopi c siew [MASK]"
+- text: "die [MASK] must try"
 ---
 
 # Model name
 
-SingBert - Bert for Singlish (SG) and Manglish (MY).
+SingBert Large - BERT for Singlish (SG) and Manglish (MY).
 
 ## Model description
 
-Similar to [SingBert](https://huggingface.co/zanelim/singbert) but initialized from [BERT large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models), with pre-training finetuned on
+Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the large version: initialized from [BERT large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models) and further pre-trained on
 [singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.
 
 ## Intended uses & limitations
diff --git a/model_cards/zanelim/singbert-lite-sg/README.md b/model_cards/zanelim/singbert-lite-sg/README.md
new file mode 100644
index 00000000000..13819e064b6
--- /dev/null
+++ b/model_cards/zanelim/singbert-lite-sg/README.md
@@ -0,0 +1,168 @@
+---
+language: en
+tags:
+- singapore
+- sg
+- singlish
+- malaysia
+- ms
+- manglish
+- albert-base-v2
+license: mit
+datasets:
+- reddit singapore, malaysia
+- hardwarezone
+widget:
+- text: "dont play [MASK] leh"
+- text: "die [MASK] must try"
+---
+
+# Model name
+
+SingBert Lite - BERT for Singlish (SG) and Manglish (MY).
+
+## Model description
+
+Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the lite version: initialized from [ALBERT base v2](https://github.com/google-research/albert#albert) and further pre-trained on
+[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.
+
+## Intended uses & limitations
+
+#### How to use
+
+```python
+>>> from transformers import pipeline
+>>> nlp = pipeline('fill-mask', model='zanelim/singbert-lite-sg')
+>>> nlp("die [MASK] must try")
+
+[{'sequence': '[CLS] die die must try[SEP]',
+ 'score': 0.7731555700302124,
+ 'token': 1327,
+ 'token_str': '▁die'},
+ {'sequence': '[CLS] die also must try[SEP]',
+ 'score': 0.04763784259557724,
+ 'token': 67,
+ 'token_str': '▁also'},
+ {'sequence': '[CLS] die still must try[SEP]',
+ 'score': 0.01859409362077713,
+ 'token': 174,
+ 'token_str': '▁still'},
+ {'sequence': '[CLS] die u must try[SEP]',
+ 'score': 0.015824034810066223,
+ 'token': 287,
+ 'token_str': '▁u'},
+ {'sequence': '[CLS] die is must try[SEP]',
+ 'score': 0.011271446943283081,
+ 'token': 25,
+ 'token_str': '▁is'}]
+
+>>> nlp("dont play [MASK] leh")
+
+[{'sequence': '[CLS] dont play play leh[SEP]',
+ 'score': 0.4365769624710083,
+ 'token': 418,
+ 'token_str': '▁play'},
+ {'sequence': '[CLS] dont play punk leh[SEP]',
+ 'score': 0.06880936771631241,
+ 'token': 6769,
+ 'token_str': '▁punk'},
+ {'sequence': '[CLS] dont play game leh[SEP]',
+ 'score': 0.051739856600761414,
+ 'token': 250,
+ 'token_str': '▁game'},
+ {'sequence': '[CLS] dont play games leh[SEP]',
+ 'score': 0.045703962445259094,
+ 'token': 466,
+ 'token_str': '▁games'},
+ {'sequence': '[CLS] dont play around leh[SEP]',
+ 'score': 0.013458190485835075,
+ 'token': 140,
+ 'token_str': '▁around'}]
+
+>>> nlp("catch no [MASK]")
+
+[{'sequence': '[CLS] catch no ball[SEP]',
+ 'score': 0.6197211146354675,
+ 'token': 1592,
+ 'token_str': '▁ball'},
+ {'sequence': '[CLS] catch no balls[SEP]',
+ 'score': 0.08441998809576035,
+ 'token': 7152,
+ 'token_str': '▁balls'},
+ {'sequence': '[CLS] catch no joke[SEP]',
+ 'score': 0.0676785409450531,
+ 'token': 8186,
+ 'token_str': '▁joke'},
+ {'sequence': '[CLS] catch no?[SEP]',
+ 'score': 0.040638409554958344,
+ 'token': 60,
+ 'token_str': '?'},
+ {'sequence': '[CLS] catch no one[SEP]',
+ 'score': 0.03546864539384842,
+ 'token': 53,
+ 'token_str': '▁one'}]
+
+>>> nlp("confirm plus [MASK]")
+
+[{'sequence': '[CLS] confirm plus chop[SEP]',
+ 'score': 0.9608421921730042,
+ 'token': 17144,
+ 'token_str': '▁chop'},
+ {'sequence': '[CLS] confirm plus guarantee[SEP]',
+ 'score': 0.011784233152866364,
+ 'token': 9120,
+ 'token_str': '▁guarantee'},
+ {'sequence': '[CLS] confirm plus confirm[SEP]',
+ 'score': 0.010571340098977089,
+ 'token': 10265,
+ 'token_str': '▁confirm'},
+ {'sequence': '[CLS] confirm plus egg[SEP]',
+ 'score': 0.0033525123726576567,
+ 'token': 6387,
+ 'token_str': '▁egg'},
+ {'sequence': '[CLS] confirm plus bet[SEP]',
+ 'score': 0.0008760977652855217,
+ 'token': 5676,
+ 'token_str': '▁bet'}]
+
+```
+
+Here is how to use this model to get the features of a given text in PyTorch:
+```python
+from transformers import AlbertTokenizer, AlbertModel
+tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
+model = AlbertModel.from_pretrained("zanelim/singbert-lite-sg")
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors='pt')
+output = model(**encoded_input)
+```
+
+and in TensorFlow:
+```python
+from transformers import AlbertTokenizer, TFAlbertModel
+tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
+model = TFAlbertModel.from_pretrained("zanelim/singbert-lite-sg")
+text = "Replace me by any text you'd like."
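+# tokenize into TensorFlow tensors, then run a forward pass;
+# output[0] holds the per-token hidden states (the features)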
+encoded_input = tokenizer(text, return_tensors='tf')
+output = model(encoded_input)
+```
+
+#### Limitations and bias
+This model was finetuned on a colloquial Singlish and Manglish corpus, hence it is best applied to downstream tasks involving the main
+constituent languages: English, Mandarin, and Malay. Also, since the training data comes mainly from forums, be aware of the inherent biases in such text.
+
+## Training data
+A colloquial Singlish and Manglish corpus (both are mixtures of English, Mandarin, Tamil, Malay, and other local dialects such as Hokkien, Cantonese, or Teochew),
+collected from the subreddits `r/singapore` and `r/malaysia`, and from forums such as `hardwarezone`.
+
+## Training procedure
+
+Initialized with the [ALBERT base v2](https://github.com/google-research/albert#albert) vocab and checkpoint (pre-trained weights).
+
+Pre-training was then continued on the training data with the following hyperparameters (a hedged sketch of such a run is appended at the end of this diff):
+* train_batch_size: 4096
+* max_seq_length: 128
+* num_train_steps: 125000
+* num_warmup_steps: 5000
+* learning_rate: 0.00176
+* hardware: TPU v3-8
diff --git a/model_cards/zanelim/singbert/README.md b/model_cards/zanelim/singbert/README.md
index 641f8facc95..bd5a0f96f20 100644
--- a/model_cards/zanelim/singbert/README.md
+++ b/model_cards/zanelim/singbert/README.md
@@ -13,8 +13,8 @@ datasets:
 - reddit singapore, malaysia
 - hardwarezone
 widget:
-- text: "die [MASK] must try"
 - text: "kopi c siew [MASK]"
+- text: "die [MASK] must try"
 ---
 
 # Model name
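
To make the `## Training procedure` section of the new singbert-lite-sg card above concrete: the original run used Google's ALBERT codebase on a TPU v3-8, but a minimal, non-official sketch of the same continued masked-language-model pre-training via the Hugging Face `Trainer` API might look as follows. The corpus path, output directory, and per-device batch size are placeholders (the card's `train_batch_size: 4096` was the global batch across the TPU slice).

```python
# Hypothetical sketch only -- the actual run used google-research/albert on TPU.
# Continued MLM pre-training from the albert-base-v2 checkpoint with the
# hyperparameters listed in the model card; "corpus.txt" is a placeholder path.
from transformers import (
    AlbertForMaskedLM,
    AlbertTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForMaskedLM.from_pretrained("albert-base-v2")  # init from pre-trained weights

# max_seq_length: 128
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="corpus.txt", block_size=128)
# standard 15% token masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="singbert-lite-sg",   # placeholder output directory
    max_steps=125_000,               # num_train_steps
    warmup_steps=5_000,              # num_warmup_steps
    learning_rate=0.00176,
    per_device_train_batch_size=32,  # placeholder; 4096 was the global TPU batch
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```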