[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)

* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
This commit is contained in:
Julien Chaumond 2020-12-12 00:24:42 +01:00 committed by GitHub
parent 29e4597950
commit 3552d0e0d8
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
758 changed files with 50 additions and 45116 deletions

View File

@ -60,7 +60,7 @@ Basic steps
In order to upload a model, you'll need to first create a git repo. This repo will live on the model hub, allowing
users to clone it and you (and your organization members) to push to it.
You can create a model repo directly from the website, `here <https://huggingface.co/new>`.
You can create a model repo **directly from `the /new page on the website <https://huggingface.co/new>`__.**
Alternatively, you can use the ``transformers-cli``. The next steps describe that process:
@ -82,12 +82,12 @@ This creates a repo on the model hub, which can be cloned.
.. code-block:: bash
git clone https://huggingface.co/username/your-model-name
# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install
git clone https://huggingface.co/username/your-model-name
When you have your local clone of your repo and lfs installed, you can then add/remove from that clone as you would
with any other git repo.
@ -98,8 +98,12 @@ with any other git repo.
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
We are intentionally not wrapping git too much, so as to stay intuitive and easy-to-use.
We are intentionally not wrapping git too much, so that you can go on with the workflow you're used to and the tools
you already know.
The only learning curve you might have compared to regular git is the one for git-lfs. The documentation at
`git-lfs.github.com <https://git-lfs.github.com/>`__ is decent, but we'll work on a tutorial with some tips and tricks
in the coming weeks!
Make your model work on all frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -110,7 +114,7 @@ Make your model work on all frameworks
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's
super easy to do (and in a future version, it will all be automatic). You will need to install both PyTorch and
super easy to do (and in a future version, it might all be automatic). You will need to install both PyTorch and
TensorFlow for this step, but you don't need to worry about the GPU, so it should be very easy. Check the `TensorFlow
installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ and/or the `PyTorch
installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
@ -192,7 +196,7 @@ status`` command:
git add --all
git status
Finally, the files should be comitted:
Finally, the files should be committed:
.. code-block:: bash
@ -210,23 +214,20 @@ This will upload the folder containing the weights, tokenizer and configuration
Add a model card
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To make sure everyone knows what your model can do, what its limitations and potential bias or ethetical
considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
placed in a subfolder with your username or organization, then another subfolder named like your model
(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will get
you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a model
card template (meta-suggestions are welcome).
To make sure everyone knows what your model can do, what its limitations, potential bias or ethical considerations are,
please add a README.md model card to your model repo. You can just create it, or there's also a convenient button
titled "Add a README.md" on your model page. A model card template can be found `here
<https://github.com/huggingface/model_card>`__ (meta-suggestions are welcome). model card template (meta-suggestions
are welcome).
.. note::
Model cards used to live in the 🤗 Transformers repo under `model_cards/`, but for consistency and scalability we
migrated every model card from the repo to its corresponding huggingface.co model repo.
If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
don't forget to link to its model card so that people can fully trace how your model was built.
If you have never made a pull request to the 🤗 Transformers repo, look at the :doc:`contributing guide <contributing>`
to see the steps to follow.
.. note::
You can also send your model card in the folder you uploaded with the CLI by placing it in a `README.md` file
inside `path/to/awesome-name-you-picked/`.
Using your model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -262,7 +263,8 @@ First you need to install `git-lfs` in the environment used by the notebook:
sudo apt-get install git-lfs
Then you can use the :obj:`transformers-cli` to create your new repo:
Then you can use either create a repo directly from `huggingface.co <https://huggingface.co/>`__ , or use the
:obj:`transformers-cli` to create it:
.. code-block:: bash
@ -274,13 +276,14 @@ Once it's created, you can clone it and configure it (replace username by your u
.. code-block:: bash
git lfs install
git clone https://username:password@huggingface.co/username/your-model-name
# Alternatively if you have a token,
# you can use it instead of your password
git clone https://username:token@huggingface.co/username/your-model-name
cd your-model-name
git lfs install
git config --global user.email "email@example.com"
# Tip: using the same email than for your huggingface.co account will link your commits to your profile
git config --global user.name "Your name"

View File

@ -1,20 +0,0 @@
---
language: ja
license: apache-2.0
---
## Japanese ELECTRA-small
We provide a Japanese **ELECTRA-Small** model, as described in [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB).
Our pretraining process employs subword units derived from the [Japanese Wikipedia](https://dumps.wikimedia.org/jawiki/latest), using the [Byte-Pair Encoding](https://www.aclweb.org/anthology/P16-1162.pdf) method and building on an initial tokenization with [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd). For optimal performance, please take care to set your MeCab dictionary appropriately.
## How to use the discriminator in `transformers`
```
from transformers import BertJapaneseTokenizer, ElectraForPreTraining
tokenizer = BertJapaneseTokenizer.from_pretrained('Cinnamon/electra-small-japanese-discriminator', mecab_kwargs={"mecab_option": "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"})
model = ElectraForPreTraining.from_pretrained('Cinnamon/electra-small-japanese-discriminator')
```

View File

@ -1,18 +0,0 @@
---
language: ja
---
## Japanese ELECTRA-small
We provide a Japanese **ELECTRA-Small** model, as described in [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB).
Our pretraining process employs subword units derived from the [Japanese Wikipedia](https://dumps.wikimedia.org/jawiki/latest), using the [Byte-Pair Encoding](https://www.aclweb.org/anthology/P16-1162.pdf) method and building on an initial tokenization with [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd). For optimal performance, please take care to set your MeCab dictionary appropriately.
```
# ELECTRA-small generator usage
from transformers import BertJapaneseTokenizer, ElectraForMaskedLM
tokenizer = BertJapaneseTokenizer.from_pretrained('Cinnamon/electra-small-japanese-generator', mecab_kwargs={"mecab_option": "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"})
model = ElectraForMaskedLM.from_pretrained('Cinnamon/electra-small-japanese-generator')
```

View File

@ -1,142 +0,0 @@
---
language: da
tags:
- bert
- masked-lm
license: cc-by-4.0
datasets:
- common_crawl
- wikipedia
pipeline_tag: fill-mask
widget:
- text: "København er [MASK] i Danmark."
---
# Danish BERT (uncased) model
[BotXO.ai](https://www.botxo.ai/) developed this model. For data and training details see their [GitHub repository](https://github.com/botxo/nordic_bert).
The original model was trained in TensorFlow then I converted it to Pytorch using [transformers-cli](https://huggingface.co/transformers/converting_tensorflow_models.html?highlight=cli).
For TensorFlow version download here: https://www.dropbox.com/s/19cjaoqvv2jicq9/danish_bert_uncased_v2.zip?dl=1
## Architecture
```python
from transformers import AutoModelForPreTraining
model = AutoModelForPreTraining.from_pretrained("DJSammy/bert-base-danish-uncased_BotXO,ai")
params = list(model.named_parameters())
print('danish_bert_uncased_v2 has {:} different named parameters.\n'.format(len(params)))
print('==== Embedding Layer ====\n')
for p in params[0:5]:
print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== First Transformer ====\n')
for p in params[5:21]:
print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== Last Transformer ====\n')
for p in params[181:197]:
print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== Output Layer ====\n')
for p in params[197:]:
print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
# danish_bert_uncased_v2 has 206 different named parameters.
# ==== Embedding Layer ====
# bert.embeddings.word_embeddings.weight (32000, 768)
# bert.embeddings.position_embeddings.weight (512, 768)
# bert.embeddings.token_type_embeddings.weight (2, 768)
# bert.embeddings.LayerNorm.weight (768,)
# bert.embeddings.LayerNorm.bias (768,)
# ==== First Transformer ====
# bert.encoder.layer.0.attention.self.query.weight (768, 768)
# bert.encoder.layer.0.attention.self.query.bias (768,)
# bert.encoder.layer.0.attention.self.key.weight (768, 768)
# bert.encoder.layer.0.attention.self.key.bias (768,)
# bert.encoder.layer.0.attention.self.value.weight (768, 768)
# bert.encoder.layer.0.attention.self.value.bias (768,)
# bert.encoder.layer.0.attention.output.dense.weight (768, 768)
# bert.encoder.layer.0.attention.output.dense.bias (768,)
# bert.encoder.layer.0.attention.output.LayerNorm.weight (768,)
# bert.encoder.layer.0.attention.output.LayerNorm.bias (768,)
# bert.encoder.layer.0.intermediate.dense.weight (3072, 768)
# bert.encoder.layer.0.intermediate.dense.bias (3072,)
# bert.encoder.layer.0.output.dense.weight (768, 3072)
# bert.encoder.layer.0.output.dense.bias (768,)
# bert.encoder.layer.0.output.LayerNorm.weight (768,)
# bert.encoder.layer.0.output.LayerNorm.bias (768,)
# ==== Last Transformer ====
# bert.encoder.layer.11.attention.self.query.weight (768, 768)
# bert.encoder.layer.11.attention.self.query.bias (768,)
# bert.encoder.layer.11.attention.self.key.weight (768, 768)
# bert.encoder.layer.11.attention.self.key.bias (768,)
# bert.encoder.layer.11.attention.self.value.weight (768, 768)
# bert.encoder.layer.11.attention.self.value.bias (768,)
# bert.encoder.layer.11.attention.output.dense.weight (768, 768)
# bert.encoder.layer.11.attention.output.dense.bias (768,)
# bert.encoder.layer.11.attention.output.LayerNorm.weight (768,)
# bert.encoder.layer.11.attention.output.LayerNorm.bias (768,)
# bert.encoder.layer.11.intermediate.dense.weight (3072, 768)
# bert.encoder.layer.11.intermediate.dense.bias (3072,)
# bert.encoder.layer.11.output.dense.weight (768, 3072)
# bert.encoder.layer.11.output.dense.bias (768,)
# bert.encoder.layer.11.output.LayerNorm.weight (768,)
# bert.encoder.layer.11.output.LayerNorm.bias (768,)
# ==== Output Layer ====
# bert.pooler.dense.weight (768, 768)
# bert.pooler.dense.bias (768,)
# cls.predictions.bias (32000,)
# cls.predictions.transform.dense.weight (768, 768)
# cls.predictions.transform.dense.bias (768,)
# cls.predictions.transform.LayerNorm.weight (768,)
# cls.predictions.transform.LayerNorm.bias (768,)
# cls.seq_relationship.weight (2, 768)
# cls.seq_relationship.bias (2,)
```
## Example Pipeline
```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='DJSammy/bert-base-danish-uncased_BotXO,ai')
unmasker('København er [MASK] i Danmark.')
# Copenhagen is the [MASK] of Denmark.
# =>
# [{'score': 0.788068950176239,
# 'sequence': '[CLS] københavn er hovedstad i danmark. [SEP]',
# 'token': 12610,
# 'token_str': 'hovedstad'},
# {'score': 0.07606703042984009,
# 'sequence': '[CLS] københavn er hovedstaden i danmark. [SEP]',
# 'token': 8108,
# 'token_str': 'hovedstaden'},
# {'score': 0.04299738258123398,
# 'sequence': '[CLS] københavn er metropol i danmark. [SEP]',
# 'token': 23305,
# 'token_str': 'metropol'},
# {'score': 0.008163209073245525,
# 'sequence': '[CLS] københavn er ikke i danmark. [SEP]',
# 'token': 89,
# 'token_str': 'ikke'},
# {'score': 0.006238455418497324,
# 'sequence': '[CLS] københavn er ogsa i danmark. [SEP]',
# 'token': 25253,
# 'token_str': 'ogsa'}]
```

View File

@ -1,14 +0,0 @@
---
language:
- bg
- cs
- pl
- ru
---
# bert-base-bg-cs-pl-ru-cased
SlavicBERT\[1\] \(Slavic \(bg, cs, pl, ru\), cased, 12layer, 768hidden, 12heads, 180M parameters\) was trained on Russian News and four Wikipedias: Bulgarian, Czech, Polish, and Russian. Subtoken vocabulary was built using this data. Multilingual BERT was used as an initialization for SlavicBERT.
\[1\]: Arkhipov M., Trofimova M., Kuratov Y., Sorokin A. \(2019\). [Tuning Multilingual Transformers for Language-Specific Named Entity Recognition](https://www.aclweb.org/anthology/W19-3712/). ACL anthology W19-3712.

View File

@ -1,16 +0,0 @@
---
language: en
---
# bert-base-cased-conversational
Conversational BERT \(English, cased, 12layer, 768hidden, 12heads, 110M parameters\) was trained on the English part of Twitter, Reddit, DailyDialogues\[1\], OpenSubtitles\[2\], Debates\[3\], Blogs\[4\], Facebook News Comments. We used this training data to build the vocabulary of English subtokens and took English cased version of BERTbase as an initialization for English Conversational BERT.
\[1\]: Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. IJCNLP 2017.
\[2\]: P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation \(LREC 2016\)
\[3\]: Justine Zhang, Ravi Kumar, Sujith Ravi, Cristian Danescu-Niculescu-Mizil. Proceedings of NAACL, 2016.
\[4\]: J. Schler, M. Koppel, S. Argamon and J. Pennebaker \(2006\). Effects of Age and Gender on Blogging in Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.

View File

@ -1,15 +0,0 @@
---
language:
- multilingual
---
# bert-base-multilingual-cased-sentence
Sentence Multilingual BERT \(101 languages, cased, 12layer, 768hidden, 12heads, 180M parameters\) is a representationbased sentence encoder for 101 languages of Multilingual BERT. It is initialized with Multilingual BERT and then finetuned on english MultiNLI\[1\] and on dev set of multilingual XNLI\[2\]. Sentence representations are mean pooled token embeddings in the same manner as in SentenceBERT\[3\].
\[1\]: Williams A., Nangia N. & Bowman S. \(2017\) A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. arXiv preprint [arXiv:1704.05426](https://arxiv.org/abs/1704.05426)
\[2\]: Williams A., Bowman S. \(2018\) XNLI: Evaluating Cross-lingual Sentence Representations. arXiv preprint [arXiv:1809.05053](https://arxiv.org/abs/1809.05053)
\[3\]: N. Reimers, I. Gurevych \(2019\) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint [arXiv:1908.10084](https://arxiv.org/abs/1908.10084)

View File

@ -1,13 +0,0 @@
---
language:
- ru
---
# rubert-base-cased-conversational
Conversational RuBERT \(Russian, cased, 12layer, 768hidden, 12heads, 180M parameters\) was trained on OpenSubtitles\[1\], [Dirty](https://d3.ru/), [Pikabu](https://pikabu.ru/), and a Social Media segment of Taiga corpus\[2\]. We assembled a new vocabulary for Conversational RuBERT model on this data and initialized the model with [RuBERT](../rubert-base-cased).
\[1\]: P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation \(LREC 2016\)
\[2\]: Shavrina T., Shapovalova O. \(2017\) TO THE METHODOLOGY OF CORPUS CONSTRUCTION FOR MACHINE LEARNING: «TAIGA» SYNTAX TREE CORPUS AND PARSER. in proc. of “CORPORA2017”, international conference , Saint-Petersbourg, 2017.

View File

@ -1,15 +0,0 @@
---
language:
- ru
---
# rubert-base-cased-sentence
Sentence RuBERT \(Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters\) is a representationbased sentence encoder for Russian. It is initialized with RuBERT and finetuned on SNLI\[1\] google-translated to russian and on russian part of XNLI dev set\[2\]. Sentence representations are mean pooled token embeddings in the same manner as in SentenceBERT\[3\].
\[1\]: S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. \(2015\) A large annotated corpus for learning natural language inference. arXiv preprint [arXiv:1508.05326](https://arxiv.org/abs/1508.05326)
\[2\]: Williams A., Bowman S. \(2018\) XNLI: Evaluating Cross-lingual Sentence Representations. arXiv preprint [arXiv:1809.05053](https://arxiv.org/abs/1809.05053)
\[3\]: N. Reimers, I. Gurevych \(2019\) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint [arXiv:1908.10084](https://arxiv.org/abs/1908.10084)

View File

@ -1,11 +0,0 @@
---
language:
- ru
---
# rubert-base-cased
RuBERT \(Russian, cased, 12layer, 768hidden, 12heads, 180M parameters\) was trained on the Russian part of Wikipedia and news data. We used this training data to build a vocabulary of Russian subtokens and took a multilingual version of BERTbase as an initialization for RuBERT\[1\].
\[1\]: Kuratov, Y., Arkhipov, M. \(2019\). Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language. arXiv preprint [arXiv:1905.07213](https://arxiv.org/abs/1905.07213).

View File

@ -1,61 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
- text: "Paris est la [MASK] de la France."
- text: "Paris est la capitale de la [MASK]."
- text: "L'élection américaine a eu [MASK] en novembre 2020."
- text: "تقع سويسرا في [MASK] أوروبا"
- text: "إسمي محمد وأسكن في [MASK]."
---
# bert-base-15lang-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
The measurements below have been computed on a [Google Cloud n1-standard-1 machine (1 vCPU, 3.75 GB)](https://cloud.google.com/compute/docs/machine-types\#n1_machine_type):
| Model | Num parameters | Size | Memory | Loading time |
| ------------------------------- | -------------- | -------- | -------- | ------------ |
| bert-base-multilingual-cased | 178 million | 714 MB | 1400 MB | 4.2 sec |
| Geotrend/bert-base-15lang-cased | 141 million | 564 MB | 1098 MB | 3.1 sec |
Handled languages: en, fr, es, de, zh, ar, ru, vi, el, bg, th, tr, hi, ur and sw.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-15lang-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-15lang-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: ar
datasets: wikipedia
license: apache-2.0
widget:
- text: "تقع سويسرا في [MASK] أوروبا"
- text: "إسمي محمد وأسكن في [MASK]."
---
# bert-base-ar-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-ar-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-ar-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: bg
datasets: wikipedia
license: apache-2.0
---
# bert-base-bg-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-bg-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-bg-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: de
datasets: wikipedia
license: apache-2.0
---
# bert-base-de-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-de-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-de-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: el
datasets: wikipedia
license: apache-2.0
---
# bert-base-el-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-el-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-el-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,49 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
- text: "تقع سويسرا في [MASK] أوروبا"
- text: "إسمي محمد وأسكن في [MASK]."
---
# bert-base-en-ar-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-ar-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-ar-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-bg-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-bg-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-bg-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: en
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-de-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-de-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-de-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-el-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-el-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-el-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-es-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-es-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-es-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,50 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
- text: "Paris est la [MASK] de la France."
- text: "Paris est la capitale de la [MASK]."
- text: "L'élection américaine a eu [MASK] en novembre 2020."
---
# bert-base-en-fr-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-fr-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-fr-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-hi-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-hi-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-hi-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-ru-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-ru-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-ru-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-sw-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-sw-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-sw-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-th-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-th-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-th-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-tr-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-tr-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-tr-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-ur-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-ur-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-ur-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-vi-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-vi-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-vi-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: multilingual
datasets: wikipedia
license: apache-2.0
widget:
- text: "Google generated 46 billion [MASK] in revenue."
- text: "Paris is the capital of [MASK]."
- text: "Algiers is the largest city in [MASK]."
---
# bert-base-en-zh-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-zh-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-zh-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: es
datasets: wikipedia
license: apache-2.0
---
# bert-base-es-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-es-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-es-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,47 +0,0 @@
---
language: fr
datasets: wikipedia
license: apache-2.0
widget:
- text: "Paris est la [MASK] de la France."
- text: "Paris est la capitale de la [MASK]."
- text: "L'élection américaine a eu [MASK] en novembre 2020."
---
# bert-base-fr-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-fr-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-fr-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: hi
datasets: wikipedia
license: apache-2.0
---
# bert-base-hi-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-hi-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-hi-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: ru
datasets: wikipedia
license: apache-2.0
---
# bert-base-ru-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-ru-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-ru-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: sw
datasets: wikipedia
license: apache-2.0
---
# bert-base-sw-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-sw-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-sw-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: th
datasets: wikipedia
license: apache-2.0
---
# bert-base-th-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-th-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-th-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: tr
datasets: wikipedia
license: apache-2.0
---
# bert-base-tr-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-tr-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-tr-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: ur
datasets: wikipedia
license: apache-2.0
---
# bert-base-ur-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-ur-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-ur-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: vi
datasets: wikipedia
license: apache-2.0
---
# bert-base-vi-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-vi-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-vi-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,42 +0,0 @@
---
language: zh
datasets: wikipedia
license: apache-2.0
---
# bert-base-zh-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions give exactly the same representations produced by the original model which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-zh-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-zh-cased")
```
To generate other smaller versions of multilingual transformers please visit [our Github repo](https://github.com/Geotrend-research/smaller-transformers).
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.

View File

@ -1,18 +0,0 @@
This model is used detecting **hatespeech** in **Arabic language**. The mono in the name refers to the monolingual setting, where the model is trained using only Arabic language data. It is finetuned on multilingual bert model.
The model is trained with different learning rates and the best validation score achieved is 0.877609 for a learning rate of 2e-5. Training code can be found at this [url](https://github.com/punyajoy/DE-LIMIT)
### For more details about our paper
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. "[Deep Learning Models for Multilingual Hate Speech Detection](https://arxiv.org/abs/2004.06465)". Accepted at ECML-PKDD 2020.
***Please cite our paper in any published work that uses any of these resources.***
~~~
@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}
~~~

View File

@ -1,20 +0,0 @@
This model is used detecting **hatespeech** in **English language**. The mono in the name refers to the monolingual setting, where the model is trained using only English language data. It is finetuned on multilingual bert model.
The model is trained with different learning rates and the best validation score achieved is 0.726030 for a learning rate of 2e-5. Training code can be found at this [url](https://github.com/punyajoy/DE-LIMIT)
### For more details about our paper
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. "[Deep Learning Models for Multilingual Hate Speech Detection](https://arxiv.org/abs/2004.06465)". Accepted at ECML-PKDD 2020.
***Please cite our paper in any published work that uses any of these resources.***
~~~
@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}
~~~

View File

@ -1,20 +0,0 @@
This model is used detecting **hatespeech** in **French language**. The mono in the name refers to the monolingual setting, where the model is trained using only English language data. It is finetuned on multilingual bert model.
The model is trained with different learning rates and the best validation score achieved is 0.692094 for a learning rate of 3e-5. Training code can be found at this [url](https://github.com/punyajoy/DE-LIMIT)
### For more details about our paper
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. "[Deep Learning Models for Multilingual Hate Speech Detection](https://arxiv.org/abs/2004.06465)". Accepted at ECML-PKDD 2020.
***Please cite our paper in any published work that uses any of these resources.***
~~~
@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}
~~~

View File

@ -1,20 +0,0 @@
This model is used detecting **hatespeech** in **German language**. The mono in the name refers to the monolingual setting, where the model is trained using only English language data. It is finetuned on multilingual bert model.
The model is trained with different learning rates and the best validation score achieved is 0.649794 for a learning rate of 3e-5. Training code can be found at this [url](https://github.com/punyajoy/DE-LIMIT)
### For more details about our paper
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. "[Deep Learning Models for Multilingual Hate Speech Detection](https://arxiv.org/abs/2004.06465)". Accepted at ECML-PKDD 2020.
***Please cite our paper in any published work that uses any of these resources.***
~~~
@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}
~~~

View File

@ -1,20 +0,0 @@
This model is used detecting **hatespeech** in **Indonesian language**. The mono in the name refers to the monolingual setting, where the model is trained using only Arabic language data. It is finetuned on multilingual bert model.
The model is trained with different learning rates and the best validation score achieved is 0.844494 for a learning rate of 2e-5. Training code can be found at this [url](https://github.com/punyajoy/DE-LIMIT)
### For more details about our paper
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. "[Deep Learning Models for Multilingual Hate Speech Detection](https://arxiv.org/abs/2004.06465)". Accepted at ECML-PKDD 2020.
***Please cite our paper in any published work that uses any of these resources.***
~~~
@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}
~~~

View File

@ -1,20 +0,0 @@
This model is used detecting **hatespeech** in **Italian language**. The mono in the name refers to the monolingual setting, where the model is trained using only English language data. It is finetuned on multilingual bert model.
The model is trained with different learning rates and the best validation score achieved is 0.837288 for a learning rate of 3e-5. Training code can be found at this [url](https://github.com/punyajoy/DE-LIMIT)
### For more details about our paper
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. "[Deep Learning Models for Multilingual Hate Speech Detection](https://arxiv.org/abs/2004.06465)". Accepted at ECML-PKDD 2020.
***Please cite our paper in any published work that uses any of these resources.***
~~~
@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}
~~~

View File

@ -1,20 +0,0 @@
This model is used detecting **hatespeech** in **Polish language**. The mono in the name refers to the monolingual setting, where the model is trained using only English language data. It is finetuned on multilingual bert model.
The model is trained with different learning rates and the best validation score achieved is 0.723254 for a learning rate of 2e-5. Training code can be found at this [url](https://github.com/punyajoy/DE-LIMIT)
### For more details about our paper
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. "[Deep Learning Models for Multilingual Hate Speech Detection](https://arxiv.org/abs/2004.06465)". Accepted at ECML-PKDD 2020.
***Please cite our paper in any published work that uses any of these resources.***
~~~
@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}
~~~

View File

@ -1,20 +0,0 @@
This model is used detecting **hatespeech** in **Portuguese language**. The mono in the name refers to the monolingual setting, where the model is trained using only English language data. It is finetuned on multilingual bert model.
The model is trained with different learning rates and the best validation score achieved is 0.716119 for a learning rate of 3e-5. Training code can be found at this [url](https://github.com/punyajoy/DE-LIMIT)
### For more details about our paper
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. "[Deep Learning Models for Multilingual Hate Speech Detection](https://arxiv.org/abs/2004.06465)". Accepted at ECML-PKDD 2020.
***Please cite our paper in any published work that uses any of these resources.***
~~~
@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}
~~~

View File

@ -1,20 +0,0 @@
This model is used detecting **hatespeech** in **Spanish language**. The mono in the name refers to the monolingual setting, where the model is trained using only English language data. It is finetuned on multilingual bert model.
The model is trained with different learning rates and the best validation score achieved is 0.740287 for a learning rate of 3e-5. Training code can be found at this [url](https://github.com/punyajoy/DE-LIMIT)
### For more details about our paper
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. "[Deep Learning Models for Multilingual Hate Speech Detection](https://arxiv.org/abs/2004.06465)". Accepted at ECML-PKDD 2020.
***Please cite our paper in any published work that uses any of these resources.***
~~~
@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}
~~~

View File

@ -1,124 +0,0 @@
## ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Googles BERT architecture with the same configurations as BERT-Base.
Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task aims to extract named entities in the text, such as names and label with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences that are marked with `IOB` format. In this format, tokens that are not part of an entity are tagged as `”O”` the `”B”`tag corresponds to the first word of an object, and the `”I”` tag corresponds to the rest of the terms of the same entity. Both `”B”` and `”I”` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels the tokens upon being fed a raw text. There are two primary datasets used in Persian NER, `ARMAN`, and `PEYMA`. In ParsBERT, we prepared ner for both datasets as well as a combination of both datasets.
### PEYMA
PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens from which 41,148 tokens are tagged with seven different classes.
1. Organization
2. Money
3. Location
4. Date
5. Time
6. Person
7. Percent
| Label | # |
|:------------:|:-----:|
| Organization | 16964 |
| Money | 2037 |
| Location | 8782 |
| Date | 4259 |
| Time | 732 |
| Person | 7675 |
| Percent | 699 |
**Download**
You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
---
### ARMAN
ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.
1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person
| Label | # |
|:------------:|:-----:|
| Organization | 30108 |
| Location | 12924 |
| Facility | 4458 |
| Event | 7557 |
| Product | 4389 |
| Person | 15645 |
**Download**
You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
## Results
The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
| ARMAN + PEYMA | 95.13* | - | - | - | - | - |
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
| ARMAN | 93.10* | 89.9 | 84.03 | 86.55 | - | 77.45 |
## How to use :hugs:
| Notebook | Description | |
|:----------|:-------------|------:|
| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
## Cite
Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
```markdown
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
```
## Acknowledgments
We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
## Contributors
- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
- Mohammad Gharachorloo: [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
- Marzieh Farahani: [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
- Mohammad Manthouri: [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
- Hooshvare Team: [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
## Releases
### Release v0.1 (May 29, 2019)
This is the first version of our ParsBERT NER!

View File

@ -1,124 +0,0 @@
## ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Googles BERT architecture with the same configurations as BERT-Base.
Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task aims to extract named entities in the text, such as names and label with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences that are marked with `IOB` format. In this format, tokens that are not part of an entity are tagged as `”O”` the `”B”`tag corresponds to the first word of an object, and the `”I”` tag corresponds to the rest of the terms of the same entity. Both `”B”` and `”I”` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels the tokens upon being fed a raw text. There are two primary datasets used in Persian NER, `ARMAN`, and `PEYMA`. In ParsBERT, we prepared ner for both datasets as well as a combination of both datasets.
### PEYMA
PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens from which 41,148 tokens are tagged with seven different classes.
1. Organization
2. Money
3. Location
4. Date
5. Time
6. Person
7. Percent
| Label | # |
|:------------:|:-----:|
| Organization | 16964 |
| Money | 2037 |
| Location | 8782 |
| Date | 4259 |
| Time | 732 |
| Person | 7675 |
| Percent | 699 |
**Download**
You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
---
### ARMAN
ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.
1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person
| Label | # |
|:------------:|:-----:|
| Organization | 30108 |
| Location | 12924 |
| Facility | 4458 |
| Event | 7557 |
| Product | 4389 |
| Person | 15645 |
**Download**
You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
## Results
The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
| ARMAN + PEYMA | 95.13* | - | - | - | - | - |
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
| ARMAN | 93.10* | 89.9 | 84.03 | 86.55 | - | 77.45 |
## How to use :hugs:
| Notebook | Description | |
|:----------|:-------------|------:|
| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
## Cite
Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
```markdown
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
```
## Acknowledgments
We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
## Contributors
- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
- Mohammad Gharachorloo: [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
- Marzieh Farahani: [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
- Mohammad Manthouri: [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
- Hooshvare Team: [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
## Releases
### Release v0.1 (May 29, 2019)
This is the first version of our ParsBERT NER!

View File

@ -1,124 +0,0 @@
## ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Googles BERT architecture with the same configurations as BERT-Base.
Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task aims to extract named entities in the text, such as names and label with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences that are marked with `IOB` format. In this format, tokens that are not part of an entity are tagged as `”O”` the `”B”`tag corresponds to the first word of an object, and the `”I”` tag corresponds to the rest of the terms of the same entity. Both `”B”` and `”I”` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels the tokens upon being fed a raw text. There are two primary datasets used in Persian NER, `ARMAN`, and `PEYMA`. In ParsBERT, we prepared ner for both datasets as well as a combination of both datasets.
### PEYMA
PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens from which 41,148 tokens are tagged with seven different classes.
1. Organization
2. Money
3. Location
4. Date
5. Time
6. Person
7. Percent
| Label | # |
|:------------:|:-----:|
| Organization | 16964 |
| Money | 2037 |
| Location | 8782 |
| Date | 4259 |
| Time | 732 |
| Person | 7675 |
| Percent | 699 |
**Download**
You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
---
### ARMAN
ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.
1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person
| Label | # |
|:------------:|:-----:|
| Organization | 30108 |
| Location | 12924 |
| Facility | 4458 |
| Event | 7557 |
| Product | 4389 |
| Person | 15645 |
**Download**
You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
## Results
The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
| ARMAN + PEYMA | 95.13* | - | - | - | - | - |
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
| ARMAN | 93.10* | 89.9 | 84.03 | 86.55 | - | 77.45 |
## How to use :hugs:
| Notebook | Description | |
|:----------|:-------------|------:|
| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
## Cite
Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
```markdown
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
```
## Acknowledgments
We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
## Contributors
- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
- Mohammad Gharachorloo: [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
- Marzieh Farahani: [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
- Mohammad Manthouri: [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
- Hooshvare Team: [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
## Releases
### Release v0.1 (May 29, 2019)
This is the first version of our ParsBERT NER!

View File

@ -1,124 +0,0 @@
## ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Googles BERT architecture with the same configurations as BERT-Base.
Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
---
## Introduction
This model is pre-trained on a large Persian corpus with various writing styles from numerous subjects (e.g., scientific, novels, news) with more than 2M documents. A large subset of this corpus was crawled manually.
As a part of ParsBERT methodology, an extensive pre-processing combining POS tagging and WordPiece segmentation was carried out to bring the corpus into a proper format. This process produces more than 40M true sentences.
## Evaluation
ParsBERT is evaluated on three NLP downstream tasks: Sentiment Analysis (SA), Text Classification, and Named Entity Recognition (NER). For this matter and due to insufficient resources, two large datasets for SA and two for text classification were manually composed, which are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models for all tasks, improving the state-of-the-art performance in Persian language modeling.
## Results
The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
### Sentiment Analysis (SA) task
| Dataset | ParsBERT | mBERT | DeepSentiPers |
|:--------------------------:|:---------:|:-----:|:-------------:|
| Digikala User Comments | 81.74* | 80.74 | - |
| SnappFood User Comments | 88.12* | 87.87 | - |
| SentiPers (Multi Class) | 71.11* | - | 69.33 |
| SentiPers (Binary Class) | 92.13* | - | 91.98 |
### Text Classification (TC) task
| Dataset | ParsBERT | mBERT |
|:-----------------:|:--------:|:-----:|
| Digikala Magazine | 93.59* | 90.72 |
| Persian News | 97.19* | 95.79 |
### Named Entity Recognition (NER) task
| Dataset | ParsBERT | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:-------:|:--------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
| PEYMA | 93.10* | 86.64 | - | 90.59 | - | 84.00 | - |
| ARMAN | 98.79* | 95.89 | 89.9 | 84.03 | 86.55 | - | 77.45 |
**If you tested ParsBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference**
## How to use
### TensorFlow 2.0
```python
from transformers import AutoConfig, AutoTokenizer, TFAutoModel
config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
tokenizer.tokenize(text)
>>> ['ما', 'در', 'هوش', '##واره', 'معتقدیم', 'با', 'انتقال', 'صحیح', 'دانش', 'و', 'اگاهی', '،', 'همه', 'افراد', 'میتوانند', 'از', 'ابزارهای', 'هوشمند', 'استفاده', 'کنند', '.', 'شعار', 'ما', 'هوش', 'مصنوعی', 'برای', 'همه', 'است', '.']
```
### Pytorch
```python
from transformers import AutoConfig, AutoTokenizer, AutoModel
config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
```
## NLP Tasks Tutorial
Coming soon stay tuned
## Cite
Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
```markdown
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
```
## Acknowledgments
We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
## Contributors
- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
- Mohammad Gharachorloo: [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
- Marzieh Farahani: [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
- Mohammad Manthouri: [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
- Hooshvare Team: [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
## Releases
### Release v0.1 (May 27, 2019)
This is the first version of our ParsBERT based on BERT<sub>BASE</sub>

View File

@ -1,147 +0,0 @@
---
language: fa
tags:
- bert-fa
- bert-persian
- persian-lm
license: apache-2.0
---
# ParsBERT (v2.0)
A Transformer-based Model for Persian Language Understanding
We reconstructed the vocabulary and fine-tuned the ParsBERT v1.1 on the new Persian corpora in order to provide some functionalities for using ParsBERT in other scopes!
Please follow the [ParsBERT](https://github.com/hooshvare/parsbert) repo for the latest information about previous and current models.
## Introduction
ParsBERT is a monolingual language model based on Googles BERT architecture. This model is pre-trained on large Persian corpora with various writing styles from numerous subjects (e.g., scientific, novels, news) with more than `3.9M` documents, `73M` sentences, and `1.3B` words.
Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
## Intended uses & limitations
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?search=bert-fa) to look for
fine-tuned versions on a task that interests you.
### How to use
#### TensorFlow 2.0
```python
from transformers import AutoConfig, AutoTokenizer, TFAutoModel
config = AutoConfig.from_pretrained("HooshvareLab/bert-fa-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")
model = TFAutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")
text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد میتوانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
tokenizer.tokenize(text)
>>> ['ما', 'در', 'هوش', '##واره', 'معتقدیم', 'با', 'انتقال', 'صحیح', 'دانش', 'و', 'اگاهی', '،', 'همه', 'افراد', 'میتوانند', 'از', 'ابزارهای', 'هوشمند', 'استفاده', 'کنند', '.', 'شعار', 'ما', 'هوش', 'مصنوعی', 'برای', 'همه', 'است', '.']
```
#### Pytorch
```python
from transformers import AutoConfig, AutoTokenizer, AutoModel
config = AutoConfig.from_pretrained("HooshvareLab/bert-fa-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")
model = AutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")
```
## Training
ParsBERT trained on a massive amount of public corpora ([Persian Wikidumps](https://dumps.wikimedia.org/fawiki/), [MirasText](https://github.com/miras-tech/MirasText)) and six other manually crawled text data from a various type of websites ([BigBang Page](https://bigbangpage.com/) `scientific`, [Chetor](https://www.chetor.com/) `lifestyle`, [Eligasht](https://www.eligasht.com/Blog/) `itinerary`, [Digikala](https://www.digikala.com/mag/) `digital magazine`, [Ted Talks](https://www.ted.com/talks) `general conversational`, Books `novels, storybooks, short stories from old to the contemporary era`).
As a part of ParsBERT methodology, an extensive pre-processing combining POS tagging and WordPiece segmentation was carried out to bring the corpora into a proper format.
## Goals
Objective goals during training are as below (after 300k steps).
``` bash
***** Eval results *****
global_step = 300000
loss = 1.4392426
masked_lm_accuracy = 0.6865794
masked_lm_loss = 1.4469004
next_sentence_accuracy = 1.0
next_sentence_loss = 6.534152e-05
```
## Derivative models
### Base Config
#### ParsBERT v2.0 Model
- [HooshvareLab/bert-fa-base-uncased](https://huggingface.co/HooshvareLab/bert-fa-base-uncased)
#### ParsBERT v2.0 Sentiment Analysis
- [HooshvareLab/bert-fa-base-uncased-sentiment-digikala](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-digikala)
- [HooshvareLab/bert-fa-base-uncased-sentiment-snappfood](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood)
- [HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-binary](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-binary)
- [HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-multi](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-multi)
#### ParsBERT v2.0 Text Classification
- [HooshvareLab/bert-fa-base-uncased-clf-digimag](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-clf-digimag)
- [HooshvareLab/bert-fa-base-uncased-clf-persiannews](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-clf-persiannews)
#### ParsBERT v2.0 NER
- [HooshvareLab/bert-fa-base-uncased-ner-peyma](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-ner-peyma)
- [HooshvareLab/bert-fa-base-uncased-ner-arman](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-ner-arman)
## Eval results
ParsBERT is evaluated on three NLP downstream tasks: Sentiment Analysis (SA), Text Classification, and Named Entity Recognition (NER). For this matter and due to insufficient resources, two large datasets for SA and two for text classification were manually composed, which are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models for all tasks, improving the state-of-the-art performance in Persian language modeling.
### Sentiment Analysis (SA) Task
| Dataset | ParsBERT v2 | ParsBERT v1 | mBERT | DeepSentiPers |
|:------------------------:|:-----------:|:-----------:|:-----:|:-------------:|
| Digikala User Comments | 81.72 | 81.74* | 80.74 | - |
| SnappFood User Comments | 87.98 | 88.12* | 87.87 | - |
| SentiPers (Multi Class) | 71.31* | 71.11 | - | 69.33 |
| SentiPers (Binary Class) | 92.42* | 92.13 | - | 91.98 |
### Text Classification (TC) Task
| Dataset | ParsBERT v2 | ParsBERT v1 | mBERT |
|:-----------------:|:-----------:|:-----------:|:-----:|
| Digikala Magazine | 93.65* | 93.59 | 90.72 |
| Persian News | 97.44* | 97.19 | 95.79 |
### Named Entity Recognition (NER) Task
| Dataset | ParsBERT v2 | ParsBERT v1 | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:-------:|:-----------:|:-----------:|:-----:|:----------:|:------------:|:--------:|:--------------:|:----------:|
| PEYMA | 93.40* | 93.10 | 86.64 | - | 90.59 | - | 84.00 | - |
| ARMAN | 99.84* | 98.79 | 95.89 | 89.9 | 84.03 | 86.55 | - | 77.45 |
### BibTeX entry and citation info
Please cite in publications as the following:
```bibtex
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
```
## Questions?
Post a Github issue on the [ParsBERT Issues](https://github.com/hooshvare/parsbert/issues) repo.

View File

@ -1,121 +0,0 @@
---
language: sv
---
# Swedish BERT Models
The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
The following three models are currently available:
- **bert-base-swedish-cased** (*v1*) - A BERT trained with the same hyperparameters as first published by Google.
- **bert-base-swedish-cased-ner** (*experimental*) - a BERT fine-tuned for NER using SUC 3.0.
- **albert-base-swedish-cased-alpha** (*alpha*) - A first attempt at an ALBERT for Swedish.
All models are cased and trained with whole word masking.
## Files
| **name** | **files** |
|---------------------------------|-----------|
| bert-base-swedish-cased | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/pytorch_model.bin) |
| bert-base-swedish-cased-ner | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/vocab.txt) [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/pytorch_model.bin) |
| albert-base-swedish-cased-alpha | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/config.json), [sentencepiece model](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/spiece.model), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/pytorch_model.bin) |
TensorFlow model weights will be released soon.
## Usage requirements / installation instructions
The examples below require Huggingface Transformers 2.4.1 and Pytorch 1.3.1 or greater. For Transformers<2.4.0 the tokenizer must be instantiated manually and the `do_lower_case` flag parameter set to `False` and `keep_accents` to `True` (for ALBERT).
To create an environment where the examples can be run, run the following in an terminal on your OS of choice.
```
# git clone https://github.com/Kungbib/swedish-bert-models
# cd swedish-bert-models
# python3 -m venv venv
# source venv/bin/activate
# pip install --upgrade pip
# pip install -r requirements.txt
```
### BERT Base Swedish
A standard BERT base for Swedish trained on a variety of sources. Vocabulary size is ~50k. Using Huggingface Transformers the model can be loaded in Python as follows:
```python
from transformers import AutoModel,AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
```
### BERT base fine-tuned for Swedish NER
This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline the model can be easily instantiated. For Transformer<2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:
```python
from transformers import pipeline
nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')
nlp('Idag släpper KB tre språkmodeller.')
```
Running the Python code above should produce in something like the result below. Entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.
```python
[ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
{ 'word': 'KB', 'score': 0.9814832210540771, 'entity': 'ORG' } ]
```
The BERT tokenizer often splits words into multiple tokens, with the subparts starting with `##`, for example the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue parts back together one can use something like this:
```python
text = 'Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF ' +\
'som spelar fotboll i VM klockan två på kvällen.'
l = []
for token in nlp(text):
if token['word'].startswith('##'):
l[-1]['word'] += token['word'][2:]
else:
l += [ token ]
print(l)
```
Which should result in the following (though less cleanly formatted):
```python
[ { 'word': 'Engelbert', 'score': 0.99..., 'entity': 'PRS'},
{ 'word': 'Volvon', 'score': 0.99..., 'entity': 'OBJ'},
{ 'word': 'Tele2', 'score': 0.99..., 'entity': 'LOC'},
{ 'word': 'Arena', 'score': 0.99..., 'entity': 'LOC'},
{ 'word': 'Djurgården', 'score': 0.99..., 'entity': 'ORG'},
{ 'word': 'IF', 'score': 0.99..., 'entity': 'ORG'},
{ 'word': 'VM', 'score': 0.99..., 'entity': 'EVN'},
{ 'word': 'klockan', 'score': 0.99..., 'entity': 'TME'},
{ 'word': 'två', 'score': 0.99..., 'entity': 'TME'},
{ 'word': 'på', 'score': 0.99..., 'entity': 'TME'},
{ 'word': 'kvällen', 'score': 0.54..., 'entity': 'TME'} ]
```
### ALBERT base
The easiest way to do this is, again, using Huggingface Transformers:
```python
from transformers import AutoModel,AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha'),
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
```
## Acknowledgements ❤️
- Resources from Stockholms University, Umeå University and Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
- Model pretraining was made partly in-house at the KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
- Models are hosted on S3 by Huggingface 🤗

View File

@ -1,121 +0,0 @@
---
language: sv
---
# Swedish BERT Models
The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
The following three models are currently available:
- **bert-base-swedish-cased** (*v1*) - A BERT trained with the same hyperparameters as first published by Google.
- **bert-base-swedish-cased-ner** (*experimental*) - a BERT fine-tuned for NER using SUC 3.0.
- **albert-base-swedish-cased-alpha** (*alpha*) - A first attempt at an ALBERT for Swedish.
All models are cased and trained with whole word masking.
## Files
| **name** | **files** |
|---------------------------------|-----------|
| bert-base-swedish-cased | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/pytorch_model.bin) |
| bert-base-swedish-cased-ner | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/vocab.txt) [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/pytorch_model.bin) |
| albert-base-swedish-cased-alpha | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/config.json), [sentencepiece model](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/spiece.model), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/pytorch_model.bin) |
TensorFlow model weights will be released soon.
## Usage requirements / installation instructions
The examples below require Huggingface Transformers 2.4.1 and Pytorch 1.3.1 or greater. For Transformers<2.4.0 the tokenizer must be instantiated manually and the `do_lower_case` flag parameter set to `False` and `keep_accents` to `True` (for ALBERT).
To create an environment where the examples can be run, run the following in an terminal on your OS of choice.
```
# git clone https://github.com/Kungbib/swedish-bert-models
# cd swedish-bert-models
# python3 -m venv venv
# source venv/bin/activate
# pip install --upgrade pip
# pip install -r requirements.txt
```
### BERT Base Swedish
A standard BERT base for Swedish trained on a variety of sources. Vocabulary size is ~50k. Using Huggingface Transformers the model can be loaded in Python as follows:
```python
from transformers import AutoModel,AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
```
### BERT base fine-tuned for Swedish NER
This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline the model can be easily instantiated. For Transformer<2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:
```python
from transformers import pipeline
nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')
nlp('Idag släpper KB tre språkmodeller.')
```
Running the Python code above should produce in something like the result below. Entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.
```python
[ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
{ 'word': 'KB', 'score': 0.9814832210540771, 'entity': 'ORG' } ]
```
The BERT tokenizer often splits words into multiple tokens, with the subparts starting with `##`, for example the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue parts back together one can use something like this:
```python
text = 'Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF ' +\
'som spelar fotboll i VM klockan två på kvällen.'
l = []
for token in nlp(text):
if token['word'].startswith('##'):
l[-1]['word'] += token['word'][2:]
else:
l += [ token ]
print(l)
```
Which should result in the following (though less cleanly formatted):
```python
[ { 'word': 'Engelbert', 'score': 0.99..., 'entity': 'PRS'},
{ 'word': 'Volvon', 'score': 0.99..., 'entity': 'OBJ'},
{ 'word': 'Tele2', 'score': 0.99..., 'entity': 'LOC'},
{ 'word': 'Arena', 'score': 0.99..., 'entity': 'LOC'},
{ 'word': 'Djurgården', 'score': 0.99..., 'entity': 'ORG'},
{ 'word': 'IF', 'score': 0.99..., 'entity': 'ORG'},
{ 'word': 'VM', 'score': 0.99..., 'entity': 'EVN'},
{ 'word': 'klockan', 'score': 0.99..., 'entity': 'TME'},
{ 'word': 'två', 'score': 0.99..., 'entity': 'TME'},
{ 'word': 'på', 'score': 0.99..., 'entity': 'TME'},
{ 'word': 'kvällen', 'score': 0.54..., 'entity': 'TME'} ]
```
### ALBERT base
The easiest way to do this is, again, using Huggingface Transformers:
```python
from transformers import AutoModel,AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha'),
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
```
## Acknowledgements ❤️
- Resources from Stockholms University, Umeå University and Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
- Model pretraining was made partly in-house at the KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
- Models are hosted on S3 by Huggingface 🤗

View File

@ -1,121 +0,0 @@
---
language: sv
---
# Swedish BERT Models
The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on aproximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
The following three models are currently available:
- **bert-base-swedish-cased** (*v1*) - A BERT trained with the same hyperparameters as first published by Google.
- **bert-base-swedish-cased-ner** (*experimental*) - a BERT fine-tuned for NER using SUC 3.0.
- **albert-base-swedish-cased-alpha** (*alpha*) - A first attempt at an ALBERT for Swedish.
All models are cased and trained with whole word masking.
## Files
| **name** | **files** |
|---------------------------------|-----------|
| bert-base-swedish-cased | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/pytorch_model.bin) |
| bert-base-swedish-cased-ner | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/vocab.txt) [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/pytorch_model.bin) |
| albert-base-swedish-cased-alpha | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/config.json), [sentencepiece model](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/spiece.model), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/pytorch_model.bin) |
TensorFlow model weights will be released soon.
## Usage requirements / installation instructions
The examples below require Huggingface Transformers 2.4.1 and Pytorch 1.3.1 or greater. For Transformers<2.4.0 the tokenizer must be instantiated manually and the `do_lower_case` flag parameter set to `False` and `keep_accents` to `True` (for ALBERT).
To create an environment where the examples can be run, run the following in an terminal on your OS of choice.
```
# git clone https://github.com/Kungbib/swedish-bert-models
# cd swedish-bert-models
# python3 -m venv venv
# source venv/bin/activate
# pip install --upgrade pip
# pip install -r requirements.txt
```
### BERT Base Swedish
A standard BERT base for Swedish trained on a variety of sources. Vocabulary size is ~50k. Using Huggingface Transformers the model can be loaded in Python as follows:
```python
from transformers import AutoModel,AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
```
### BERT base fine-tuned for Swedish NER
This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline the model can be easily instantiated. For Transformer<2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:
```python
from transformers import pipeline
nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')
nlp('Idag släpper KB tre språkmodeller.')
```
Running the Python code above should produce in something like the result below. Entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.
```python
[ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
{ 'word': 'KB', 'score': 0.9814832210540771, 'entity': 'ORG' } ]
```
The BERT tokenizer often splits words into multiple tokens, with the subparts starting with `##`, for example the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue parts back together one can use something like this:
```python
text = 'Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF ' +\
'som spelar fotboll i VM klockan två på kvällen.'
l = []
for token in nlp(text):
if token['word'].startswith('##'):
l[-1]['word'] += token['word'][2:]
else:
l += [ token ]
print(l)
```
Which should result in the following (though less cleanly formated):
```python
[ { 'word': 'Engelbert', 'score': 0.99..., 'entity': 'PRS'},
{ 'word': 'Volvon', 'score': 0.99..., 'entity': 'OBJ'},
{ 'word': 'Tele2', 'score': 0.99..., 'entity': 'LOC'},
{ 'word': 'Arena', 'score': 0.99..., 'entity': 'LOC'},
{ 'word': 'Djurgården', 'score': 0.99..., 'entity': 'ORG'},
{ 'word': 'IF', 'score': 0.99..., 'entity': 'ORG'},
{ 'word': 'VM', 'score': 0.99..., 'entity': 'EVN'},
{ 'word': 'klockan', 'score': 0.99..., 'entity': 'TME'},
{ 'word': 'två', 'score': 0.99..., 'entity': 'TME'},
{ 'word': 'på', 'score': 0.99..., 'entity': 'TME'},
{ 'word': 'kvällen', 'score': 0.54..., 'entity': 'TME'} ]
```
### ALBERT base
The easisest way to do this is, again, using Huggingface Transformers:
```python
from transformers import AutoModel,AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha'),
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
```
## Acknowledgements ❤️
- Resources from Stockholms University, Umeå University and Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
- Model pretraining was made partly in-house at the KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
- Models are hosted on S3 by Huggingface 🤗

View File

@ -1,141 +0,0 @@
---
language: it
---
# GePpeTto GPT2 Model 🇮🇹
Pretrained GPT2 117M model for Italian.
You can find further details in the paper:
Lorenzo De Mattei, Michele Cafagna, Felice DellOrletta, Malvina Nissim, Marco Guerini "GePpeTto Carves Italian into a Language Model", arXiv preprint. Pdf available at: https://arxiv.org/abs/2004.14253
## Pretraining Corpus
The pretraining set comprises two main sources. The first one is a dump of Italian Wikipedia (November 2019),
consisting of 2.8GB of text. The second one is the ItWac corpus (Baroni et al., 2009), which amounts to 11GB of web
texts. This collection provides a mix of standard and less standard Italian, on a rather wide chronological span,
with older texts than the Wikipedia dump (the latter stretches only to the late 2000s).
## Pretraining details
This model was trained using GPT2's Hugging Face implemenation on 4 NVIDIA Tesla T4 GPU for 620k steps.
Training parameters:
- GPT-2 small configuration
- vocabulary size: 30k
- Batch size: 32
- Block size: 100
- Adam Optimizer
- Initial learning rate: 5e-5
- Warm up steps: 10k
## Perplexity scores
| Domain | Perplexity |
|---|---|
| Wikipedia | 26.1052 |
| ItWac | 30.3965 |
| Legal | 37.2197 |
| News | 45.3859 |
| Social Media | 84.6408 |
For further details, qualitative analysis and human evaluation check out: https://arxiv.org/abs/2004.14253
## Load Pretrained Model
You can use this model by installing Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import GPT2Tokenizer, GPT2Model
model = GPT2Model.from_pretrained('LorenzoDeMattei/GePpeTto')
tokenizer = GPT2Tokenizer.from_pretrained(
'LorenzoDeMattei/GePpeTto',
)
```
## Example using GPT2LMHeadModel
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline, GPT2Tokenizer
tokenizer = AutoTokenizer.from_pretrained("LorenzoDeMattei/GePpeTto")
model = AutoModelWithLMHead.from_pretrained("LorenzoDeMattei/GePpeTto")
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
prompts = [
"Wikipedia Geppetto",
"Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso"]
samples_outputs = text_generator(
prompts,
do_sample=True,
max_length=50,
top_k=50,
top_p=0.95,
num_return_sequences=3
)
for i, sample_outputs in enumerate(samples_outputs):
print(100 * '-')
print("Prompt:", prompts[i])
for sample_output in sample_outputs:
print("Sample:", sample_output['generated_text'])
print()
```
Output is,
```
----------------------------------------------------------------------------------------------------
Prompt: Wikipedia Geppetto
Sample: Wikipedia Geppetto rosso (film 1920)
Geppetto rosso ("The Smokes in the Black") è un film muto del 1920 diretto da Henry H. Leonard.
Il film fu prodotto dalla Selig Poly
Sample: Wikipedia Geppetto
Geppetto ("Geppetto" in piemontese) è un comune italiano di 978 abitanti della provincia di Cuneo in Piemonte.
L'abitato, che si trova nel versante valtellinese, si sviluppa nella
Sample: Wikipedia Geppetto di Natale (romanzo)
Geppetto di Natale è un romanzo di Mario Caiano, pubblicato nel 2012.
----------------------------------------------------------------------------------------------------
Prompt: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso
Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso. Il burattino riesce a scappare. Dopo aver trovato un prezioso sacchetto si reca
Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso, e l'unico che lo possiede, ma, di fronte a tutte queste prove
Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso: - A voi gli occhi, le guance! A voi il mio pezzo!
```
## Citation
Please use the following bibtex entry:
```
@misc{mattei2020geppetto,
title={GePpeTto Carves Italian into a Language Model},
author={Lorenzo De Mattei and Michele Cafagna and Felice Dell'Orletta and Malvina Nissim and Marco Guerini},
year={2020},
eprint={2004.14253},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## References
Marco Baroni, Silvia Bernardini, Adriano Ferraresi,
and Eros Zanchetta. 2009. The WaCky wide web: a
collection of very large linguistically processed webcrawled corpora. Language resources and evaluation, 43(3):209226.

View File

@ -1,47 +0,0 @@
## About the model
The model has been trained on a collection of 500k articles with headings. Its purpose is to create a one-line heading suitable for the given article.
Sample code with a WikiNews article:
```python
import torch
from transformers import T5ForConditionalGeneration,T5Tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = T5ForConditionalGeneration.from_pretrained("Michau/t5-base-en-generate-headline")
tokenizer = T5Tokenizer.from_pretrained("Michau/t5-base-en-generate-headline")
model = model.to(device)
article = '''
Very early yesterday morning, the United States President Donald Trump reported he and his wife First Lady Melania Trump tested positive for COVID-19. Officials said the Trumps' 14-year-old son Barron tested negative as did First Family and Senior Advisors Jared Kushner and Ivanka Trump.
Trump took to social media, posting at 12:54 am local time (0454 UTC) on Twitter, "Tonight, [Melania] and I tested positive for COVID-19. We will begin our quarantine and recovery process immediately. We will get through this TOGETHER!" Yesterday afternoon Marine One landed on the White House's South Lawn flying Trump to Walter Reed National Military Medical Center (WRNMMC) in Bethesda, Maryland.
Reports said both were showing "mild symptoms". Senior administration officials were tested as people were informed of the positive test. Senior advisor Hope Hicks had tested positive on Thursday.
Presidential physician Sean Conley issued a statement saying Trump has been given zinc, vitamin D, Pepcid and a daily Aspirin. Conley also gave a single dose of the experimental polyclonal antibodies drug from Regeneron Pharmaceuticals.
According to official statements, Trump, now operating from the WRNMMC, is to continue performing his duties as president during a 14-day quarantine. In the event of Trump becoming incapacitated, Vice President Mike Pence could take over the duties of president via the 25th Amendment of the US Constitution. The Pence family all tested negative as of yesterday and there were no changes regarding Pence's campaign events.
'''
text = "headline: " + article
max_len = 256
encoding = tokenizer.encode_plus(text, return_tensors = "pt")
input_ids = encoding["input_ids"].to(device)
attention_masks = encoding["attention_mask"].to(device)
beam_outputs = model.generate(
input_ids = input_ids,
attention_mask = attention_masks,
max_length = 64,
num_beams = 3,
early_stopping = True,
)
result = tokenizer.decode(beam_outputs[0])
print(result)
```
Result:
```Trump and First Lady Melania Test Positive for COVID-19```

View File

@ -1,74 +0,0 @@
---
language: tn
---
# TswanaBert
Pretrained model on the Tswana language using a masked language modeling (MLM) objective.
## Model Description.
TswanaBERT is a transformer model pre-trained on a corpus of Setswana in a self-supervised fashion by masking part of the input words and training to predict the masks by using byte-level tokens.
## Intended uses & limitations
The model can be used for either masked language modeling or next word prediction. It can also be fine-tuned on a specific down-stream NLP application.
#### How to use
```python
>>> from transformers import pipeline
>>> from transformers import AutoTokenizer, AutoModelWithLMHead
>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/TswanaBert")
>>> model = AutoModelWithLMHead.from_pretrained("MoseliMotsoehli/TswanaBert")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker("Ntshopotse <mask> e godile.")
[{'score': 0.32749542593955994,
'sequence': '<s>Ntshopotse setse e godile.</s>',
'token': 538,
'token_str': 'Ġsetse'},
{'score': 0.060260992497205734,
'sequence': '<s>Ntshopotse le e godile.</s>',
'token': 270,
'token_str': 'Ġle'},
{'score': 0.058460816740989685,
'sequence': '<s>Ntshopotse bone e godile.</s>',
'token': 364,
'token_str': 'Ġbone'},
{'score': 0.05694682151079178,
'sequence': '<s>Ntshopotse ga e godile.</s>',
'token': 298,
'token_str': 'Ġga'},
{'score': 0.0565204992890358,
'sequence': '<s>Ntshopotse, e godile.</s>',
'token': 16,
'token_str': ','}]
```
#### Limitations and bias
The model is trained on a relatively small collection of setwana, mostly from news articles and creative writtings, and so is not representative enough of the language as yet.
## Training data
1. The largest portion of this dataset (10k) sentences of text, comes from the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download)
2. I Then added SABC news headlines collected by Marivate Vukosi, & Sefara Tshephisho, (2020) that is generously made available on [zenoodo](http://doi.org/10.5281/zenodo.3668495 ). This added 185 tswana sentences to my corpus.
3. I went on to add 300 more sentences by scrapping following news sites and blogs that mosty originate in Botswana. I actively continue to expand the dataset.
* http://setswana.blogspot.com/
* https://omniglot.com/writing/tswana.php
* http://www.dailynews.gov.bw/
* http://www.mmegi.bw/index.php
* https://tsena.co.bw
* http://www.botswana.co.za/Cultural_Issues-travel/botswana-country-guide-en-route.html
* https://www.poemhunter.com/poem/2013-setswana/
https://www.poemhunter.com/poem/ngwana-wa-mosetsana/
### BibTeX entry and citation info
```bibtex
@inproceedings{author = {Moseli Motsoehli},
year={2020}
}
```

View File

@ -1,56 +0,0 @@
---
language: zu
---
# zuBERTa
zuBERTa is a RoBERTa style transformer language model trained on zulu text.
## Intended uses & limitations
The model can be used for getting embeddings to use on a down-stream task such as question answering.
#### How to use
```python
>>> from transformers import pipeline
>>> from transformers import AutoTokenizer, AutoModelWithLMHead
>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> model = AutoModelWithLMHead.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker("Abafika eNkandla bafika sebeholwa <mask> uMpongo kaZingelwayo.")
[
{
"sequence": "<s>Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo.</s>",
"score": 0.050459690392017365,
"token": 555,
"token_str": "Ġkhona"
},
{
"sequence": "<s>Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo.</s>",
"score": 0.03668094798922539,
"token": 2321,
"token_str": "Ġinkosi"
},
{
"sequence": "<s>Abafika eNkandla bafika sebeholwa ubukhosi uMpongo kaZingelwayo.</s>",
"score": 0.028774697333574295,
"token": 5101,
"token_str": "Ġubukhosi"
}
]
```
## Training data
1. 30k sentences of text, came from the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download) of zulu 2018. These were collected from news articles and creative writtings.
2. ~7500 articles of human generated translations were scraped from the zulu [wikipedia](https://zu.wikipedia.org/wiki/Special:AllPages).
### BibTeX entry and citation info
```bibtex
@inproceedings{author = {Moseli Motsoehli},
title = {Towards transformation of Southern African language models through transformers.},
year={2020}
}
```

View File

@ -1,118 +0,0 @@
---
language: it
---
# UmBERTo Commoncrawl Cased
[UmBERTo](https://github.com/musixmatchresearch/umberto) is a Roberta-based Language Model trained on large Italian Corpora and uses two innovative approaches: SentencePiece and Whole Word Masking. Now available at [github.com/huggingface/transformers](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1)
<p align="center">
<img src="https://user-images.githubusercontent.com/7140210/72913702-d55a8480-3d3d-11ea-99fc-f2ef29af4e72.jpg" width="700"> </br>
Marco Lodola, Monument to Umberto Eco, Alessandria 2019
</p>
## Dataset
UmBERTo-Commoncrawl-Cased utilizes the Italian subcorpus of [OSCAR](https://traces1.inria.fr/oscar/) as training set of the language model. We used deduplicated version of the Italian corpus that consists in 70 GB of plain text data, 210M sentences with 11B words where the sentences have been filtered and shuffled at line level in order to be used for NLP research.
## Pre-trained model
| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
| ------ | ------ | ------ | ------ | ------ |------ | ------ |
| `umberto-commoncrawl-cased-v1` | YES | YES | SPM | 32K | 125k | [Link](http://bit.ly/35zO7GH) |
This model was trained with [SentencePiece](https://github.com/google/sentencepiece) and Whole Word Masking.
## Downstream Tasks
These results refers to umberto-commoncrawl-cased model. All details are at [Umberto](https://github.com/musixmatchresearch/umberto) Official Page.
#### Named Entity Recognition (NER)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **ICAB-EvalITA07** | **87.565** | 86.596 | 88.556 | 98.690 |
| **WikiNER-ITA** | **92.531** | 92.509 | 92.553 | 99.136 |
#### Part of Speech (POS)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **UD_Italian-ISDT** | 98.870 | 98.861 | 98.879 | **98.977** |
| **UD_Italian-ParTUT** | 98.786 | 98.812 | 98.760 | **98.903** |
## Usage
##### Load UmBERTo with AutoModel, Autotokenizer:
```python
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0) # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output
```
##### Predict masked token:
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="Musixmatch/umberto-commoncrawl-cased-v1",
tokenizer="Musixmatch/umberto-commoncrawl-cased-v1"
)
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
# {'sequence': '<s> Umberto Eco è considerato un grande scrittore</s>', 'score': 0.18599839508533478, 'token': 5032}
# {'sequence': '<s> Umberto Eco è stato un grande scrittore</s>', 'score': 0.17816807329654694, 'token': 471}
# {'sequence': '<s> Umberto Eco è sicuramente un grande scrittore</s>', 'score': 0.16565583646297455, 'token': 2654}
# {'sequence': '<s> Umberto Eco è indubbiamente un grande scrittore</s>', 'score': 0.0932890921831131, 'token': 17908}
# {'sequence': '<s> Umberto Eco è certamente un grande scrittore</s>', 'score': 0.054701317101716995, 'token': 5269}
```
## Citation
All of the original datasets are publicly available or were released with the owners' grant. The datasets are all released under a CC0 or CCBY license.
* UD Italian-ISDT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ISDT)
* UD Italian-ParTUT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ParTUT)
* I-CAB (Italian Content Annotation Bank), EvalITA [Page](http://www.evalita.it/)
* WIKINER [Page](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500) , [Paper](https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub)
```
@inproceedings {magnini2006annotazione,
title = {Annotazione di contenuti concettuali in un corpus italiano: I - CAB},
author = {Magnini,Bernardo and Cappelli,Amedeo and Pianta,Emanuele and Speranza,Manuela and Bartalesi Lenzi,V and Sprugnoli,Rachele and Romano,Lorenza and Girardi,Christian and Negri,Matteo},
booktitle = {Proc.of SILFI 2006},
year = {2006}
}
@inproceedings {magnini2006cab,
title = {I - CAB: the Italian Content Annotation Bank.},
author = {Magnini,Bernardo and Pianta,Emanuele and Girardi,Christian and Negri,Matteo and Romano,Lorenza and Speranza,Manuela and Lenzi,Valentina Bartalesi and Sprugnoli,Rachele},
booktitle = {LREC},
pages = {963--968},
year = {2006},
organization = {Citeseer}
}
```
## Authors
**Loreto Parisi**: `loreto at musixmatch dot com`, [loretoparisi](https://github.com/loretoparisi)
**Simone Francia**: `simone.francia at musixmatch dot com`, [simonefrancia](https://github.com/simonefrancia)
**Paolo Magnani**: `paul.magnani95 at gmail dot com`, [paulthemagno](https://github.com/paulthemagno)
## About Musixmatch AI
![Musxmatch Ai mac app icon-128](https://user-images.githubusercontent.com/163333/72244273-396aa380-35ee-11ea-894b-4ea48230c02b.png)
We do Machine Learning and Artificial Intelligence @[musixmatch](https://twitter.com/Musixmatch)
Follow us on [Twitter](https://twitter.com/musixmatchai) [Github](https://github.com/musixmatchresearch)

View File

@ -1,117 +0,0 @@
---
language: it
---
# UmBERTo Wikipedia Uncased
[UmBERTo](https://github.com/musixmatchresearch/umberto) is a Roberta-based Language Model trained on large Italian Corpora and uses two innovative approaches: SentencePiece and Whole Word Masking. Now available at [github.com/huggingface/transformers](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1)
<p align="center">
<img src="https://user-images.githubusercontent.com/7140210/72913702-d55a8480-3d3d-11ea-99fc-f2ef29af4e72.jpg" width="700"> </br>
Marco Lodola, Monument to Umberto Eco, Alessandria 2019
</p>
## Dataset
UmBERTo-Wikipedia-Uncased Training is trained on a relative small corpus (~7GB) extracted from [Wikipedia-ITA](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/).
## Pre-trained model
| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
| ------ | ------ | ------ | ------ | ------ |------ | ------ |
| `umberto-wikipedia-uncased-v1` | YES | YES | SPM | 32K | 100k | [Link](http://bit.ly/35wbSj6) |
This model was trained with [SentencePiece](https://github.com/google/sentencepiece) and Whole Word Masking.
## Downstream Tasks
These results refers to umberto-wikipedia-uncased model. All details are at [Umberto](https://github.com/musixmatchresearch/umberto) Official Page.
#### Named Entity Recognition (NER)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ----- |
| **ICAB-EvalITA07** | **86.240** | 85.939 | 86.544 | 98.534 |
| **WikiNER-ITA** | **90.483** | 90.328 | 90.638 | 98.661 |
#### Part of Speech (POS)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **UD_Italian-ISDT** | 98.563 | 98.508 | 98.618 | **98.717** |
| **UD_Italian-ParTUT** | 97.810 | 97.835 | 97.784 | **98.060** |
## Usage
##### Load UmBERTo Wikipedia Uncased with AutoModel, Autotokenizer:
```python
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0) # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output
```
##### Predict masked token:
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="Musixmatch/umberto-wikipedia-uncased-v1",
tokenizer="Musixmatch/umberto-wikipedia-uncased-v1"
)
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
# {'sequence': '<s> umberto eco è stato un grande scrittore</s>', 'score': 0.5784581303596497, 'token': 361}
# {'sequence': '<s> umberto eco è anche un grande scrittore</s>', 'score': 0.33813193440437317, 'token': 269}
# {'sequence': '<s> umberto eco è considerato un grande scrittore</s>', 'score': 0.027196012437343597, 'token': 3236}
# {'sequence': '<s> umberto eco è diventato un grande scrittore</s>', 'score': 0.013716378249228, 'token': 5742}
# {'sequence': '<s> umberto eco è inoltre un grande scrittore</s>', 'score': 0.010662357322871685, 'token': 1030}
```
## Citation
All of the original datasets are publicly available or were released with the owners' grant. The datasets are all released under a CC0 or CCBY license.
* UD Italian-ISDT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ISDT)
* UD Italian-ParTUT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ParTUT)
* I-CAB (Italian Content Annotation Bank), EvalITA [Page](http://www.evalita.it/)
* WIKINER [Page](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500) , [Paper](https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub)
```
@inproceedings {magnini2006annotazione,
title = {Annotazione di contenuti concettuali in un corpus italiano: I - CAB},
author = {Magnini,Bernardo and Cappelli,Amedeo and Pianta,Emanuele and Speranza,Manuela and Bartalesi Lenzi,V and Sprugnoli,Rachele and Romano,Lorenza and Girardi,Christian and Negri,Matteo},
booktitle = {Proc.of SILFI 2006},
year = {2006}
}
@inproceedings {magnini2006cab,
title = {I - CAB: the Italian Content Annotation Bank.},
author = {Magnini,Bernardo and Pianta,Emanuele and Girardi,Christian and Negri,Matteo and Romano,Lorenza and Speranza,Manuela and Lenzi,Valentina Bartalesi and Sprugnoli,Rachele},
booktitle = {LREC},
pages = {963--968},
year = {2006},
organization = {Citeseer}
}
```
## Authors
**Loreto Parisi**: `loreto at musixmatch dot com`, [loretoparisi](https://github.com/loretoparisi)
**Simone Francia**: `simone.francia at musixmatch dot com`, [simonefrancia](https://github.com/simonefrancia)
**Paolo Magnani**: `paul.magnani95 at gmail dot com`, [paulthemagno](https://github.com/paulthemagno)
## About Musixmatch AI
![Musxmatch Ai mac app icon-128](https://user-images.githubusercontent.com/163333/72244273-396aa380-35ee-11ea-894b-4ea48230c02b.png)
We do Machine Learning and Artificial Intelligence @[musixmatch](https://twitter.com/Musixmatch)
Follow us on [Twitter](https://twitter.com/musixmatchai) [Github](https://github.com/musixmatchresearch)

View File

@ -1,54 +0,0 @@
# MS-BERT
## Introduction
This repository provides codes and models of MS-BERT.
MS-BERT was pre-trained on notes from neurological examination for Multiple Sclerosis (MS) patients at St. Michael's Hospital in Toronto, Canada.
## Data
The dataset contained approximately 75,000 clinical notes, for about 5000 patients, totaling to over 35.7 million words.
These notes were collected from patients who visited St. Michael's Hospital MS Clinic between 2015 to 2019.
The notes contained a variety of information pertaining to a neurological exam.
For example, a note can contain information on the patient's condition, their progress over time and diagnosis.
The gender split within the dataset was observed to be 72% female and 28% male ([which reflects the natural discrepancy seen in MS][1]).
Further sections will describe how MS-BERT was pre trained through the use of these clinically relevant and rich neurological notes.
## Data pre-processing
The data was pre-processed to remove any identifying information. This includes information on: patient names, doctor names, hospital names, patient identification numbers, phone numbers, addresses, and time. In order to de-identify the information, we used a curated database that contained patient and doctor information. This curated database was paired with regular expressions to find and remove any identifying pieces of information. Each of these identifiers were replaced with a specific token. These tokens were chosen based on three criteria: (1) they belong to the current BERT vocab, (2), they have relatively the same semantic meaning as the word they are replacing, and (3), the token is not found in the original unprocessed dataset. The replacements that met the criteria above were as follows:
Female first names -> Lucie
Male first names -> Ezekiel
Last/family names -> Salamanca.
Dates -> 2010s
Patient IDs -> 999
Phone numbers -> 1718
Addresses -> Silesia
Time -> 1610
Locations/Hospital/Clinic names -> Troy
## Pre-training
The starting point for our model is the already pre-trained and fine-tuned BLUE-BERT base. We further pre-train it using the masked language modelling task from the huggingface transformers [library](https://github.com/huggingface).
The hyperparameters can be found in the config file in this repository or [here](https://s3.amazonaws.com/models.huggingface.co/bert/NLP4H/ms_bert/config.json)
## Acknowledgements
We would like to thank the researchers and staff at the Data Science and Advanced Analytics (DSAA) department, St. Michaels Hospital, for providing consistent support and guidance throughout this project.
We would also like to thank Dr. Marzyeh Ghassemi, Taylor Killan, Nathan Ng and Haoran Zhang for providing us the opportunity to work on this exciting project.
## Disclaimer
MS-BERT shows the results of research conducted at the Data Science and Advanced Analytics (DSAA) department, St. Michaels Hospital. The results produced by MS-BERT are not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not make decisions about their health solely on the basis of the results produced by MS-BERT. St. Michaels Hospital does not independently verify the validity or utility of the results produced by MS-BERT. If you have questions about the results produced by MS-BERT please consult a healthcare professional. If you would like more information about the research conducted at DSAA please contact [Zhen Yang](mailto:zhen.yang@unityhealth.to). If you would like more information on neurological examination notes please contact [Dr. Tony Antoniou](mailto:tony.antoniou@unityhealth.to) or [Dr. Jiwon Oh](mailto:jiwon.oh@unityhealth.to) from the MS clinic at St. Michael's Hospital.
[1]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3707353/

View File

@ -1,28 +0,0 @@
---
language: kn
---
# Welcome to KanBERTo (ಕನ್ಬರ್ಟೋ)
## Model Description
> This is a small language model for [Kannada](https://en.wikipedia.org/wiki/Kannada) language with 1M data samples taken from
[OSCAR page](https://traces1.inria.fr/oscar/files/compressed-orig/kn.txt.gz)
## Training params
- **Dataset** - 1M data samples are used to train this model from OSCAR page(https://traces1.inria.fr/oscar/) eventhough data set is of 1.7 GB due to resource constraint to train
I have picked only 1M data from the total 1.7GB data set. If you are interested in collaboration and have computational resources to train on you are most welcome to do so.
- **Preprocessing** - ByteLevelBPETokenizer is used to tokenize the sentences at character level and vocabulary size is set to 52k as per standard values given by 🤗
- **Hyperparameters** - __ByteLevelBPETokenizer__ : vocabulary size = 52_000 and min_frequency = 2
__Trainer__ : num_train_epochs=12 - trained for 12 epochs
per_gpu_train_batch_size=64 - batch size for the datasamples is 64
save_steps=10_000 - save model for every 10k steps
save_total_limit=2 - save limit is set for 2
**Intended uses & limitations**
this is for anyone who wants to make use of kannada language models for various tasks like language generation, translation and many more use cases.
**Whatever else is helpful!**
If you are intersted in collaboration feel free to reach me [Naveen](mailto:naveen.maltesh@gmail.com)

View File

@ -1,26 +0,0 @@
# BERT-Small CORD-19 fine-tuned on SQuAD 2.0
[bert-small-cord19 model](https://huggingface.co/NeuML/bert-small-cord19) fine-tuned on SQuAD 2.0
## Building the model
```bash
python run_squad.py
--model_type bert
--model_name_or_path bert-small-cord19
--do_train
--do_eval
--do_lower_case
--version_2_with_negative
--train_file train-v2.0.json
--predict_file dev-v2.0.json
--per_gpu_train_batch_size 8
--learning_rate 3e-5
--num_train_epochs 3.0
--max_seq_length 384
--doc_stride 128
--output_dir bert-small-cord19-squad2
--save_steps 0
--threads 8
--overwrite_cache
--overwrite_output_dir

View File

@ -1,25 +0,0 @@
# BERT-Small fine-tuned on CORD-19 dataset
[BERT L6_H-512_A-8 model](https://huggingface.co/google/bert_uncased_L-6_H-512_A-8) fine-tuned on the [CORD-19 dataset](https://www.semanticscholar.org/cord19).
## CORD-19 data subset
The training data for this dataset is stored as a [Kaggle dataset](https://www.kaggle.com/davidmezzetti/cord19-qa?select=cord19.txt). The training
data is a subset of the full corpus, focusing on high-quality, study-design detected articles.
## Building the model
```bash
python run_language_modeling.py
--model_type bert
--model_name_or_path google/bert_uncased_L-6_H-512_A-8
--do_train
--mlm
--line_by_line
--block_size 512
--train_data_file cord19.txt
--per_gpu_train_batch_size 4
--learning_rate 3e-5
--num_train_epochs 3.0
--output_dir bert-small-cord19
--save_steps 0
--overwrite_output_dir

View File

@ -1,63 +0,0 @@
# BERT-Small fine-tuned on CORD-19 QA dataset
[bert-small-cord19-squad model](https://huggingface.co/NeuML/bert-small-cord19-squad2) fine-tuned on the [CORD-19 QA dataset](https://www.kaggle.com/davidmezzetti/cord19-qa?select=cord19-qa.json).
## CORD-19 QA dataset
The CORD-19 QA dataset is a SQuAD 2.0 formatted list of question, context, answer combinations covering the [CORD-19 dataset](https://www.semanticscholar.org/cord19).
## Building the model
```bash
python run_squad.py \
--model_type bert \
--model_name_or_path bert-small-cord19-squad \
--do_train \
--do_lower_case \
--version_2_with_negative \
--train_file cord19-qa.json \
--per_gpu_train_batch_size 8 \
--learning_rate 5e-5 \
--num_train_epochs 10.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir bert-small-cord19qa \
--save_steps 0 \
--threads 8 \
--overwrite_cache \
--overwrite_output_dir
```
## Testing the model
Example usage below:
```python
from transformers import pipeline
qa = pipeline(
"question-answering",
model="NeuML/bert-small-cord19qa",
tokenizer="NeuML/bert-small-cord19qa"
)
qa({
"question": "What is the median incubation period?",
"context": "The incubation period is around 5 days (range: 4-7 days) with a maximum of 12-13 day"
})
qa({
"question": "What is the incubation period range?",
"context": "The incubation period is around 5 days (range: 4-7 days) with a maximum of 12-13 day"
})
qa({
"question": "What type of surfaces does it persist?",
"context": "The virus can survive on surfaces for up to 72 hours such as plastic and stainless steel ."
})
```
```json
{"score": 0.5970273583242793, "start": 32, "end": 38, "answer": "5 days"}
{"score": 0.999555868193891, "start": 39, "end": 56, "answer": "(range: 4-7 days)"}
{"score": 0.9992726505196998, "start": 61, "end": 88, "answer": "plastic and stainless steel"}
```

View File

@ -1,55 +0,0 @@
---
language: vn
---
# BERT for Vietnamese is trained on more 20 GB news dataset
Apply for task sentiment analysis on using [AIViVN's comments dataset](https://www.aivivn.com/contests/6)
The model achieved 0.90268 on the public leaderboard, (winner's score is 0.90087)
Bert4news is used for a toolkit Vietnames(segmentation and Named Entity Recognition) at ViNLPtoolkit(https://github.com/bino282/ViNLP)
***************New Mar 11 , 2020 ***************
**[BERT](https://github.com/google-research/bert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
We use word sentencepiece, use basic bert tokenization and same config with bert base with lowercase = False.
You can download trained model:
- [tensorflow](https://drive.google.com/file/d/1X-sRDYf7moS_h61J3L79NkMVGHP-P-k5/view?usp=sharing).
- [pytorch](https://drive.google.com/file/d/11aFSTpYIurn-oI2XpAmcCTccB_AonMOu/view?usp=sharing).
Use with huggingface/transformers
``` bash
import torch
from transformers import AutoTokenizer,AutoModel
tokenizer= AutoTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
bert_model = AutoModel.from_pretrained("NlpHUST/vibert4news-base-cased")
line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
input_id = tokenizer.encode(line,add_special_tokens = True)
att_mask = [int(token_id > 0) for token_id in input_id]
input_ids = torch.tensor([input_id])
att_masks = torch.tensor([att_mask])
with torch.no_grad():
features = bert_model(input_ids,att_masks)
print(features)
```
Run training with base config
``` bash
python train_pytorch.py \
--model_path=bert4news.pytorch \
--max_len=200 \
--batch_size=16 \
--epochs=6 \
--lr=2e-5
```
### Contact information
For personal communication related to this project, please contact Nha Nguyen Van (nha282@gmail.com).

View File

@ -1,109 +0,0 @@
---
language: he
thumbnail: https://avatars1.githubusercontent.com/u/3617152?norod.jpg
widget:
- text: "<|startoftext|>החוק השני של מועדון קרב הוא"
- text: "<|startoftext|>ראש הממשלה בן גוריון"
- text: "<|startoftext|>למידת מכונה (סרט)"
- text: "<|startoftext|>מנשה פומפרניקל"
- text: "<|startoftext|>אי שוויון "
license: mit
---
# hewiki-articles-distilGPT2py-il
## A tiny GPT2 model for generating Hebrew text
A distilGPT2 sized model. <br>
Training data was hewiki-20200701-pages-articles-multistream.xml.bz2 from https://dumps.wikimedia.org/hewiki/20200701/ <br>
XML has been converted to plain text using Wikipedia Extractor http://medialab.di.unipi.it/wiki/Wikipedia_Extractor <br>
I then added <|startoftext|> and <|endoftext|> markers and deleted empty lines. <br>
#### How to use
```python
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("Norod78/hewiki-articles-distilGPT2py-il")
model = GPT2LMHeadModel.from_pretrained("Norod78/hewiki-articles-distilGPT2py-il").eval()
bos_token = tokenizer.bos_token #Beginning of sentace
eos_token = tokenizer.eos_token #End of sentence
def generate_word(model, tokens_tensor, temperature=1.0):
"""
Sample a word given a tensor of tokens of previous words from a model. Given
the words we have, sample a plausible word. Temperature is used for
controlling randomness. If using temperature==0 we simply use a greedy arg max.
Else, we sample from a multinomial distribution using a lower inverse
temperature to allow for more randomness to escape repetitions.
"""
with torch.no_grad():
outputs = model(tokens_tensor)
predictions = outputs[0]
if temperature>0:
# Make the distribution more or less skewed based on the temperature
predictions = outputs[0]/temperature
# Sample from the distribution
softmax = nn.Softmax(dim=0)
predicted_index = torch.multinomial(softmax(predictions[0,-1,:]),1).item()
# Simply take the arg-max of the distribution
else:
predicted_index = torch.argmax(predictions[0, -1, :]).item()
# Decode the encoding to the corresponding word
predicted_text = tokenizer.decode([predicted_index])
return predicted_text
def generate_sentence(model, tokenizer, initial_text, temperature=1.0):
""" Generate a sentence given some initial text using a model and a tokenizer.
Returns the new sentence. """
# Encode a text inputs
text = ""
sentence = text
# We avoid an infinite loop by setting a maximum range
for i in range(0,84):
indexed_tokens = tokenizer.encode(initial_text + text)
# Convert indexed tokens in a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
new_word = generate_word(model, tokens_tensor, temperature=temperature)
# Here the temperature is slowly decreased with each generated word,
# this ensures that the sentence (ending) makes more sense.
# We don't decrease to a temperature of 0.0 to leave some randomness in.
if temperature<(1-0.008):
temperature += 0.008
else:
temperature = 0.996
text = text+new_word
# Stop generating new words when we have reached the end of the line or the poem
if eos_token in new_word:
# returns new sentence and whether poem is done
return (text.replace(eos_token,"").strip(), True)
elif '/' in new_word:
return (text.strip(), False)
elif bos_token in new_word:
return (text.replace(bos_token,"").strip(), False)
return (text, True)
for output_num in range(1,5):
init_text = "בוקר טוב"
text = bos_token + init_text
for i in range(0,84):
sentence = generate_sentence(model, tokenizer, text, temperature=0.9)
text = init_text + sentence[0]
print(text)
if (sentence[1] == True):
break
```

View File

@ -1,48 +0,0 @@
---
language:
- ach
- en
tags:
- translation
license: cc-by-4.0
datasets:
- JW300
metrics:
- bleu
---
# HEL-ACH-EN
## Model description
MT model translating Acholi to English initialized with weights from [opus-mt-luo-en](https://huggingface.co/Helsinki-NLP/opus-mt-luo-en) on HuggingFace.
## Intended uses & limitations
Machine Translation experiments. Do not use for sensitive tasks.
#### How to use
```python
# You can include sample code which will be formatted
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Ogayo/Hel-ach-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Ogayo/Hel-ach-en")
```
#### Limitations and bias
Trained on Jehovah Witnesses data so contains theirs and Christian views.
## Training data
Trained on OPUS JW300 data.
Initialized with weights from [opus-mt-luo-en](https://huggingface.co/Helsinki-NLP/opus-mt-luo-en?text=Bed+gi+nyasi+mar+chieng%27+nyuol+mopong%27+gi+mor%21#model_card)
## Training procedure
Remove duplicates and rows with no alphabetic characters. Used GPU
## Eval results
testset | BLEU
--- | ---
JW300.luo.en| 46.1

View File

@ -1,63 +0,0 @@
---
language: "en"
---
# BART-Squad2
## Model description
BART for extractive (span-based) question answering, trained on Squad 2.0.
F1 score of 87.4.
## Intended uses & limitations
Unfortunately, the Huggingface auto-inference API won't run this model, so if you're attempting to try it through the input box above and it complains, don't be discouraged!
#### How to use
Here's a quick way to get question answering running locally:
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained("Primer/bart-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("Primer/bart-squad2")
model.to('cuda'); model.eval()
def answer(question, text):
seq = '<s>' + question + ' </s> </s> ' + text + ' </s>'
tokens = tokenizer.encode_plus(seq, return_tensors='pt', padding='max_length', max_length=1024)
input_ids = tokens['input_ids'].to('cuda')
attention_mask = tokens['attention_mask'].to('cuda')
start, end, _ = model(input_ids, attention_mask=attention_mask)
start_idx = int(start.argmax().int())
end_idx = int(end.argmax().int())
print(tokenizer.decode(input_ids[0, start_idx:end_idx]).strip())
# ^^ it will be an empty string if the model decided "unanswerable"
>>> question = "Where does Tom live?"
>>> context = "Tom is an engineer in San Francisco."
>>> answer(question, context)
San Francisco
```
(Just drop the `.to('cuda')` stuff if running on CPU).
#### Limitations and bias
Unknown, no further evaluation has been performed. In a technical sense one big limitation is that it's 1.6G 😬
## Training procedure
`run_squad.py` with:
|param|value|
|---|---|
|batch size|8|
|max_seq_length|1024|
|learning rate|1e-5|
|epochs|2|
Modified to freeze shared parameters and encoder embeddings.

26
model_cards/README.md Normal file
View File

@ -0,0 +1,26 @@
## 🔥 Model cards now live inside each huggingface.co model repo 🔥
For consistency, ease of use and scalability, `README.md` model cards now live directly inside each model repo on the HuggingFace model hub.
### How to update a model card
You can directly update a model card inside any model repo you have **write access** to, i.e.:
- a model under your username namespace
- a model under any organization you are a part of.
You can either:
- update it, commit and push using your usual git workflow (command line, GUI, etc.)
- or edit it directly from the website's UI.
**What if you want to create or update a model card for a model you don't have write access to?**
In that case, given that we don't have a Pull request system yet on huggingface.co (🤯),
you can open an issue here, post the card's content, and tag the model author(s) and/or the Hugging Face team.
We might implement a more seamless process at some point, so your early feedback is precious!
Please let us know of any suggestion.
### What happened to the model cards here?
We migrated every model card from the repo to its corresponding huggingface.co model repo. Individual commits were preserved, and they link back to the original commit on GitHub.

View File

@ -1,141 +0,0 @@
---
language: protein
tags:
- protein language model
datasets:
- Uniref100
---
# ProtBert model
Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in
[this repository](https://github.com/agemagician/ProtTrans). This model is trained on uppercase amino acids: it only works with capital letter amino acids.
## Model description
ProtBert is based on Bert model which pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
One important difference between our Bert model and the original Bert version is the way of dealing with sequences as separate documents.
This means the Next sentence prediction is not used, as each sequence is treated as a complete document.
The masking follows the original Bert training with randomly masks 15% of the amino acids in the input.
At the end, the feature extracted from this model revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein
shape.
This implied learning some of the grammar of the language of life realized in protein sequences.
## Intended uses & limitations
The model could be used for protein feature extraction or to be fine-tuned on downstream tasks.
We have noticed in some tasks you could gain more accuracy by fine-tuning the model rather than using it as a feature extractor.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import BertForMaskedLM, BertTokenizer, pipeline
>>> tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False )
>>> model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')
[{'score': 0.11088453233242035,
'sequence': '[CLS] D L I P T S S K L V V L D T S L Q V K K A F F A L V T [SEP]',
'token': 5,
'token_str': 'L'},
{'score': 0.08402521163225174,
'sequence': '[CLS] D L I P T S S K L V V S D T S L Q V K K A F F A L V T [SEP]',
'token': 10,
'token_str': 'S'},
{'score': 0.07328339666128159,
'sequence': '[CLS] D L I P T S S K L V V V D T S L Q V K K A F F A L V T [SEP]',
'token': 8,
'token_str': 'V'},
{'score': 0.06921856850385666,
'sequence': '[CLS] D L I P T S S K L V V K D T S L Q V K K A F F A L V T [SEP]',
'token': 12,
'token_str': 'K'},
{'score': 0.06382402777671814,
'sequence': '[CLS] D L I P T S S K L V V I D T S L Q V K K A F F A L V T [SEP]',
'token': 11,
'token_str': 'I'}]
```
Here is how to use this model to get the features of a given protein sequence in PyTorch:
```python
from transformers import BertModel, BertTokenizer
import re
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False )
model = BertModel.from_pretrained("Rostlab/prot_bert")
sequence_Example = "A E T C Z A O"
sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
output = model(**encoded_input)
```
## Training data
The ProtBert model was pretrained on [Uniref100](https://www.uniprot.org/downloads), a dataset consisting of 217 million protein sequences.
## Training procedure
### Preprocessing
The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21. The rare amino acids "U,Z,O,B" were mapped to "X".
The inputs of the model are then of the form:
```
[CLS] Protein Sequence A [SEP] Protein Sequence B [SEP]
```
Furthermore, each protein sequence was treated as a separate document.
The preprocessing step was performed twice, once for a combined length (2 sequences) of less than 512 amino acids, and another time using a combined length (2 sequences) of less than 2048 amino acids.
The details of the masking procedure for each sequence followed the original Bert model as following:
- 15% of the amino acids are masked.
- In 80% of the cases, the masked amino acids are replaced by `[MASK]`.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid (different) from the one they replace.
- In the 10% remaining cases, the masked amino acids are left as is.
### Pretraining
The model was trained on a single TPU Pod V3-512 for 400k steps in total.
300K steps using sequence length 512 (batch size 15k), and 100K steps using sequence length 2048 (batch size 2.5k).
The optimizer used is Lamb with a learning rate of 0.002, a weight decay of 0.01, learning rate warmup for 40k steps and linear decay of the learning rate after.
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
Test results :
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | 75 | 63 | | |
| TS115 | 83 | 72 | | |
| CB513 | 81 | 66 | | |
| DeepLoc | | | 79 | 91 |
### BibTeX entry and citation info
```bibtex
@article {Elnaggar2020.07.12.199554,
author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
elocation-id = {2020.07.12.199554},
year = {2020},
doi = {10.1101/2020.07.12.199554},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: \&lt;a href="https://github.com/agemagician/ProtTrans"\&gt;https://github.com/agemagician/ProtTrans\&lt;/a\&gt;Competing Interest StatementThe authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
journal = {bioRxiv}
}
```
> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)

View File

@ -1,141 +0,0 @@
---
language: protein
tags:
- protein language model
datasets:
- BFD
---
# ProtBert-BFD model
Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in
[this repository](https://github.com/agemagician/ProtTrans). This model is trained on uppercase amino acids: it only works with capital letter amino acids.
## Model description
ProtBert-BFD is based on Bert model which pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
One important difference between our Bert model and the original Bert version is the way of dealing with sequences as separate documents
This means the Next sentence prediction is not used, as each sequence is treated as a complete document.
The masking follows the original Bert training with randomly masks 15% of the amino acids in the input.
At the end, the feature extracted from this model revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein
shape.
This implied learning some of the grammar of the language of life realized in protein sequences.
## Intended uses & limitations
The model could be used for protein feature extraction or to be fine-tuned on downstream tasks.
We have noticed in some tasks you could gain more accuracy by fine-tuning the model rather than using it as a feature extractor.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import BertForMaskedLM, BertTokenizer, pipeline
>>> tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert_bfd', do_lower_case=False )
>>> model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')
[{'score': 0.1165614128112793,
'sequence': '[CLS] D L I P T S S K L V V L D T S L Q V K K A F F A L V T [SEP]',
'token': 5,
'token_str': 'L'},
{'score': 0.08976086974143982,
'sequence': '[CLS] D L I P T S S K L V V V D T S L Q V K K A F F A L V T [SEP]',
'token': 8,
'token_str': 'V'},
{'score': 0.08864385634660721,
'sequence': '[CLS] D L I P T S S K L V V S D T S L Q V K K A F F A L V T [SEP]',
'token': 10,
'token_str': 'S'},
{'score': 0.06227643042802811,
'sequence': '[CLS] D L I P T S S K L V V A D T S L Q V K K A F F A L V T [SEP]',
'token': 6,
'token_str': 'A'},
{'score': 0.06194969266653061,
'sequence': '[CLS] D L I P T S S K L V V T D T S L Q V K K A F F A L V T [SEP]',
'token': 15,
'token_str': 'T'}]
```
Here is how to use this model to get the features of a given protein sequence in PyTorch:
```python
from transformers import BertModel, BertTokenizer
import re
tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert_bfd', do_lower_case=False )
model = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
sequence_Example = "A E T C Z A O"
sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
output = model(**encoded_input)
```
## Training data
The ProtBert-BFD model was pretrained on [BFD](https://bfd.mmseqs.com/), a dataset consisting of 2.1 billion protein sequences.
## Training procedure
### Preprocessing
The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21.
The inputs of the model are then of the form:
```
[CLS] Protein Sequence A [SEP] Protein Sequence B [SEP]
```
Furthermore, each protein sequence was treated as a separate document.
The preprocessing step was performed twice, once for a combined length (2 sequences) of less than 512 amino acids, and another time using a combined length (2 sequences) of less than 2048 amino acids.
The details of the masking procedure for each sequence followed the original Bert model as following:
- 15% of the amino acids are masked.
- In 80% of the cases, the masked amino acids are replaced by `[MASK]`.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid (different) from the one they replace.
- In the 10% remaining cases, the masked amino acids are left as is.
### Pretraining
The model was trained on a single TPU Pod V3-1024 for one million steps in total.
800k steps using sequence length 512 (batch size 32k), and 200K steps using sequence length 2048 (batch size 6k).
The optimizer used is Lamb with a learning rate of 0.002, a weight decay of 0.01, learning rate warmup for 140k steps and linear decay of the learning rate after.
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
Test results :
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | 76 | 65 | | |
| TS115 | 84 | 73 | | |
| CB513 | 83 | 70 | | |
| DeepLoc | | | 78 | 91 |
### BibTeX entry and citation info
```bibtex
@article {Elnaggar2020.07.12.199554,
author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
elocation-id = {2020.07.12.199554},
year = {2020},
doi = {10.1101/2020.07.12.199554},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: \&lt;a href="https://github.com/agemagician/ProtTrans"\&gt;https://github.com/agemagician/ProtTrans\&lt;/a\&gt;Competing Interest StatementThe authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
journal = {bioRxiv}
}
```
> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)

View File

@ -1,125 +0,0 @@
---
language: protein
tags:
- protein language model
datasets:
- BFD
---
# ProtT5-XL-BFD model
Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in
[this repository](https://github.com/agemagician/ProtTrans). This model is trained on uppercase amino acids: it only works with capital letter amino acids.
## Model description
ProtT5-XL-BFD is based on the `t5-3b` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
One important difference between this T5 model and the original T5 version is the denosing objective.
The original T5-3B model was pretrained using a span denosing objective, while this model was pre-trained with a Bart-like MLM denosing objective.
The masking probability is consistent with the original T5 training by randomly masking 15% of the amino acids in the input.
It has been shown that the features extracted from this self-supervised model (LM-embeddings) captured important biophysical properties governing protein shape.
shape.
This implied learning some of the grammar of the language of life realized in protein sequences.
## Intended uses & limitations
The model could be used for protein feature extraction or to be fine-tuned on downstream tasks.
We have noticed in some tasks on can gain more accuracy by fine-tuning the model rather than using it as a feature extractor.
We have also noticed that for feature extraction, its better to use the feature extracted from the encoder not from the decoder.
### How to use
Here is how to use this model to extract the features of a given protein sequence in PyTorch:
```python
from transformers import T5Tokenizer, T5Model
import re
import torch
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_bfd', do_lower_case=False)
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")
sequences_Example = ["A E T C Z A O","S K T Z P"]
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])
with torch.no_grad():
embedding = model(input_ids=input_ids,attention_mask=attention_mask,decoder_input_ids=None)
# For feature extraction we recommend to use the encoder embedding
encoder_embedding = embedding[2].cpu().numpy()
decoder_embedding = embedding[0].cpu().numpy()
```
## Training data
The ProtT5-XL-BFD model was pretrained on [BFD](https://bfd.mmseqs.com/), a dataset consisting of 2.1 billion protein sequences.
## Training procedure
### Preprocessing
The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21. The rare amino acids "U,Z,O,B" were mapped to "X".
The inputs of the model are then of the form:
```
Protein Sequence [EOS]
```
The preprocessing step was performed on the fly, by cutting and padding the protein sequences up to 512 tokens.
The details of the masking procedure for each sequence are as follows:
- 15% of the amino acids are masked.
- In 90% of the cases, the masked amino acids are replaced by `[MASK]` token.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid (different) from the one they replace.
### Pretraining
The model was trained on a single TPU Pod V3-1024 for 1.2 million steps in total, using sequence length 512 (batch size 4k).
It has a total of approximately 3B parameters and was trained using the encoder-decoder architecture.
The optimizer used is AdaFactor with inverse square root learning rate schedule for pre-training.
## Evaluation results
When the model is used for feature etraction, this model achieves the following results:
Test results :
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | 77 | 66 | | |
| TS115 | 85 | 74 | | |
| CB513 | 84 | 71 | | |
| DeepLoc | | | 77 | 91 |
### BibTeX entry and citation info
```bibtex
@article {Elnaggar2020.07.12.199554,
author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
elocation-id = {2020.07.12.199554},
year = {2020},
doi = {10.1101/2020.07.12.199554},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: \&lt;a href="https://github.com/agemagician/ProtTrans"\&gt;https://github.com/agemagician/ProtTrans\&lt;/a\&gt;Competing Interest StatementThe authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
journal = {bioRxiv}
}
```
> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)

View File

@ -1,46 +0,0 @@
---
language: hu
license: apache-2.0
datasets:
- common_crawl
- wikipedia
---
# huBERT base model (cased)
## Model description
Cased BERT model for Hungarian, trained on the (filtered, deduplicated) Hungarian subset of the Common Crawl and a snapshot of the Hungarian Wikipedia.
## Intended uses & limitations
The model can be used as any other (cased) BERT model. It has been tested on the chunking and
named entity recognition tasks and set a new state-of-the-art on the former.
## Training
Details of the training data and procedure can be found in the PhD thesis linked below. (With the caveat that it only contains preliminary results
based on the Wikipedia subcorpus. Evaluation of the full model will appear in a future paper.)
## Eval results
When fine-tuned (via `BertForTokenClassification`) on chunking and NER, the model outperforms multilingual BERT, achieves state-of-the-art results on the
former task and comes within 0.5% F1 to the SotA on the latter. The exact scores are
| NER | Minimal NP | Maximal NP |
|-----|------------|------------|
| 97.62% | **97.14%** | **96.97%** |
### BibTeX entry and citation info
The training corpus, parameters and the evaluation methods are discussed in the
[following PhD thesis](https://hlt.bme.hu/en/publ/nemeskey_2020):
```bibtex
@PhDThesis{ Nemeskey:2020,
author = {Nemeskey, Dávid Márk},
title = {Natural Language Processing Methods for Language Modeling},
year = {2020},
school = {E\"otv\"os Lor\'and University}
}
```

View File

@ -1,50 +0,0 @@
# Roberta Large STS-B
This model is a fine tuned RoBERTA model over STS-B.
It was trained with these params:
!python /content/transformers/examples/text-classification/run_glue.py \
--model_type roberta \
--model_name_or_path roberta-large \
--task_name STS-B \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /content/glue_data/STS-B/ \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /content/roberta-sts-b
## How to run
```python
import toolz
import torch
batch_size = 6
def roberta_similarity_batches(to_predict):
batches = toolz.partition(batch_size, to_predict)
similarity_scores = []
for batch in batches:
sentences = [(sentence_similarity["sent1"], sentence_similarity["sent2"]) for sentence_similarity in batch]
batch_scores = similarity_roberta(model, tokenizer,sentences)
similarity_scores = similarity_scores + batch_scores[0].cpu().squeeze(axis=1).tolist()
return similarity_scores
def similarity_roberta(model, tokenizer, sent_pairs):
batch_token = tokenizer(sent_pairs, padding='max_length', truncation=True, max_length=500)
res = model(torch.tensor(batch_token['input_ids']).cuda(), attention_mask=torch.tensor(batch_token["attention_mask"]).cuda())
return res
similarity_roberta(model, tokenizer, [('NEW YORK--(BUSINESS WIRE)--Rosen Law Firm, a global investor rights law firm, announces it is investigating potential securities claims on behalf of shareholders of Vale S.A. ( VALE ) resulting from allegations that Vale may have issued materially misleading business information to the investing public',
'EQUITY ALERT: Rosen Law Firm Announces Investigation of Securities Claims Against Vale S.A. VALE')])
```

View File

@ -1,9 +0,0 @@
---
language: de
license: mit
---
# bert-german-dbmdz-uncased-sentence-stsb
**This model is outdated!**
The new [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) model is better for German language. It is also the current best model for English language and works cross-lingually. Please consider using that model.

View File

@ -1,85 +0,0 @@
---
language:
- de
- en
license: mit
tags:
- sentence_embedding
- search
- pytorch
- xlm-roberta
- roberta
- xlm-r-distilroberta-base-paraphrase-v1
- paraphrase
datasets:
- STSbenchmark
metrics:
- Spearmans rank correlation
- cosine similarity
---
# Cross English & German RoBERTa for Sentence Embeddings
This model is intended to [compute sentence (text) embeddings](https://www.sbert.net/docs/usage/computing_sentence_embeddings.html) for English and German text. These embeddings can then be compared with [cosine-similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to find sentences with a similar semantic meaning. For example this can be useful for [semantic textual similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html), [semantic search](https://www.sbert.net/docs/usage/semantic_search.html), or [paraphrase mining](https://www.sbert.net/docs/usage/paraphrase_mining.html). To do this you have to use the [Sentence Transformers Python framework](https://github.com/UKPLab/sentence-transformers).
The speciality of this model is that it also works cross-lingually. Regardless of the language, the sentences are translated into very similar vectors according to their semantics. This means that you can, for example, enter a search in German and find results according to the semantics in German and also in English. Using a xlm model and _multilingual finetuning with language-crossing_ we reach performance that even exceeds the best current dedicated English large model (see Evaluation section below).
> Sentence-BERT (SBERT) is a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
Source: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)
This model is fine-tuned from [Philip May](https://eniak.de/) and open-sourced by [T-Systems-onsite](https://www.t-systems-onsite.de/). Special thanks to [Nils Reimers](https://www.nils-reimers.de/) for your awesome open-source work, the Sentence Transformers, the models and your help on GitHub.
## How to use
**The usage description above - provided by Hugging Face - is wrong for sentence embeddings! Please use this:**
To use this model install the `sentence-transformers` package (see here: <https://github.com/UKPLab/sentence-transformers>).
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')
```
For details of usage and examples see here:
- [Computing Sentence Embeddings](https://www.sbert.net/docs/usage/computing_sentence_embeddings.html)
- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
- [Paraphrase Mining](https://www.sbert.net/docs/usage/paraphrase_mining.html)
- [Semantic Search](https://www.sbert.net/docs/usage/semantic_search.html)
- [Cross-Encoders](https://www.sbert.net/docs/usage/cross-encoder.html)
- [Examples on GitHub](https://github.com/UKPLab/sentence-transformers/tree/master/examples)
## Training
The base model is [xlm-roberta-base](https://huggingface.co/xlm-roberta-base). This model has been further trained by [Nils Reimers](https://www.nils-reimers.de/) on a large scale paraphrase dataset for 50+ languages. [Nils Reimers](https://www.nils-reimers.de/) about this [on GitHub](https://github.com/UKPLab/sentence-transformers/issues/509#issuecomment-712243280):
>A paper is upcoming for the paraphrase models.
>
>These models were trained on various datasets with Millions of examples for paraphrases, mainly derived from Wikipedia edit logs, paraphrases mined from Wikipedia and SimpleWiki, paraphrases from news reports, AllNLI-entailment pairs with in-batch-negative loss etc.
>
>In internal tests, they perform much better than the NLI+STSb models as they have see more and broader type of training data. NLI+STSb has the issue that they are rather narrow in their domain and do not contain any domain specific words / sentences (like from chemistry, computer science, math etc.). The paraphrase models has seen plenty of sentences from various domains.
>
>More details with the setup, all the datasets, and a wider evaluation will follow soon.
The resulting model called `xlm-r-distilroberta-base-paraphrase-v1` has been released here: <https://github.com/UKPLab/sentence-transformers/releases/tag/v0.3.8>
Building on this cross language model we fine-tuned it for English and German language on the [STSbenchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) dataset. For German language we used the dataset of our [German STSbenchmark dataset](https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark) which has been translated with [deepl.com](https://www.deepl.com/translator). Additionally to the German and English training samples we generated samples of English and German crossed. We call this _multilingual finetuning with language-crossing_. It doubled the traing-datasize and tests show that it further improves performance.
We did an automatic hyperparameter search for 33 trials with [Optuna](https://github.com/optuna/optuna). Using 10-fold crossvalidation on the deepl.com test and dev dataset we found the following best hyperparameter:
- batch_size = 8
- num_epochs = 2
- lr = 1.026343323298136e-05,
- eps = 4.462251033010287e-06
- weight_decay = 0.04794438776350409
- warmup_steps_proportion = 0.1609010732760181
The final model was trained with these hyperparameters on the combination of the train and dev datasets from English, German and the crossings of them. The testset was left for testing.
# Evaluation
The evaluation has been done on English, German and both languages crossed with the STSbenchmark test data. The evaluation-code is available on [Colab](https://colab.research.google.com/drive/1gtGnKq_dYU_sDYqMohTYVMVpxMJjyH0M?usp=sharing). As the metric for evaluation we use the Spearmans rank correlation between the cosine-similarity of the sentence embeddings and STSbenchmark labels.
| Model Name | Spearman<br/>German | Spearman<br/>English | Spearman<br/>EN-DE & DE-EN<br/>(cross) |
|---------------------------------------------------------------|-------------------|--------------------|------------------|
| xlm-r-distilroberta-base-paraphrase-v1 | 0.8079 | 0.8350 | 0.7983 |
| [xlm-r-100langs-bert-base-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens) | 0.7877 | 0.8465 | 0.7908 |
| xlm-r-bert-base-nli-stsb-mean-tokens | 0.7877 | 0.8465 | 0.7908 |
| [roberta-large-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/roberta-large-nli-stsb-mean-tokens) | 0.6371 | 0.8639 | 0.4109 |
| [T-Systems-onsite/<br/>german-roberta-sentence-transformer-v2](https://huggingface.co/T-Systems-onsite/german-roberta-sentence-transformer-v2) | 0.8529 | 0.8634 | 0.8415 |
| **T-Systems-onsite/<br/>cross-en-de-roberta-sentence-transformer** | **0.8550** | **0.8660** | **0.8525** |

View File

@ -1,82 +0,0 @@
---
language: de
license: mit
tags:
- sentence_embedding
- search
- pytorch
- xlm-roberta
- roberta
- xlm-r-distilroberta-base-paraphrase-v1
- paraphrase
datasets:
- STSbenchmark
metrics:
- Spearmans rank correlation
- cosine similarity
---
# German RoBERTa for Sentence Embeddings V2
**The new [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) model is slightly better for German language. It is also the current best model for English language and works cross-lingually. Please consider using that model.**
This model is intended to [compute sentence (text embeddings)](https://www.sbert.net/docs/usage/computing_sentence_embeddings.html) for German text. These embeddings can then be compared with [cosine-similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to find sentences with a similar semantic meaning. For example this can be useful for [semantic textual similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html), [semantic search](https://www.sbert.net/docs/usage/semantic_search.html), or [paraphrase mining](https://www.sbert.net/docs/usage/paraphrase_mining.html). To do this you have to use the [Sentence Transformers Python framework](https://github.com/UKPLab/sentence-transformers).
> Sentence-BERT (SBERT) is a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
Source: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)
This model is fine-tuned from [Philip May](https://eniak.de/) and open-sourced by [T-Systems-onsite](https://www.t-systems-onsite.de/). Special thanks to [Nils Reimers](https://www.nils-reimers.de/) for your awesome open-source work, the Sentence Transformers, the models and your help on GitHub.
## How to use
**The usage description above - provided by Hugging Face - is wrong for sentence embeddings! Please use this:**
To use this model install the `sentence-transformers` package (see here: <https://github.com/UKPLab/sentence-transformers>).
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('T-Systems-onsite/german-roberta-sentence-transformer-v2')
```
For details of usage and examples see here:
- [Computing Sentence Embeddings](https://www.sbert.net/docs/usage/computing_sentence_embeddings.html)
- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
- [Paraphrase Mining](https://www.sbert.net/docs/usage/paraphrase_mining.html)
- [Semantic Search](https://www.sbert.net/docs/usage/semantic_search.html)
- [Cross-Encoders](https://www.sbert.net/docs/usage/cross-encoder.html)
- [Examples on GitHub](https://github.com/UKPLab/sentence-transformers/tree/master/examples)
## Training
The base model is [xlm-roberta-base](https://huggingface.co/xlm-roberta-base). This model has been further trained by [Nils Reimers](https://www.nils-reimers.de/) on a large scale paraphrase dataset for 50+ languages. [Nils Reimers](https://www.nils-reimers.de/) about this [on GitHub](https://github.com/UKPLab/sentence-transformers/issues/509#issuecomment-712243280):
>A paper is upcoming for the paraphrase models.
>
>These models were trained on various datasets with Millions of examples for paraphrases, mainly derived from Wikipedia edit logs, paraphrases mined from Wikipedia and SimpleWiki, paraphrases from news reports, AllNLI-entailment pairs with in-batch-negative loss etc.
>
>In internal tests, they perform much better than the NLI+STSb models as they have see more and broader type of training data. NLI+STSb has the issue that they are rather narrow in their domain and do not contain any domain specific words / sentences (like from chemistry, computer science, math etc.). The paraphrase models has seen plenty of sentences from various domains.
>
>More details with the setup, all the datasets, and a wider evaluation will follow soon.
The resulting model called `xlm-r-distilroberta-base-paraphrase-v1` has been released here: <https://github.com/UKPLab/sentence-transformers/releases/tag/v0.3.8>
Building on this cross language model we fine-tuned it for German language on the [deepl.com](https://www.deepl.com/translator) dataset of our [German STSbenchmark dataset](https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark).
We did an automatic hyperparameter search for 102 trials with [Optuna](https://github.com/optuna/optuna). Using 10-fold crossvalidation on the deepl.com test and dev dataset we found the following best hyperparameters:
- batch_size = 15
- num_epochs = 4
- lr = 2.2995320905210864e-05
- eps = 1.8979875906303792e-06
- weight_decay = 0.003314045812507563
- warmup_steps_proportion = 0.46141685205829014
The final model was trained with these hyperparameters on the combination of `sts_de_train.csv` and `sts_de_dev.csv`. The `sts_de_test.csv` was left for testing.
# Evaluation
The evaluation has been done on the test set of our [German STSbenchmark dataset](https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark). The code is available on [Colab](https://colab.research.google.com/drive/1aCWOqDQx953kEnQ5k4Qn7uiixokocOHv?usp=sharing). As the metric for evaluation we use the Spearmans rank correlation between the cosine-similarity of the sentence embeddings and STSbenchmark labels.
| Model Name | Spearman rank correlation<br/>(German) |
|--------------------------------------|-------------------------------------|
| xlm-r-distilroberta-base-paraphrase-v1 | 0.8079 |
| xlm-r-100langs-bert-base-nli-stsb-mean-tokens | 0.8194 |
| xlm-r-bert-base-nli-stsb-mean-tokens | 0.8194 |
| **T-Systems-onsite/<br/>german-roberta-sentence-transformer-v2** | **0.8529** |
| **[T-Systems-onsite/<br/>cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer)** | **0.8550** |

View File

@ -1,39 +0,0 @@
---
language: uk
---
Note: **default code snippet above won't work** because we are using `AlbertTokenizer` with `GPT2LMHeadModel`, see [issue](https://github.com/huggingface/transformers/issues/4285).
## GPT2 124M Trained on Ukranian Fiction
### Training details
Model was trained on corpus of 4040 fiction books, 2.77 GiB in total.
Evaluation on [brown-uk](https://github.com/brown-uk/corpus) gives perplexity of 50.16.
### Example usage:
```python
from transformers import AlbertTokenizer, GPT2LMHeadModel
tokenizer = AlbertTokenizer.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
model = GPT2LMHeadModel.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
input_ids = tokenizer.encode("Но зла Юнона, суча дочка,", add_special_tokens=False, return_tensors='pt')
outputs = model.generate(
input_ids,
do_sample=True,
num_return_sequences=3,
max_length=50
)
for i, out in enumerate(outputs):
print("{}: {}".format(i, tokenizer.decode(out)))
```
Prints something like this:
```bash
0: Но зла Юнона, суча дочка, яка затьмарила всі її таємниці: І хто з'їсть її душу, той помре». І, не дочекавшись гніву богів, посунула в пітьму, щоб не бачити перед собою. Але, за
1: Но зла Юнона, суча дочка, і довела мене до божевілля. Але він не знав нічого. Після того як я його побачив, мені стало зле. Я втратив рівновагу. Але в мене не було часу на роздуми. Я вже втратив надію
2: Но зла Юнона, суча дочка, не нарікала нам! — раптом вигукнула Юнона. — Це ти, старий йолопе! — мовила вона, не перестаючи сміятись. — Хіба ти не знаєш, що мені подобається ходити з тобою?
```

View File

@ -1,84 +0,0 @@
---
language: fi
---
## Quickstart
**Release 1.0** (November 25, 2019)
Download the models here:
* Cased Finnish BERT Base: [bert-base-finnish-cased-v1.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-cased-v1.zip)
* Uncased Finnish BERT Base: [bert-base-finnish-uncased-v1.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-uncased-v1.zip)
We generally recommend the use of the cased model.
Paper presenting Finnish BERT: [arXiv:1912.07076](https://arxiv.org/abs/1912.07076)
## What's this?
A version of Google's [BERT](https://github.com/google-research/bert) deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks.
FinBERT features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words than e.g. the previously released [multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) models from Google:
| Vocabulary | Example |
|------------|---------|
| FinBERT | Suomessa vaihtuu kesän aikana sekä pääministeri että valtiovarain ##ministeri . |
| Multilingual BERT | Suomessa vai ##htuu kes ##än aikana sekä p ##ää ##minister ##i että valt ##io ##vara ##in ##minister ##i . |
FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train FinBERT.
These features allow FinBERT to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks.
## Results
### Document classification
![learning curves for Yle and Ylilauta document classification](https://raw.githubusercontent.com/TurkuNLP/FinBERT/master/img/yle-ylilauta-curves.png)
FinBERT outperforms multilingual BERT (M-BERT) on document classification over a range of training set sizes on the Yle news (left) and Ylilauta online discussion (right) corpora. (Baseline classification performance with [FastText](https://fasttext.cc/) included for reference.)
[[code](https://github.com/spyysalo/finbert-text-classification)][[Yle data](https://github.com/spyysalo/yle-corpus)] [[Ylilauta data](https://github.com/spyysalo/ylilauta-corpus)]
### Named Entity Recognition
Evaluation on FiNER corpus ([Ruokolainen et al 2019](https://arxiv.org/abs/1908.04212))
| Model | Accuracy |
|--------------------|----------|
| **FinBERT** | **92.40%** |
| Multilingual BERT | 90.29% |
| [FiNER-tagger](https://github.com/Traubert/FiNer-rules) (rule-based) | 86.82% |
(FiNER tagger results from [Ruokolainen et al. 2019](https://arxiv.org/pdf/1908.04212.pdf))
[[code](https://github.com/jouniluoma/keras-bert-ner)][[data](https://github.com/mpsilfve/finer-data)]
### Part of speech tagging
Evaluation on three Finnish corpora annotated with [Universal Dependencies](https://universaldependencies.org/) part-of-speech tags: the Turku Dependency Treebank (TDT), FinnTreeBank (FTB), and Parallel UD treebank (PUD)
| Model | TDT | FTB | PUD |
|-------------------|-------------|-------------|-------------|
| **FinBERT** | **98.23%** | **98.39%** | **98.08%** |
| Multilingual BERT | 96.97% | 95.87% | 97.58% |
[[code](https://github.com/spyysalo/bert-pos)][[data](http://hdl.handle.net/11234/1-2837)]
## Use with PyTorch
If you want to use the model with the huggingface/transformers library, follow the steps in [huggingface_transformers.md](https://github.com/TurkuNLP/FinBERT/blob/master/huggingface_transformers.md)
## Previous releases
### Release 0.2
**October 24, 2019** Beta version of the BERT base uncased model trained from scratch on a corpus of Finnish news, online discussions, and crawled data.
Download the model here: [bert-base-finnish-uncased.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-uncased.zip)
### Release 0.1
**September 30, 2019** We release a beta version of the BERT base cased model trained from scratch on a corpus of Finnish news, online discussions, and crawled data.
Download the model here: [bert-base-finnish-cased.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-cased.zip)

View File

@ -1,84 +0,0 @@
---
language: fi
---
## Quickstart
**Release 1.0** (November 25, 2019)
Download the models here:
* Cased Finnish BERT Base: [bert-base-finnish-cased-v1.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-cased-v1.zip)
* Uncased Finnish BERT Base: [bert-base-finnish-uncased-v1.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-uncased-v1.zip)
We generally recommend the use of the cased model.
Paper presenting Finnish BERT: [arXiv:1912.07076](https://arxiv.org/abs/1912.07076)
## What's this?
A version of Google's [BERT](https://github.com/google-research/bert) deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks.
FinBERT features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words than e.g. the previously released [multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) models from Google:
| Vocabulary | Example |
|------------|---------|
| FinBERT | Suomessa vaihtuu kesän aikana sekä pääministeri että valtiovarain ##ministeri . |
| Multilingual BERT | Suomessa vai ##htuu kes ##än aikana sekä p ##ää ##minister ##i että valt ##io ##vara ##in ##minister ##i . |
FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train FinBERT.
These features allow FinBERT to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks.
## Results
### Document classification
![learning curves for Yle and Ylilauta document classification](https://raw.githubusercontent.com/TurkuNLP/FinBERT/master/img/yle-ylilauta-curves.png)
FinBERT outperforms multilingual BERT (M-BERT) on document classification over a range of training set sizes on the Yle news (left) and Ylilauta online discussion (right) corpora. (Baseline classification performance with [FastText](https://fasttext.cc/) included for reference.)
[[code](https://github.com/spyysalo/finbert-text-classification)][[Yle data](https://github.com/spyysalo/yle-corpus)] [[Ylilauta data](https://github.com/spyysalo/ylilauta-corpus)]
### Named Entity Recognition
Evaluation on FiNER corpus ([Ruokolainen et al 2019](https://arxiv.org/abs/1908.04212))
| Model | Accuracy |
|--------------------|----------|
| **FinBERT** | **92.40%** |
| Multilingual BERT | 90.29% |
| [FiNER-tagger](https://github.com/Traubert/FiNer-rules) (rule-based) | 86.82% |
(FiNER tagger results from [Ruokolainen et al. 2019](https://arxiv.org/pdf/1908.04212.pdf))
[[code](https://github.com/jouniluoma/keras-bert-ner)][[data](https://github.com/mpsilfve/finer-data)]
### Part of speech tagging
Evaluation on three Finnish corpora annotated with [Universal Dependencies](https://universaldependencies.org/) part-of-speech tags: the Turku Dependency Treebank (TDT), FinnTreeBank (FTB), and Parallel UD treebank (PUD)
| Model | TDT | FTB | PUD |
|-------------------|-------------|-------------|-------------|
| **FinBERT** | **98.23%** | **98.39%** | **98.08%** |
| Multilingual BERT | 96.97% | 95.87% | 97.58% |
[[code](https://github.com/spyysalo/bert-pos)][[data](http://hdl.handle.net/11234/1-2837)]
## Use with PyTorch
If you want to use the model with the huggingface/transformers library, follow the steps in [huggingface_transformers.md](https://github.com/TurkuNLP/FinBERT/blob/master/huggingface_transformers.md)
## Previous releases
### Release 0.2
**October 24, 2019** Beta version of the BERT base uncased model trained from scratch on a corpus of Finnish news, online discussions, and crawled data.
Download the model here: [bert-base-finnish-uncased.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-uncased.zip)
### Release 0.1
**September 30, 2019** We release a beta version of the BERT base cased model trained from scratch on a corpus of Finnish news, online discussions, and crawled data.
Download the model here: [bert-base-finnish-cased.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-cased.zip)

View File

@ -1,59 +0,0 @@
---
language: fr
widget:
- text: "Je m'appelle Hicham et je vis a Fès"
---
# MagBERT-NER: a state-of-the-art NER model for Moroccan French language (Maghreb)
## Introduction
[MagBERT-NER] is a state-of-the-art NER model for Moroccan French language (Maghreb). The MagBERT-NER model was fine-tuned for NER Task based the language model for French Camembert (based on the RoBERTa architecture).
For further information or requests, please visite our website at [typica.ai Website](https://typica.ai/) or send us an email at contactus@typica.ai
## How to use MagBERT-NER with HuggingFace
##### Load MagBERT-NER and its sub-word tokenizer :
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("TypicaAI/magbert-ner")
model = AutoModelForTokenClassification.from_pretrained("TypicaAI/magbert-ner")
##### Process text sample (from wikipedia about the current Prime Minister of Morocco) Using NER pipeline
from transformers import pipeline
nlp = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
nlp("Saad Dine El Otmani, né le 16 janvier 1956 à Inezgane, est un homme d'État marocain, chef du gouvernement du Maroc depuis le 5 avril 2017")
#[{'entity_group': 'I-PERSON',
# 'score': 0.8941445276141167,
# 'word': 'Saad Dine El Otmani'},
# {'entity_group': 'B-DATE',
# 'score': 0.5967703461647034,
# 'word': '16 janvier 1956'},
# {'entity_group': 'B-GPE', 'score': 0.7160899192094803, 'word': 'Inezgane'},
# {'entity_group': 'B-NORP', 'score': 0.7971733212471008, 'word': 'marocain'},
# {'entity_group': 'B-GPE', 'score': 0.8921478390693665, 'word': 'Maroc'},
# {'entity_group': 'B-DATE',
# 'score': 0.5760444005330404,
# 'word': '5 avril 2017'}]
```
## Authors
MagBert-NER Model was trained by Hicham Assoudi, Ph.D.
For any questions, comments you can contact me at assoudi@typica.ai
## Citation
If you use our work, please cite:
Hicham Assoudi, Ph.D., MagBERT-NER: a state-of-the-art NER model for Moroccan French language (Maghreb), (2020)

View File

@ -1,51 +0,0 @@
---
language: "en"
tags:
- paraphrase-generation
- text-generation
- Conditional Generation
inference: false
---
# Paraphrase-Generation
## Model description
T5 Model for generating paraphrases of english sentences. Trained on the [Google PAWS](https://github.com/google-research-datasets/paws) dataset.
## How to use
PyTorch and TF models available
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Vamsi/T5_Paraphrase_Paws")
model = AutoModelForSeq2SeqLM.from_pretrained("Vamsi/T5_Paraphrase_Paws")
sentence = "This is something which i cannot understand at all"
text = "paraphrase: " + sentence + " </s>"
encoding = tokenizer.encode_plus(text,pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")
outputs = model.generate(
input_ids=input_ids, attention_mask=attention_masks,
max_length=256,
do_sample=True,
top_k=120,
top_p=0.95,
early_stopping=True,
num_return_sequences=5
)
for output in outputs:
line = tokenizer.decode(output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
print(line)
```
For more reference on training your own T5 model or using this model, do check out [Paraphrase Generation](https://github.com/Vamsi995/Paraphrase-Generator).

View File

@ -1,25 +0,0 @@
---
language: en
datasets:
- yelp_polarity
---
# RoBERTa-base-finetuned-yelp-polarity
This is a [RoBERTa-base](https://huggingface.co/roberta-base) checkpoint fine-tuned on binary sentiment classifcation from [Yelp polarity](https://huggingface.co/nlp/viewer/?dataset=yelp_polarity).
It gets **98.08%** accuracy on the test set.
## Hyper-parameters
We used the following hyper-parameters to train the model on one GPU:
```python
num_train_epochs = 2.0
learning_rate = 1e-05
weight_decay = 0.0
adam_epsilon = 1e-08
max_grad_norm = 1.0
per_device_train_batch_size = 32
gradient_accumulation_steps = 1
warmup_steps = 3500
seed = 42
```

View File

@ -1,25 +0,0 @@
---
language: no
thumbnail: https://i.imgur.com/QqSEC5I.png
---
# Norwegian Electra
![Image of norwegian electra](https://i.imgur.com/QqSEC5I.png)
Trained on Oscar + wikipedia + opensubtitles + some other data I had with the awesome power of TPUs(V3-8)
Use with caution. I have no downstream tasks in Norwegian to test on so I have no idea of its performance yet.
# Model
## Electra: Pre-training Text Encoders as Discriminators Rather Than Generators
Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning
- https://openreview.net/pdf?id=r1xMH1BtvB
- https://github.com/google-research/electra
# Acknowledgments
### TensorFlow Research Cloud
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
- https://www.tensorflow.org/tfrc
#### OSCAR corpus
- https://oscar-corpus.com/
#### OPUS
- http://opus.nlpl.eu/
- http://www.opensubtitles.org/

View File

@ -1,76 +0,0 @@
---
datasets:
- squad_v2
---
# BART-LARGE finetuned on SQuADv2
This is bart-large model finetuned on SQuADv2 dataset for question answering task
## Model details
BART was propsed in the [paper](https://arxiv.org/abs/1910.13461) **BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension**.
BART is a seq2seq model intended for both NLG and NLU tasks.
To use BART for question answering tasks, we feed the complete document into the encoder and decoder, and use the top
hidden state of the decoder as a representation for each
word. This representation is used to classify the token. As given in the paper bart-large achives comparable to ROBERTa on SQuAD.
Another notable thing about BART is that it can handle sequences with upto 1024 tokens.
| Param | #Value |
|---------------------|--------|
| encoder layers | 12 |
| decoder layers | 12 |
| hidden size | 4096 |
| num attetion heads | 16 |
| on disk size | 1.63GB |
## Model training
This model was trained with following parameters using simpletransformers wrapper:
```
train_args = {
'learning_rate': 1e-5,
'max_seq_length': 512,
'doc_stride': 512,
'overwrite_output_dir': True,
'reprocess_input_data': False,
'train_batch_size': 8,
'num_train_epochs': 2,
'gradient_accumulation_steps': 2,
'no_cache': True,
'use_cached_eval_features': False,
'save_model_every_epoch': False,
'output_dir': "bart-squadv2",
'eval_batch_size': 32,
'fp16_opt_level': 'O2',
}
```
[You can even train your own model using this colab notebook](https://colab.research.google.com/drive/1I5cK1M_0dLaf5xoewh6swcm5nAInfwHy?usp=sharing)
## Results
```{"correct": 6832, "similar": 4409, "incorrect": 632, "eval_loss": -14.950117511952177}```
## Model in Action 🚀
```python3
from transformers import BartTokenizer, BartForQuestionAnswering
import torch
tokenizer = BartTokenizer.from_pretrained('a-ware/bart-squadv2')
model = BartForQuestionAnswering.from_pretrained('a-ware/bart-squadv2')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
answer = tokenizer.convert_tokens_to_ids(answer.split())
answer = tokenizer.decode(answer)
#answer => 'a nice puppet'
```
> Created with ❤️ by A-ware UG [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/aware-ai)

View File

@ -1,48 +0,0 @@
---
datasets:
- squad_v2
---
# Roberta-LARGE finetuned on SQuADv2
This is roberta-large model finetuned on SQuADv2 dataset for question answering answerability classification
## Model details
This model is simply an Sequenceclassification model with two inputs (context and question) in a list.
The result is either [1] for answerable or [0] if it is not answerable.
It was trained over 4 epochs on squadv2 dataset and can be used to filter out which context is good to give into the QA model to avoid bad answers.
## Model training
This model was trained with following parameters using simpletransformers wrapper:
```
train_args = {
'learning_rate': 1e-5,
'max_seq_length': 512,
'overwrite_output_dir': True,
'reprocess_input_data': False,
'train_batch_size': 4,
'num_train_epochs': 4,
'gradient_accumulation_steps': 2,
'no_cache': True,
'use_cached_eval_features': False,
'save_model_every_epoch': False,
'output_dir': "bart-squadv2",
'eval_batch_size': 8,
'fp16_opt_level': 'O2',
}
```
## Results
```{"accuracy": 90.48%}```
## Model in Action 🚀
```python3
from simpletransformers.classification import ClassificationModel
model = ClassificationModel('roberta', 'a-ware/roberta-large-squadv2', num_labels=2, args=train_args)
predictions, raw_outputs = model.predict([["my dog is an year old. he loves to go into the rain", "how old is my dog ?"]])
print(predictions)
==> [1]
```
> Created with ❤️ by A-ware UG [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/aware-ai)

View File

@ -1,59 +0,0 @@
---
datasets:
- squad_v2
---
# XLM-ROBERTA-LARGE finetuned on SQuADv2
This is xlm-roberta-large model finetuned on SQuADv2 dataset for question answering task
## Model details
XLM-Roberta was propsed in the [paper](https://arxiv.org/pdf/1911.02116.pdf) **XLM-R: State-of-the-art cross-lingual understanding through self-supervision
## Model training
This model was trained with following parameters using simpletransformers wrapper:
```
train_args = {
'learning_rate': 1e-5,
'max_seq_length': 512,
'doc_stride': 512,
'overwrite_output_dir': True,
'reprocess_input_data': False,
'train_batch_size': 8,
'num_train_epochs': 2,
'gradient_accumulation_steps': 2,
'no_cache': True,
'use_cached_eval_features': False,
'save_model_every_epoch': False,
'output_dir': "bart-squadv2",
'eval_batch_size': 32,
'fp16_opt_level': 'O2',
}
```
## Results
```{"correct": 6961, "similar": 4359, "incorrect": 553, "eval_loss": -12.177856394381962}```
## Model in Action 🚀
```python3
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
import torch
tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
answer = tokenizer.convert_tokens_to_ids(answer.split())
answer = tokenizer.decode(answer)
#answer => 'a nice puppet'
```
> Created with ❤️ by A-ware UG [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/aware-ai)

View File

@ -1,132 +0,0 @@
---
tags:
- finance
---
# Roberta Masked Language Model Trained On Financial Phrasebank Corpus
This is a Masked Language Model trained with [Roberta](https://huggingface.co/transformers/model_doc/roberta.html) on a Financial Phrasebank Corpus.
The model is built using Huggingface transformers.
The model can be found at :[Financial_Roberta](https://huggingface.co/abhilash1910/financial_roberta)
## Specifications
The corpus for training is taken from the Financial Phrasebank (Malo et al)[https://www.researchgate.net/publication/251231107_Good_Debt_or_Bad_Debt_Detecting_Semantic_Orientations_in_Economic_Texts].
## Model Specification
The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
1. vocab_size=56000
2. max_position_embeddings=514
3. num_attention_heads=12
4. num_hidden_layers=6
5. type_vocab_size=1
This is trained by using RobertaConfig from transformers package.
The model is trained for 10 epochs with a gpu batch size of 64 units.
## Usage Specifications
For using this model, we have to first import AutoTokenizer and AutoModelWithLMHead Modules from transformers
After that we have to specify, the pre-trained model,which in this case is 'abhilash1910/financial_roberta' for the tokenizers and the model.
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("abhilash1910/financial_roberta")
model = AutoModelWithLMHead.from_pretrained("abhilash1910/financial_roberta")
```
After this the model will be downloaded, it will take some time to download all the model files.
For testing the model, we have to import pipeline module from transformers and create a masked output model for inference as follows:
```python
from transformers import pipeline
model_mask = pipeline('fill-mask', model='abhilash1910/inancial_roberta')
model_mask("The company had a <mask> of 20% in 2020.")
```
Some of the examples are also provided with generic financial statements:
Example 1:
```python
model_mask("The company had a <mask> of 20% in 2020.")
```
Output:
```bash
[{'sequence': '<s>The company had a profit of 20% in 2020.</s>',
'score': 0.023112965747714043,
'token': 421,
'token_str': 'Ġprofit'},
{'sequence': '<s>The company had a loss of 20% in 2020.</s>',
'score': 0.021379893645644188,
'token': 616,
'token_str': 'Ġloss'},
{'sequence': '<s>The company had a year of 20% in 2020.</s>',
'score': 0.0185744296759367,
'token': 443,
'token_str': 'Ġyear'},
{'sequence': '<s>The company had a sales of 20% in 2020.</s>',
'score': 0.018143286928534508,
'token': 428,
'token_str': 'Ġsales'},
{'sequence': '<s>The company had a value of 20% in 2020.</s>',
'score': 0.015319528989493847,
'token': 776,
'token_str': 'Ġvalue'}]
```
Example 2:
```python
model_mask("The <mask> is listed under NYSE")
```
Output:
```bash
[{'sequence': '<s>The company is listed under NYSE</s>',
'score': 0.1566661298274994,
'token': 359,
'token_str': 'Ġcompany'},
{'sequence': '<s>The total is listed under NYSE</s>',
'score': 0.05542507395148277,
'token': 522,
'token_str': 'Ġtotal'},
{'sequence': '<s>The value is listed under NYSE</s>',
'score': 0.04729423299431801,
'token': 776,
'token_str': 'Ġvalue'},
{'sequence': '<s>The order is listed under NYSE</s>',
'score': 0.02533523552119732,
'token': 798,
'token_str': 'Ġorder'},
{'sequence': '<s>The contract is listed under NYSE</s>',
'score': 0.02087237872183323,
'token': 635,
'token_str': 'Ġcontract'}]
```
## Resources
For all resources , please look into the [HuggingFace](https://huggingface.co/) Site and the [Repositories](https://github.com/huggingface).

View File

@ -1,131 +0,0 @@
# Roberta Trained Model For Masked Language Model On French Corpus :robot:
This is a Masked Language Model trained with [Roberta](https://huggingface.co/transformers/model_doc/roberta.html) on a small French News Corpus(Leipzig corpora).
The model is built using Huggingface transformers.
The model can be found at :[French-Roberta](https://huggingface.co/abhilash1910/french-roberta)
## Specifications
The corpus for training is taken from Leipzig Corpora (French News) , and is trained on a small set of the corpus (300K).
## Model Specification
The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
1. vocab_size=32000
2. max_position_embeddings=514
3. num_attention_heads=12
4. num_hidden_layers=6
5. type_vocab_size=1
This is trained by using RobertaConfig from transformers package.The total training parameters :68124416
The model is trained for 100 epochs with a gpu batch size of 64 units.
More details for building custom models can be found at the [HuggingFace Blog](https://huggingface.co/blog/how-to-train)
## Usage Specifications
For using this model, we have to first import AutoTokenizer and AutoModelWithLMHead Modules from transformers
After that we have to specify, the pre-trained model,which in this case is 'abhilash1910/french-roberta' for the tokenizers and the model.
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("abhilash1910/french-roberta")
model = AutoModelWithLMHead.from_pretrained("abhilash1910/french-roberta")
```
After this the model will be downloaded, it will take some time to download all the model files.
For testing the model, we have to import pipeline module from transformers and create a masked output model for inference as follows:
```python
from transformers import pipeline
model_mask = pipeline('fill-mask', model='abhilash1910/french-roberta')
model_mask("Le tweet <mask>.")
```
Some of the examples are also provided with generic French sentences:
Example 1:
```python
model_mask("À ce jour, <mask> projet a entraîné")
```
Output:
```bash
[{'sequence': '<s>À ce jour, belles projet a entraîné</s>',
'score': 0.18685665726661682,
'token': 6504,
'token_str': 'Ġbelles'},
{'sequence': '<s>À ce jour,- projet a entraîné</s>',
'score': 0.0005200508167035878,
'token': 17,
'token_str': '-'},
{'sequence': '<s>À ce jour, de projet a entraîné</s>',
'score': 0.00045729897101409733,
'token': 268,
'token_str': 'Ġde'},
{'sequence': '<s>À ce jour, du projet a entraîné</s>',
'score': 0.0004307595663703978,
'token': 326,
'token_str': 'Ġdu'},
{'sequence': '<s>À ce jour," projet a entraîné</s>',
'score': 0.0004219160182401538,
'token': 6,
'token_str': '"'}]
```
Example 2:
```python
model_mask("C'est un <mask>")
```
Output:
```bash
[{'sequence': "<s>C'est un belles</s>",
'score': 0.16440927982330322,
'token': 6504,
'token_str': 'Ġbelles'},
{'sequence': "<s>C'est un de</s>",
'score': 0.0005495127406902611,
'token': 268,
'token_str': 'Ġde'},
{'sequence': "<s>C'est un du</s>",
'score': 0.00044988933950662613,
'token': 326,
'token_str': 'Ġdu'},
{'sequence': "<s>C'est un-</s>",
'score': 0.00044542422983795404,
'token': 17,
'token_str': '-'},
{'sequence': "<s>C'est un\t</s>",
'score': 0.00037563967634923756,
'token': 202,
'token_str': 'ĉ'}]
```
## Resources
For all resources , please look into the [HuggingFace](https://huggingface.co/) Site and the [Repositories](https://github.com/huggingface).

View File

@ -1,43 +0,0 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_laptop")
model = AutoModel.from_pretrained("activebus/BERT-DK_laptop")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@ -1,41 +0,0 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_rest")
model = AutoModel.from_pretrained("activebus/BERT-DK_rest")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@ -1,41 +0,0 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
`BERT-PT_*` addtionally uses SQuAD 1.1.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_laptop")
model = AutoModel.from_pretrained("activebus/BERT-PT_laptop")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@ -1,42 +0,0 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.
`BERT-PT_*` addtionally uses SQuAD 1.1.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_rest")
model = AutoModel.from_pretrained("activebus/BERT-PT_rest")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@ -1,44 +0,0 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
Please visit https://github.com/howardhsu/BERT-for-RRC-ABSA for details.
`BERT-XD_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model, where each example is from a single product / restaurant with the same rating, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
The preprocessing code [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
## Model Description
The original model is from `BERT-base-uncased`.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-XD_Review")
model = AutoModel.from_pretrained("activebus/BERT-XD_Review")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@ -1,44 +0,0 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT_Review` is cross-domain (beyond just `laptop` and `restaurant`) language model with one example from randomly mixed domains, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
The preprocessing code [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT_Review")
model = AutoModel.from_pretrained("activebus/BERT_Review")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@ -1,37 +0,0 @@
---
language: pt
---
# PTT5-SMALL-SUM
## Model description
This model was trained to summarize texts in portuguese
based on ```unicamp-dl/ptt5-small-portuguese-vocab```
#### How to use
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('adalbertojunior/PTT5-SMALL-SUM')
t5 = T5ForConditionalGeneration.from_pretrained('adalbertojunior/PTT5-SMALL-SUM')
text="Esse é um exemplo de sumarização."
input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True)
generated_ids = t5.generate(
input_ids=input_ids,
num_beams=1,
max_length=40,
#repetition_penalty=2.5
).squeeze()
predicted_span = tokenizer.decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
```

Some files were not shown because too many files have changed in this diff Show More