mirror of https://github.com/huggingface/transformers.git synced 2025-08-02 19:21:31 +06:00

Julien Chaumond b23d3a5ad4 [model_cards] Switch all languages codes to ISO-639-{1,2,3}		2020-07-15 18:59:20 +02:00

language: ro

bert-base-romanian-cased-v1

The BERT base, cased model for Romanian, trained on a 15GB corpus, version 1.0.

How to use

from transformers import AutoTokenizer, AutoModel
import torch
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
with torch.no_grad():  # inference only, so gradient tracking is not needed
    outputs = model(input_ids)
# get encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the model output

Evaluation

Evaluation is performed on Universal Dependencies Romanian RRT UPOS, XPOS and LAS, and on a NER task based on RONEC. Details, as well as more in-depth tests not shown here, are given in the dedicated evaluation page.

The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, which at the time of writing was the only available BERT model that worked on Romanian.

Model	UPOS	XPOS	NER	LAS
bert-base-multilingual-cased	97.87	96.16	84.13	88.04
bert-base-romanian-cased-v1	98.00	96.46	85.88	89.69
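
To make the comparison with the baseline easier to read, the differences between the two rows can be computed directly. The snippet below is a minimal sketch: the scores are copied from the table above, while the `scores` dictionary and the `absolute_gains` helper are illustrative names that are not part of the model or of any library.

```python
# Scores copied from the evaluation table above (one entry per metric).
scores = {
    "bert-base-multilingual-cased": {"UPOS": 97.87, "XPOS": 96.16, "NER": 84.13, "LAS": 88.04},
    "bert-base-romanian-cased-v1": {"UPOS": 98.00, "XPOS": 96.46, "NER": 85.88, "LAS": 89.69},
}


def absolute_gains(baseline, candidate):
    """Return candidate minus baseline for every metric, rounded to 2 decimals."""
    return {metric: round(candidate[metric] - baseline[metric], 2) for metric in baseline}


gains = absolute_gains(
    scores["bert-base-multilingual-cased"],
    scores["bert-base-romanian-cased-v1"],
)

for metric, gain in gains.items():
    print(f"{metric}: {gain:+.2f} points")
```

By this calculation the largest gains are on NER and LAS, with smaller gains on UPOS and XPOS.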

Corpus

The model is trained on the following corpora (stats in the table below are after cleaning):

Corpus	Lines(M)	Words(M)	Chars(B)	Size(GB)
OPUS	55.05	635.04	4.045	3.8
OSCAR	33.56	1725.82	11.411	11
Wikipedia	1.54	60.47	0.411	0.4
Total	90.15	2421.33	15.867	15.2
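
As a quick sanity check, the Total row can be recomputed from the three corpus rows, and each corpus's share of the overall size derived from it. This is only a sketch over the numbers in the table above; the variable names (`corpora`, `reported_total`, `size_share`) are illustrative.

```python
# Rows copied from the corpus table above.
# Column order: lines (M), words (M), chars (B), size (GB).
corpora = {
    "OPUS": (55.05, 635.04, 4.045, 3.8),
    "OSCAR": (33.56, 1725.82, 11.411, 11.0),
    "Wikipedia": (1.54, 60.47, 0.411, 0.4),
}
reported_total = (90.15, 2421.33, 15.867, 15.2)
columns = ("Lines(M)", "Words(M)", "Chars(B)", "Size(GB)")

# Sum each column over the three corpora.
computed_total = tuple(
    round(sum(row[i] for row in corpora.values()), 3) for i in range(len(columns))
)

# Compare the recomputed totals with the reported Total row.
for name, computed, reported in zip(columns, computed_total, reported_total):
    status = "ok" if abs(computed - reported) < 1e-6 else "mismatch"
    print(f"{name}: computed {computed}, reported {reported} ({status})")

# Share of the total on-disk size contributed by each corpus, in percent.
size_share = {
    name: 100 * row[3] / reported_total[3] for name, row in corpora.items()
}
for name, share in size_share.items():
    print(f"{name}: {share:.1f}% of total size")
```

The recomputed sums match the reported Total row, and OSCAR accounts for the bulk of the training data by size.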

Acknowledgements

We'd like to thank Sampo Pyysalo from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!