# PhoBERT
## Overview
The PhoBERT model was proposed in [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92.pdf) by Dat Quoc Nguyen and Anh Tuan Nguyen.
The abstract from the paper is the following:
*We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual
language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent
best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple
Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and
Natural language inference.*
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/PhoBERT).
## Usage example
```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer
>>> phobert = AutoModel.from_pretrained("vinai/phobert-base")
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
>>> # INPUT TEXT MUST BE ALREADY WORD-SEGMENTED! (see the segmentation sketch below)
>>> line = "Tôi là sinh_viên trường đại_học Công_nghệ ."
>>> input_ids = torch.tensor([tokenizer.encode(line)])
>>> with torch.no_grad():
...     features = phobert(input_ids)  # features.last_hidden_state holds the final hidden states
>>> # With TensorFlow 2.0+:
>>> # from transformers import TFAutoModel
>>> # phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
```
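The word-segmentation requirement above means raw Vietnamese text needs a word-segmentation pass before it reaches the tokenizer. The PhoBERT authors perform this pre-processing with VnCoreNLP's RDRSegmenter; the sketch below is one possible setup, assuming the `py_vncorenlp` package with its `download_model`, `VnCoreNLP`, and `word_segment` APIs and a placeholder save directory:

```python
>>> # Assumes `pip install py_vncorenlp` and a local directory for the VnCoreNLP model files
>>> import py_vncorenlp

>>> # One-time download of the VnCoreNLP model files (placeholder path)
>>> py_vncorenlp.download_model(save_dir="/path/to/vncorenlp")

>>> # Load only the word-segmentation ("wseg") annotator
>>> rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir="/path/to/vncorenlp")

>>> # Returns a list of word-segmented sentences,
>>> # e.g. ["Tôi là sinh_viên trường đại_học Công_nghệ ."]
>>> segmented = rdrsegmenter.word_segment("Tôi là sinh viên trường đại học Công nghệ.")
```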
The PhoBERT implementation is the same as BERT, except for tokenization. Refer to the [BERT documentation](bert) for information on configuration classes and their parameters. The PhoBERT-specific tokenizer is documented below.
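As a complement to the example above, the tokenizer can also be called through the standard `__call__` API, which adds the special tokens and returns the input tensors and attention mask directly; a minimal sketch using the same word-segmented sentence:

```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
>>> phobert = AutoModel.from_pretrained("vinai/phobert-base")

>>> line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

>>> # Inspect the BPE sub-word tokens produced for the word-segmented input
>>> tokens = tokenizer.tokenize(line)

>>> # __call__ adds special tokens and returns input_ids and attention_mask as PyTorch tensors
>>> inputs = tokenizer(line, return_tensors="pt")

>>> with torch.no_grad():
...     features = phobert(**inputs)
```

The tokenizer's BPE vocabulary was learned on word-segmented Vietnamese text, which is why the segmentation step above is required before tokenization.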
## PhobertTokenizer
[[autodoc]] PhobertTokenizer