# Token classification
Fine-tuning the library models for token classification tasks such as Named Entity Recognition (NER) or Parts-of-Speech tagging (POS). The main script `run_ner.py` leverages the 🤗 Datasets library and the Trainer API. You can easily customize it if you need extra processing on your datasets. It will run either on a dataset hosted on our hub or on your own text files for training and validation.
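If you want to peek at a hub dataset before training, you can load it with 🤗 Datasets directly. A minimal sketch, using the CoNLL-2003 dataset from the example below:

```python
from datasets import load_dataset

# Load CoNLL-2003 from the hub and inspect one training example: a list of
# tokens plus the matching per-token NER tag ids.
dataset = load_dataset("conll2003")
example = dataset["train"][0]
print(example["tokens"])
print(example["ner_tags"])
```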
The following example fine-tunes BERT on CoNLL-2003:
```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name conll2003 \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```
Or you can just run the bash script `run.sh`.
To run on your own training and validation files, use the following command:
```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --train_file path_to_train_file \
  --validation_file path_to_validation_file \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```
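The exact file format the script expects depends on its arguments, but since it builds on 🤗 Datasets you can sanity-check that your files parse before launching a long run. A small sketch, assuming JSON-lines files (the `path_to_*` placeholders are the same as above):

```python
from datasets import load_dataset

# Sanity-check that the training/validation files load as JSON lines before
# starting a run. The column names printed here must match what you pass to
# (or what is auto-detected by) run_ner.py.
data = load_dataset(
    "json",
    data_files={"train": "path_to_train_file", "validation": "path_to_validation_file"},
)
print(data["train"].column_names)
print(data["train"][0])
```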
Note: This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library), as it uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in this table; if it doesn't, you can still use the old version of the script.
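A quick way to verify this programmatically (a small check, not part of the script itself):

```python
from transformers import AutoTokenizer

# Fast tokenizers expose word_ids(), which run_ner.py needs to align labels
# with sub-word tokens; `is_fast` tells you whether you got one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.is_fast)  # True means the model is usable with this script
```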
## Old version of the script
You can find the old version of the PyTorch script here.
## TensorFlow version
The following examples are covered in this section:
- NER on the GermEval 2014 (German NER) dataset
- Emerging and Rare Entities task: WNUT’17 (English NER) dataset
Details and results for the fine-tuning are provided by @stefan-it.
### GermEval 2014 (German NER) dataset
#### Data (Download and pre-processing steps)
Data can be obtained from the GermEval 2014 shared task page.
Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; in a pre-processing step, only the two relevant columns (token and outer span NER annotation) are extracted:
```bash
curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
```
The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96'`, `'\u200e'`, `'\x95'`, `'\xad'` or `'\x80'`. One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s.
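You can reproduce this behaviour directly (an illustration, using the multilingual model from further below):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# Control characters are stripped during basic tokenization, so the
# tokenizer yields no sub-tokens for them at all.
print(tokenizer.tokenize("\x96"))    # []
print(tokenizer.tokenize("Berlin"))  # a non-empty list of sub-tokens
```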
The `preprocess.py` script located in the `scripts` folder a) filters these tokens and b) splits longer sentences into smaller ones (once the max. sub-token length is reached).
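In essence, the splitting works by counting sub-tokens per sentence and starting a new sentence once the budget is exhausted. A simplified sketch of that logic (not the actual script):

```python
from transformers import AutoTokenizer

# Simplified sketch of what scripts/preprocess.py does: drop tokens the
# tokenizer maps to nothing, and insert a sentence break once the sub-token
# budget would be exceeded. Not the actual script.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
max_length = 128
subword_len = 0

with open("train.txt.tmp", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip()
        if not line:                       # blank line = sentence boundary
            print()
            subword_len = 0
            continue
        token = line.split()[0]
        n = len(tokenizer.tokenize(token))
        if n == 0:                         # "control character" token: drop it
            continue
        if subword_len + n > max_length:   # split overly long sentences
            print()
            subword_len = 0
        print(line)
        subword_len += n
```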
Let's define some variables that we need for further pre-processing steps and training the model:
```bash
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
```
Run the pre-processing script on training, dev and test datasets:
```bash
python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
```
The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so a dedicated set of labels must be used:
```bash
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
```
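Each non-empty line of the pre-processed files holds a token and its label separated by a space, so the same label set could also be collected with a few lines of Python (an equivalent sketch of the one-liner above):

```python
# Equivalent of the shell pipeline above: collect the unique label column.
labels = set()
for path in ["train.txt", "dev.txt", "test.txt"]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) == 2:          # skip blank sentence-separator lines
                labels.add(parts[1])
with open("labels.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(labels)) + "\n")
```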
#### Prepare the run
Additional environment variables must be set:
```bash
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
```