# Token classification
Fine-tuning the library models for token classification tasks such as Named Entity Recognition (NER) or Parts-of-Speech tagging (POS). The main script `run_ner.py` leverages the 🤗 Datasets library and the Trainer API. You can easily customize it if you need extra processing on your datasets. It will run either on a dataset hosted on our hub or on your own text files for training and validation.
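If you want to peek at a hub dataset before training, you can load it with 🤗 Datasets directly. A minimal sketch, using the CoNLL-2003 dataset from the example below:

```python
from datasets import load_dataset

# Load CoNLL-2003 from the hub and inspect one training example: a list of
# tokens plus the matching per-token NER tag ids.
dataset = load_dataset("conll2003")
example = dataset["train"][0]
print(example["tokens"])
print(example["ner_tags"])
```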
The following example fine-tunes BERT on CoNLL-2003:
```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name conll2003 \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```
Or you can just run the bash script `run.sh`.
To run on your own training and validation files, use the following command:
```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --train_file path_to_train_file \
  --validation_file path_to_validation_file \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```
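The exact file format the script expects depends on its arguments, but since it builds on 🤗 Datasets you can sanity-check that your files parse before launching a long run. A small sketch, assuming JSON-lines files (the `path_to_*` placeholders are the same as above):

```python
from datasets import load_dataset

# Sanity-check that the training/validation files load as JSON lines before
# starting a run. The column names printed here must match what you pass to
# (or what is auto-detected by) run_ner.py.
data = load_dataset(
    "json",
    data_files={"train": "path_to_train_file", "validation": "path_to_validation_file"},
)
print(data["train"].column_names)
print(data["train"][0])
```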
Note: This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library), as it uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in this table; if it doesn't, you can still use the old version of the script.
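A quick way to verify this programmatically (a small check, not part of the script itself):

```python
from transformers import AutoTokenizer

# Fast tokenizers expose word_ids(), which run_ner.py needs to align labels
# with sub-word tokens; `is_fast` tells you whether you got one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.is_fast)  # True means the model is usable with this script
```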
## Old version of the script
You can find the old version of the PyTorch script here.
## TensorFlow version
The following examples are covered in this section:
- NER on the GermEval 2014 (German NER) dataset
- Emerging and Rare Entities task: WNUT’17 (English NER) dataset
Details and results for the fine-tuning are provided by @stefan-it.
### GermEval 2014 (German NER) dataset
#### Data (Download and pre-processing steps)
Data can be obtained from the GermEval 2014 shared task page.
Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; in a pre-processing step, only the two relevant columns (token and outer span NER annotation) are extracted:
```bash
curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
```
The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96'`, `'\u200e'`, `'\x95'`, `'\xad'` or `'\x80'`. One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s.
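You can reproduce this behaviour directly (an illustration, using the multilingual model from further below):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# Control characters are stripped during basic tokenization, so the
# tokenizer yields no sub-tokens for them at all.
print(tokenizer.tokenize("\x96"))    # []
print(tokenizer.tokenize("Berlin"))  # a non-empty list of sub-tokens
```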
The `preprocess.py` script located in the `scripts` folder a) filters these tokens and b) splits longer sentences into smaller ones (once the max. sub-token length is reached).
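In essence, the splitting works by counting sub-tokens per sentence and starting a new sentence once the budget is exhausted. A simplified sketch of that logic (not the actual script):

```python
from transformers import AutoTokenizer

# Simplified sketch of what scripts/preprocess.py does: drop tokens the
# tokenizer maps to nothing, and insert a sentence break once the sub-token
# budget would be exceeded. Not the actual script.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
max_length = 128
subword_len = 0

with open("train.txt.tmp", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip()
        if not line:                       # blank line = sentence boundary
            print()
            subword_len = 0
            continue
        token = line.split()[0]
        n = len(tokenizer.tokenize(token))
        if n == 0:                         # "control character" token: drop it
            continue
        if subword_len + n > max_length:   # split overly long sentences
            print()
            subword_len = 0
        print(line)
        subword_len += n
```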
Let's define some variables that we need for further pre-processing steps and training the model:
```bash
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
```
Run the pre-processing script on training, dev and test datasets:
```bash
python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
```
The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so a dedicated set of labels must be used:
```bash
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
```
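Each non-empty line of the pre-processed files holds a token and its label separated by a space, so the same label set could also be collected with a few lines of Python (an equivalent sketch of the one-liner above):

```python
# Equivalent of the shell pipeline above: collect the unique label column.
labels = set()
for path in ["train.txt", "dev.txt", "test.txt"]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) == 2:          # skip blank sentence-separator lines
                labels.add(parts[1])
with open("labels.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(labels)) + "\n")
```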
#### Prepare the run
Additional environment variables must be set:
```bash
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
```