transformers/examples/summarization

Get CNN Data

To reproduce the authors' results on the CNN/Daily Mail dataset, you first need the CNN and Daily Mail stories from Kyunghyun Cho's website (the links next to "Stories"), downloaded into the same folder. A preprocessed copy is also hosted on Hugging Face's S3; download and uncompress it by running:

wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz

This should create a directory called cnn_dm/ containing files like test.source. To use your own data, copy that file format: each article to be summarized goes on its own line.
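For example, a minimal sketch of writing your own articles in that format (the my_data/ directory and the articles below are purely illustrative; reference summaries go in a matching .target file, one per line):

import os

# illustrative articles; each becomes one line of the .source file
articles = [
    "First article to summarize ...",
    "Second article to summarize ...",
]

os.makedirs("my_data", exist_ok=True)
with open("my_data/test.source", "w", encoding="utf-8") as f:
    for article in articles:
        # collapse internal newlines so each article stays on a single line
        f.write(" ".join(article.split()) + "\n")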

Evaluation

To create summaries for each article in the dataset, run:

python evaluate_cnn.py <path_to_test.source> test_generations.txt <model-name> --score_path rouge_scores.txt

The default batch size, 8, fits in 16 GB of GPU memory, but you may need to adjust it for your hardware.
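Under the hood, an evaluation run is batched summary generation. A minimal sketch of that step, assuming the facebook/bart-large-cnn checkpoint; the batch size and generation settings here are illustrative, not necessarily the script's defaults:

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)

# one article per line, exactly the test.source format described above
with open("cnn_dm/test.source", encoding="utf-8") as f:
    articles = [line.strip() for line in f][:8]

batch = tokenizer(articles, truncation=True, padding=True, return_tensors="pt")
summary_ids = model.generate(
    batch["input_ids"].to(device),
    attention_mask=batch["attention_mask"].to(device),
    num_beams=4,
    max_length=142,
    early_stopping=True,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))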

Training

Run/modify finetune_bart.sh or finetune_t5.sh
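For example (a minimal sketch; the data paths and hyperparameters live inside the script, so inspect and adjust it before launching):

bash finetune_bart.sh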

Stanford CoreNLP Setup

# PTB-tokenize the text in $1 and write the result to $2 (used below on the
# reference summaries and the generated hypotheses)
ptb_tokenize () {
    cat $1 | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $2
}

sudo apt install openjdk-8-jre-headless
sudo apt-get install ant
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip
cd stanford-corenlp-full-2018-10-05
export CLASSPATH=stanford-corenlp-3.9.2.jar:stanford-corenlp-3.9.2-models.jar

Then run ptb_tokenize on test.target and your generated hypotheses.
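For example, assuming the files produced above (the .tokenized output names are just a suggestion):

ptb_tokenize cnn_dm/test.target test.target.tokenized
ptb_tokenize test_generations.txt test_generations.tokenized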

Rouge Setup

Install files2rouge following the instructions in its repository. I also needed to run sudo apt-get install libxml-parser-perl.

from files2rouge import files2rouge
from files2rouge import settings
files2rouge.run(<path_to_tokenized_hypo>,
                <path_to_tokenized_target>,
                saveto='rouge_output.txt')
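For example, with the tokenized files from the previous step (filenames as suggested above):

from files2rouge import files2rouge

files2rouge.run('test_generations.tokenized',
                'test.target.tokenized',
                saveto='rouge_output.txt')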