Get Preprocessed CNN Data

To reproduce the authors' results on the CNN/Daily Mail dataset, you first need the preprocessed data. The tarball below already bundles the CNN and Daily Mail stories, so there is no need to fetch the raw "Stories" archives from Kyunghyun Cho's website separately. Download and uncompress it by running:

wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz

This should create a directory called cnn_dm/ containing files like test.source. To use your own data, copy that file format: each article to be summarized goes on its own line.
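
If you are preparing your own dataset, here is a minimal sketch of writing files in that layout (the example articles and file names are purely illustrative):

# a minimal sketch of writing data in the cnn_dm/ layout: one example per line
articles = ["First document to summarize ...", "Second document ..."]
summaries = ["First reference summary.", "Second reference summary."]

with open("test.source", "w") as src, open("test.target", "w") as tgt:
    for article, summary in zip(articles, summaries):
        # newlines inside a document must be stripped, since each line is one example
        src.write(article.replace("\n", " ") + "\n")
        tgt.write(summary.replace("\n", " ") + "\n")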

Evaluation

To create summaries for each article in the dataset, run:

python evaluate_cnn.py <path_to_test.source> cnn_test_summaries.txt

The default batch size of 8 fits in 16GB of GPU memory, but you may need to adjust it for your system.
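
For reference, here is a minimal sketch of the generation step that evaluate_cnn.py performs, written against the current transformers API; the checkpoint name, beam size, and length limits below are illustrative assumptions, not necessarily the script's exact settings:

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

# illustrative checkpoint; the script may pin a different one
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model.eval()

with open("test.source") as f:
    articles = [line.strip() for line in f][:8]  # one batch of 8 articles

batch = tokenizer(articles, max_length=1024, truncation=True,
                  padding=True, return_tensors="pt")
with torch.no_grad():
    summary_ids = model.generate(batch["input_ids"],
                                 attention_mask=batch["attention_mask"],
                                 num_beams=4, max_length=142, early_stopping=True)
summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)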

Training

Run or modify run_train.sh.
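
For intuition, here is a minimal sketch of the objective the training script optimizes; the real run_bart_sum.py wraps this in pytorch-lightning with proper batching, padding, and learning-rate scheduling, and the argument names below follow the current transformers API, which may differ from the version this example was written against:

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # illustrative lr

article = "Some long news article ..."  # one line of train.source
summary = "Its reference summary."      # the matching line of train.target

inputs = tokenizer([article], max_length=1024, truncation=True, return_tensors="pt")
targets = tokenizer([summary], max_length=142, truncation=True, return_tensors="pt")

# labels provide the decoder targets; BART shifts them right internally
loss = model(input_ids=inputs["input_ids"],
             attention_mask=inputs["attention_mask"],
             labels=targets["input_ids"]).loss
loss.backward()
optimizer.step()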

Where is the code?

The core model is in src/transformers/modeling_bart.py. This directory only contains examples.

(WIP) Rouge Scores

Stanford CoreNLP Setup

sudo apt install openjdk-8-jre-headless
sudo apt-get install ant
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip
cd stanford-corenlp-full-2018-10-05
export CLASSPATH=stanford-corenlp-3.9.2.jar:stanford-corenlp-3.9.2-models.jar

# PTB-tokenize the file $1 (one example per line) and write the output to $2
ptb_tokenize () {
    cat "$1" | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > "$2"
}

Then run ptb_tokenize on test.target and on your generated hypotheses (e.g. cnn_test_summaries.txt from the evaluation step above).

Rouge Setup

Install files2rouge following the instructions at https://github.com/pltrdy/files2rouge. I also needed to run sudo apt-get install libxml-parser-perl.

from files2rouge import files2rouge
from files2rouge import settings
files2rouge.run(<path_to_tokenized_hypo>,
                <path_to_tokenized_target>,
                saveto='rouge_output.txt')
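
As a concrete usage example with hypothetical file names (the outputs of ptb_tokenize above), note that the saved report is plain text and can simply be printed:

from files2rouge import files2rouge

# hypothetical paths: whatever you passed as $2 to ptb_tokenize
files2rouge.run("cnn_test_summaries.tokenized",
                "test.target.tokenized",
                saveto="rouge_output.txt")

with open("rouge_output.txt") as f:
    print(f.read())  # ROUGE-1/2/L report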