From dfa10a41ba3fd9c5289bebd3baeff8792b1b2281 Mon Sep 17 00:00:00 2001
From: Sam Shleifer
Date: Sun, 30 Aug 2020 16:29:24 -0400
Subject: [PATCH] [s2s README] Add more dataset download instructions (#6737)

---
 examples/seq2seq/README.md | 32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/examples/seq2seq/README.md b/examples/seq2seq/README.md
index db047fe9562..3a50c373218 100644
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -6,8 +6,9 @@ Please tag @sshleifer with any issues/unexpected behaviors, or send a PR!
 For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).
 
-### Data
-XSUM Data:
+## Datasets
+
+#### XSUM
 ```bash
 cd examples/seq2seq
 wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
 tar -xzvf xsum.tar.gz
@@ -17,23 +18,33 @@ export XSUM_DIR=${PWD}/xsum
 this should make a directory called `xsum/` with files like `test.source`.
 To use your own data, copy that files format. Each article to be summarized is on its own line.
 
-CNN/DailyMail data
+#### CNN/DailyMail
 ```bash
 cd examples/seq2seq
-wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
-tar -xzvf cnn_dm.tgz
+wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm_v2.tgz
+tar -xzvf cnn_dm_v2.tgz  # empty lines removed
+mv cnn_cln cnn_dm
 export CNN_DIR=${PWD}/cnn_dm
-this should make a directory called `cnn_dm/` with files like `test.source`.
 ```
+this should make a directory called `cnn_dm/` with 6 files.
 
-WMT16 English-Romanian Translation Data:
+#### WMT16 English-Romanian Translation Data
 download with this command:
 ```bash
 wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
 tar -xzvf wmt_en_ro.tar.gz
 export ENRO_DIR=${PWD}/wmt_en_ro
-this should make a directory called `wmt_en_ro/` with files like `test.source`.
 ```
+this should make a directory called `wmt_en_ro/` with 6 files.
+
+#### WMT English-German
+```bash
+wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_de.tgz
+tar -xzvf wmt_en_de.tgz
+export DATA_DIR=${PWD}/wmt_en_de
+```
+
+#### Private Data
 
 If you are using your own data, it must be formatted as one directory with 6 files:
 ```
@@ -75,7 +86,8 @@ Datasets: `LegacySeq2SeqDataset` will be used for all tokenizers without a `prep
 
 Future work/help wanted: A new dataset to support multilingual tasks.
 
-### Command Line Options
+### Finetuning Scripts
+All finetuning bash scripts call finetune.py (or distillation.py) with reasonable default command line arguments, but they usually require extra command line arguments to work.
 
 To see all the possible command line options, run:
 
@@ -110,6 +122,8 @@ The following command should work on a 16GB GPU:
     --model_name_or_path facebook/bart-large
 ```
 
+There is a starter finetuning script for Pegasus at `finetune_pegasus_xsum.sh`.
+
 ### Translation Finetuning
 
 First, follow the wmt_en_ro download instructions.
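
On the "Private Data" layout: the hunk above is cut off before the file listing, but the earlier `test.source` example suggests the six files are `{train,val,test}.{source,target}`. A minimal sketch of assembling such a directory under that assumption; the `my_data` name and the toy contents are purely illustrative:

```bash
# Hypothetical custom-dataset directory; the {train,val,test}.{source,target}
# file names are an assumption inferred from the `test.source` example above.
mkdir -p my_data
for split in train val test; do
  # one example per line: inputs in *.source, references in *.target
  printf 'First article body.\nSecond article body.\n' > my_data/${split}.source
  printf 'First summary.\nSecond summary.\n' > my_data/${split}.target
done
export DATA_DIR=${PWD}/my_data
```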
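
Every download block above follows the same wget / tar / export pattern, so one post-extraction sanity check works for all of them. Assuming the six-file layout, each `.source`/`.target` pair should have equal line counts, since every example occupies exactly one line:

```bash
# sanity-check an extracted dataset directory (XSUM shown; substitute
# $CNN_DIR, $ENRO_DIR, or $DATA_DIR as appropriate)
ls $XSUM_DIR    # expect train/val/test .source and .target files
for split in train val test; do
  # paired line counts should match
  wc -l $XSUM_DIR/${split}.source $XSUM_DIR/${split}.target
done
```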
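
On the new "Finetuning Scripts" paragraph: a hedged sketch of what the "extra command line arguments" might look like. `--model_name_or_path facebook/bart-large` is quoted from the 16GB-GPU example touched by the last hunk; `finetune.sh`, `--data_dir`, and `--output_dir` are assumptions to check against `python finetune.py --help`:

```bash
cd examples/seq2seq
# assumption: the wrapper script forwards extra flags through to finetune.py
./finetune.sh \
    --model_name_or_path facebook/bart-large \
    --data_dir $XSUM_DIR \
    --output_dir xsum_finetune_output
```

The same pattern would presumably apply to the Pegasus starter script mentioned above, pointed at `$XSUM_DIR`.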