mirror of
https://github.com/huggingface/transformers.git
synced 2025-08-01 10:41:07 +06:00
[s2s README] Add more dataset download instructions (#6737)
parent
32fe44086c
commit
dfa10a41ba
@@ -6,8 +6,9 @@ Please tag @sshleifer with any issues/unexpected behaviors, or send a PR!
|
|||||||
For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).
|
For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).
|
||||||
|
|
||||||
|
|
||||||
### Data
|
## Datasets
|
||||||
XSUM Data:
|
|
||||||
|
#### XSUM:
|
||||||
```bash
|
```bash
|
||||||
cd examples/seq2seq
|
cd examples/seq2seq
|
||||||
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
|
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
|
||||||
@@ -17,23 +18,33 @@ export XSUM_DIR=${PWD}/xsum
|
|||||||
This should make a directory called `xsum/` with files like `test.source`.
|
This should make a directory called `xsum/` with files like `test.source`.
|
||||||
To use your own data, copy that file's format. Each article to be summarized is on its own line.
|
To use your own data, copy that file's format. Each article to be summarized is on its own line.
|
||||||
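The one-example-per-line layout can be sketched with a toy dataset. In the snippet below, `MY_DATA_DIR`, the sample sentences, and the `test.target` file name are illustrative assumptions (only `test.source` is named above). The check simply confirms that the two files have the same number of lines.

```bash
# Build a throwaway dataset directory (hypothetical path, for illustration only).
MY_DATA_DIR="$(mktemp -d)/my_data"
mkdir -p "$MY_DATA_DIR"

# One article per line in the .source file.
printf '%s\n' \
  "First article, with all of its text on a single line." \
  "Second article, also on a single line." \
  > "$MY_DATA_DIR/test.source"

# Assumed companion file: one summary per line, in the same order.
printf '%s\n' \
  "Summary of the first article." \
  "Summary of the second article." \
  > "$MY_DATA_DIR/test.target"

# Line counts must agree, since line i of one file pairs with line i of the other.
n_src=$(wc -l < "$MY_DATA_DIR/test.source" | tr -d ' ')
n_tgt=$(wc -l < "$MY_DATA_DIR/test.target" | tr -d ' ')
if [ "$n_src" -eq "$n_tgt" ]; then
  echo "ok: $n_src examples"
else
  echo "mismatch: $n_src sources vs $n_tgt targets"
fi
```

An article that contains embedded newlines would be split into several examples, so any line breaks inside an article need to be removed or replaced before writing the file.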
|
|
||||||
CNN/DailyMail data
|
#### CNN/DailyMail
|
||||||
```bash
|
```bash
|
||||||
cd examples/seq2seq
|
cd examples/seq2seq
|
||||||
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
|
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm_v2.tgz
|
||||||
tar -xzvf cnn_dm.tgz
|
tar -xzvf cnn_dm_v2.tgz # empty lines removed
|
||||||
|
mv cnn_cln cnn_dm
|
||||||
export CNN_DIR=${PWD}/cnn_dm
|
export CNN_DIR=${PWD}/cnn_dm
|
||||||
this should make a directory called `cnn_dm/` with files like `test.source`.
|
|
||||||
```
|
```
|
||||||
|
This should make a directory called `cnn_dm/` with 6 files.
|
||||||
|
|
||||||
WMT16 English-Romanian Translation Data:
|
#### WMT16 English-Romanian Translation Data:
|
||||||
Download with this command:
|
Download with this command:
|
||||||
```bash
|
```bash
|
||||||
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
|
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
|
||||||
tar -xzvf wmt_en_ro.tar.gz
|
tar -xzvf wmt_en_ro.tar.gz
|
||||||
export ENRO_DIR=${PWD}/wmt_en_ro
|
export ENRO_DIR=${PWD}/wmt_en_ro
|
||||||
this should make a directory called `wmt_en_ro/` with files like `test.source`.
|
|
||||||
```
|
```
|
||||||
|
This should make a directory called `wmt_en_ro/` with 6 files.
|
||||||
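A quick way to confirm that an extracted dataset directory holds the expected 6 files is to count them. The sketch below is only an illustration: `check_data_dir` is a hypothetical helper, and the demo runs on a temporary directory with placeholder file names rather than on the real data.

```bash
# Count the regular files directly inside a directory and compare with the expected 6.
check_data_dir() {
  dir="$1"
  n=0
  for f in "$dir"/*; do
    if [ -f "$f" ]; then n=$((n + 1)); fi
  done
  if [ "$n" -eq 6 ]; then
    echo "ok: $dir has 6 files"
  else
    echo "unexpected: $dir has $n files, expected 6"
    return 1
  fi
}

# Demo on a throwaway directory with placeholder file names (not the real dataset names).
demo_dir="$(mktemp -d)"
for i in 1 2 3 4 5 6; do
  : > "$demo_dir/placeholder_$i.txt"
done
check_data_dir "$demo_dir"
```

After the download above, `check_data_dir "$ENRO_DIR"` would run the same check on the real data.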
|
|
||||||
|
#### WMT English-German:
|
||||||
|
```bash
|
||||||
|
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_de.tgz
|
||||||
|
tar -xzvf wmt_en_de.tgz
|
||||||
|
export DATA_DIR=${PWD}/wmt_en_de
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Private Data
|
||||||
|
|
||||||
If you are using your own data, it must be formatted as one directory with 6 files:
|
If you are using your own data, it must be formatted as one directory with 6 files:
|
||||||
```
|
```
|
||||||
@@ -75,7 +86,8 @@ Datasets: `LegacySeq2SeqDataset` will be used for all tokenizers without a `prep
|
|||||||
Future work/help wanted: A new dataset to support multilingual tasks.
|
Future work/help wanted: A new dataset to support multilingual tasks.
|
||||||
|
|
||||||
|
|
||||||
### Command Line Options
|
### Finetuning Scripts
|
||||||
|
All finetuning bash scripts call `finetune.py` (or `distillation.py`) with reasonable default command line arguments. They usually need additional command line arguments of your own to work.
|
||||||
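One common way for a wrapper script to accept extra arguments is to forward `"$@"` to the underlying command. Whether the scripts here do exactly this is an assumption, so treat the following as a sketch of the pattern only. `fake_finetune` and `fake_finetune_script` are hypothetical stand-ins that print their arguments instead of training anything, and the `--data_dir` and `--output_dir` flags and their values are assumed for illustration.

```bash
# Stand-in for `python finetune.py`: prints each argument it receives on its own line.
fake_finetune() {
  printf '%s\n' "$@"
}

# Stand-in for a finetuning bash script: fixed defaults first, then whatever the caller adds.
fake_finetune_script() {
  fake_finetune --model_name_or_path facebook/bart-large "$@"
}

# Extra arguments supplied by the caller (flag names assumed, values hypothetical).
received=$(fake_finetune_script --data_dir ./xsum --output_dir ./xsum_results)
echo "$received"
```

Quoting `"$@"` keeps each extra argument intact, even when a value contains spaces.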
|
|
||||||
To see all the possible command line options, run:
|
To see all the possible command line options, run:
|
||||||
|
|
||||||
@@ -110,6 +122,8 @@ The following command should work on a 16GB GPU:
|
|||||||
--model_name_or_path facebook/bart-large
|
--model_name_or_path facebook/bart-large
|
||||||
```
|
```
|
||||||
|
|
||||||
|
There is a starter finetuning script for pegasus at `finetune_pegasus_xsum.sh`.
|
||||||
|
|
||||||
### Translation Finetuning
|
### Translation Finetuning
|
||||||
|
|
||||||
First, follow the wmt_en_ro download instructions.
|
First, follow the wmt_en_ro download instructions.
|
||||||
|