diff --git a/examples/language-modeling/README.md b/examples/language-modeling/README.md
index d2499651cd4..a479fd67163 100644
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -22,8 +22,7 @@ ALBERT, BERT, DistilBERT, RoBERTa, XLNet... GPT and GPT-2 are trained or fine-tu
 loss. XLNet uses permutation language modeling (PLM), you can find more information about the differences between those
 objectives in our [model summary](https://huggingface.co/transformers/model_summary.html).
 
-These scripts leverage the 🤗 Datasets library and the Trainer API. You can easily customize them to your needs if you
-need extra processing on your datasets.
+There are two sets of scripts provided. The first set leverages the Trainer API. The second set, with `no_trainer` in the suffix, uses a custom training loop and leverages the 🤗 Accelerate library. Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.
 
 **Note:** The old script `run_language_modeling.py` is still available [here](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py).
 
@@ -60,6 +59,15 @@ python run_clm.py \
     --output_dir /tmp/test-clm
 ```
 
+This uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can use or adapt the `run_clm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:
+
+```bash
+python run_clm_no_trainer.py \
+    --dataset_name wikitext \
+    --dataset_config_name wikitext-2-raw-v1 \
+    --model_name_or_path gpt2 \
+    --output_dir /tmp/test-clm
+```
 
 ### RoBERTa/BERT/DistilBERT and masked language modeling
 
@@ -95,23 +103,33 @@ python run_mlm.py \
 If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
 concatenates all texts and then splits them in blocks of the same length).
 
+This uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can use or adapt the `run_mlm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:
+
+```bash
+python run_mlm_no_trainer.py \
+    --dataset_name wikitext \
+    --dataset_config_name wikitext-2-raw-v1 \
+    --model_name_or_path roberta-base \
+    --output_dir /tmp/test-mlm
+```
+
 **Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make sure
 all your batches have the same length.
 
 ### Whole word masking
 
-This part was moved to `examples/research_projects/mlm_wwm`. 
+This part was moved to `examples/research_projects/mlm_wwm`.
 
 ### XLNet and permutation language modeling
 
-XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method 
-to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input 
+XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method
+to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input
 sequence factorization order.
 
-We use the `--plm_probability` flag to define the ratio of length of a span of masked tokens to surrounding 
+We use the `--plm_probability` flag to define the ratio of length of a span of masked tokens to surrounding
 context length for permutation language modeling.
 
-The `--max_span_length` flag may also be used to limit the length of a span of masked tokens used 
+The `--max_span_length` flag may also be used to limit the length of a span of masked tokens used
 for permutation language modeling.
 
 Here is how to fine-tune XLNet on wikitext-2:
diff --git a/examples/language-modeling/run_mlm_no_trainer.py b/examples/language-modeling/run_mlm_no_trainer.py
index a943bfd4a71..71a3bbe0c5a 100755
--- a/examples/language-modeling/run_mlm_no_trainer.py
+++ b/examples/language-modeling/run_mlm_no_trainer.py
@@ -56,7 +56,7 @@ MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
 
 
 def parse_args():
-    parser = argparse.ArgumentParser(description="Finetune a transformers model on a text classification task")
+    parser = argparse.ArgumentParser(description="Finetune a transformers model on a Masked Language Modeling task")
     parser.add_argument(
         "--dataset_name",
         type=str,
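For context on what the `no_trainer` variants do differently, the sketch below illustrates the general 🤗 Accelerate training-loop pattern such scripts follow, using masked language modeling as the example. This is not the actual `run_mlm_no_trainer.py` code: the model name, toy corpus, batch size, and learning rate are placeholder choices, and argument parsing, evaluation, learning-rate scheduling, and checkpointing are left out.

```python
# Minimal sketch of an Accelerate-based MLM training loop (illustrative only).
import torch
from torch.utils.data import DataLoader

from accelerate import Accelerator
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Toy corpus standing in for a 🤗 Datasets split such as wikitext-2-raw-v1.
texts = ["The quick brown fox jumps over the lazy dog."] * 32
encodings = tokenizer(texts, truncation=True, max_length=128)
examples = [{"input_ids": ids} for ids in encodings["input_ids"]]

# The collator applies random masking; the example scripts expose the same knob
# as an --mlm_probability option.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
dataloader = DataLoader(examples, batch_size=8, shuffle=True, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Accelerator takes care of device placement and, when launched with
# `accelerate launch`, of distributing the loop across processes.
accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for epoch in range(1):
    for batch in dataloader:
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        optimizer.zero_grad()

accelerator.wait_for_everyone()
unwrapped = accelerator.unwrap_model(model)
unwrapped.save_pretrained("/tmp/test-mlm", save_function=accelerator.save)
```

Run as a plain Python script this executes on a single device; launched with `accelerate launch`, the same loop runs distributed according to whatever `accelerate config` has set up.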
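Likewise, the `--plm_probability` and `--max_span_length` flags described in the XLNet section correspond to the `plm_probability` and `max_span_length` arguments of `DataCollatorForPermutationLanguageModeling` in 🤗 Transformers. The snippet below only illustrates that mapping; the tokenizer, sentence, and the 1/6 and 5 values are example choices rather than recommendations.

```python
# Illustrative use of the permutation-language-modeling data collator.
from transformers import AutoTokenizer, DataCollatorForPermutationLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# plm_probability is the ratio span_length / context_length, so with 1/6 a masked
# span of length L is placed inside a context of roughly 6 * L tokens;
# max_span_length caps the span length at 5 tokens.
collator = DataCollatorForPermutationLanguageModeling(
    tokenizer=tokenizer,
    plm_probability=1 / 6,
    max_span_length=5,
)

# The collator requires an even sequence length, so pad to a fixed size.
encoding = tokenizer(
    "A short example sentence for permutation language modeling.",
    truncation=True,
    padding="max_length",
    max_length=16,
)
batch = collator([{"input_ids": encoding["input_ids"]}])

# XLNet consumes input_ids, perm_mask, target_mapping and labels.
print({name: tensor.shape for name, tensor in batch.items()})
```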