diff --git a/README.md b/README.md
index d7e9d1f7050..3dca9e294d7 100644
--- a/README.md
+++ b/README.md
@@ -159,7 +159,7 @@ Here is a detailed documentation of the classes in the package and how to use th
### Loading Google AI's pre-trained weights and PyTorch dump
-To load Google AI's pre-trained weight or a PyTorch saved instance of `BertForPreTraining`, the PyTorch model classes and the tokenizer can be instantiated as
+To load one of Google AI's pre-trained models or a PyTorch saved model (an instance of `BertForPreTraining` saved with `torch.save()`), the PyTorch model classes and the tokenizer can be instantiated as
```python
model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH)
```
@@ -180,8 +180,9 @@ where
- `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
- a path or url to a pretrained model archive containing:
- . `bert_config.json` a configuration file for the model
- . `pytorch_model.bin` a PyTorch dump of a pre-trained instance `BertForPreTraining` (saved with the usual `torch.save()`)
+
+ - `bert_config.json` a configuration file for the model, and
+ - `pytorch_model.bin` a PyTorch dump of a pre-trained instance of `BertForPreTraining` (saved with the usual `torch.save()`)
If `PRE_TRAINED_MODEL_NAME` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_pretrained_bert/modeling.py)) and stored in a cache folder to avoid future download (the cache folder can be found at `~/.pytorch_pretrained_bert/`).
@@ -304,15 +305,15 @@ Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretra
The optimizer accepts the following arguments:
- `lr` : learning rate
-- `warmup` : portion of t_total for the warmup, -1 means no warmup. Default : -1
+- `warmup` : portion of `t_total` for the warmup, `-1` means no warmup. Default : `-1`
- `t_total` : total number of training steps for the learning
- rate schedule, -1 means constant learning rate. Default : -1
-- `schedule` : schedule to use for the warmup (see above). Default : 'warmup_linear'
-- `b1` : Adams b1. Default : 0.9
-- `b2` : Adams b2. Default : 0.999
-- `e` : Adams epsilon. Default : 1e-6
-- `weight_decay_rate:` Weight decay. Default : 0.01
-- `max_grad_norm` : Maximum norm for the gradients (-1 means no clipping). Default : 1.0
+ rate schedule, `-1` means constant learning rate. Default : `-1`
+- `schedule` : schedule to use for the warmup (see above). Default : `'warmup_linear'`
+- `b1` : Adams b1. Default : `0.9`
+- `b2` : Adams b2. Default : `0.999`
+- `e` : Adams epsilon. Default : `1e-6`
+- `weight_decay_rate` : Weight decay. Default : `0.01`
+- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`
## Examples
@@ -467,21 +468,19 @@ The results were similar to the above FP32 results (actually slightly higher):
## Notebooks
-Comparing the PyTorch model and the TensorFlow model predictions
-
-We also include [three Jupyter Notebooks](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks) that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.
+We include [three Jupyter Notebooks](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks) that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.
- The first NoteBook ([Comparing-TF-and-PT-models.ipynb](./notebooks/Comparing-TF-and-PT-models.ipynb)) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.
- The second NoteBook ([Comparing-TF-and-PT-models-SQuAD.ipynb](./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb)) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the `BertForQuestionAnswering` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
-- The third NoteBook ([Comparing-TF-and-PT-models-MLM-NSP.ipynb](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb)) compares the predictions computed by the TensorFlow and the PyTorch models for masked token using the pre-trained masked language modeling model.
+- The third NoteBook ([Comparing-TF-and-PT-models-MLM-NSP.ipynb](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb)) compares the predictions computed by the TensorFlow and the PyTorch models for masked tokens using the pre-trained masked language modeling model.
Please follow the instructions given in the notebooks to run and modify them.
## Command-line interface
-A command-line interface is provided to convert a TensorFlow checkpoint in a PyTorch checkpoint
+A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the `BertForPreTraining` class (see above).
You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) into a PyTorch save file by using the [`convert_tf_checkpoint_to_pytorch.py`](convert_tf_checkpoint_to_pytorch.py) script.
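For illustration, a minimal sketch of the loading step described in the diff above, using the package's `from_pretrained()` API. The shortcut name `bert-base-uncased` is one of the released models listed in the README; the local path in the commented-out variant is a placeholder for a directory laid out as described (containing `bert_config.json` and `pytorch_model.bin`, e.g. the output of the conversion script):

```python
from pytorch_pretrained_bert import BertModel, BertTokenizer

# Load from one of the S3 shortcut names: the weights are downloaded on first
# use and cached under ~/.pytorch_pretrained_bert/.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Alternatively, load from a local directory containing bert_config.json and
# pytorch_model.bin, e.g. produced by convert_tf_checkpoint_to_pytorch.py
# (the path below is a placeholder).
# model = BertModel.from_pretrained('/path/to/converted/model/')
```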