Mirror of https://github.com/huggingface/transformers.git (synced 2025-07-31 02:02:21 +06:00)
Fixed all links. Removed TPU. Changed CLI to Converting TF models. Many minor formatting adjustments. Added "TODO Lysandre filled" where necessary.
This commit is contained in: parent 3f56ad5aff, commit f773faa258
@ -6,6 +6,7 @@
/* To keep the logo centered */
.wy-side-scroll {
    width: auto;
    font-size: 20px;
}

/* The div that holds the Hugging Face logo */
@ -131,7 +132,7 @@ a {
/* FONTS */
body{
    font-family: Calibre;
    font-size: 20px;
    font-size: 16px;
}

h1 {
@ -1 +0,0 @@
# Bertology

docs/source/bertology.rst (new file, 18 lines)
@ -0,0 +1,18 @@
BERTology
---------

There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:

* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341

In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):

* accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving head output values and gradients to be able to compute head importance scores and prune heads as explained in https://arxiv.org/abs/1905.10650.

To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/bertology.py>`_ which extracts information from and prunes a model pre-trained on MRPC.
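As an illustration, here is a minimal sketch of how these features can be used. It assumes the ``output_hidden_states``/``output_attentions`` configuration flags and the ``prune_heads`` method exposed by the library; exact argument names and the ordering of the returned tuple may differ in your version.

.. code-block:: python

    import torch
    from pytorch_transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # Assumed flags: ask the model to also return all hidden states and
    # all attention weights alongside its regular output.
    model = BertModel.from_pretrained('bert-base-uncased',
                                      output_hidden_states=True,
                                      output_attentions=True)
    model.eval()

    input_ids = torch.tensor([tokenizer.encode("Hello, my dog is cute")])
    with torch.no_grad():
        outputs = model(input_ids)
    # The trailing elements of the output tuple hold the hidden states and
    # attention weights when the flags above are set.

    # Head pruning as in https://arxiv.org/abs/1905.10650: remove heads 0 and 2
    # of layer 0 (assumed signature: a dict mapping layer index to head indices).
    model.prune_heads({0: [0, 2]})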
@ -1,4 +1,4 @@
|
||||
CLI
|
||||
Converting Tensorflow Models
|
||||
================================================
|
||||
|
||||
A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the ``BertForPreTraining`` class (for BERT) or a NumPy checkpoint into a PyTorch dump of the ``OpenAIGPTModel`` class (for OpenAI GPT).
|
||||
@ -6,9 +6,9 @@ A command-line interface is provided to convert a TensorFlow checkpoint in a PyT
|
||||
BERT
|
||||
^^^^
|
||||
|
||||
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `\ ``convert_tf_checkpoint_to_pytorch.py`` <./pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py>`_ script.
|
||||
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py>`_ script.
|
||||
|
||||
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `\ ``run_bert_extract_features.py`` <./examples/run_bert_extract_features.py>`_\ , `\ ``run_bert_classifier.py`` <./examples/run_bert_classifier.py>`_ and `\ ``run_bert_squad.py`` <./examples/run_bert_squad.py>`_\ ).
|
||||
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).
|
||||
|
||||
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
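Once converted, the dump can be consumed like any other PyTorch checkpoint. A minimal sketch (paths are placeholders, and the expected file names inside the directory depend on the library version):

.. code-block:: python

    import torch
    from pytorch_transformers import BertForPreTraining

    # Inspect the raw state dict written by the conversion script.
    state_dict = torch.load("pytorch_model.bin", map_location="cpu")
    print(len(state_dict), "tensors in the converted checkpoint")

    # Or load it through the usual pretrained-model machinery, assuming the
    # dump and its configuration file live together in one directory.
    model = BertForPreTraining.from_pretrained("/path/to/converted/model")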
|
||||
|
@ -6,22 +6,24 @@ Examples
|
||||
|
||||
* - Sub-section
|
||||
- Description
|
||||
* - `Training large models: introduction, tools and examples <#Training-large-models-introduction,-tools-and-examples>`_
|
||||
* - `Training large models: introduction, tools and examples <#introduction>`_
|
||||
- How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models
|
||||
* - `Fine-tuning with BERT: running the examples <#Fine-tuning-with-BERT-running-the-examples>`_
|
||||
- Running the examples in `\ ``./examples`` <./examples/>`_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py``
|
||||
* - `Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2 <#openai-gpt-transformer-xl-and-gpt-2-running-the-examples>`_
|
||||
- Running the examples in `\ ``./examples`` <./examples/>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py`` and ``run_gpt2.py``
|
||||
* - `Fine-tuning BERT-large on GPUs <#Fine-tuning-BERT-large-on-GPUs>`_
|
||||
* - `Fine-tuning with BERT: running the examples <#fine-tuning-bert-examples>`_
|
||||
- Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py``
|
||||
* - `Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2 <#fine-tuning>`_
|
||||
- Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py`` and ``run_gpt2.py``
|
||||
* - `Fine-tuning BERT-large on GPUs <#fine-tuning-bert-large>`_
|
||||
- How to fine tune ``BERT large``
|
||||
|
||||
|
||||
.. _introduction:
|
||||
|
||||
Training large models: introduction, tools and examples
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
BERT-base and BERT-large are respectively 110M and 340M parameter models and it can be difficult to fine-tune them on a single GPU with the recommended batch size for good performance (in most cases a batch size of 32).
|
||||
|
||||
To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts `\ ``run_bert_classifier.py`` <./examples/run_bert_classifier.py>`_ and `\ ``run_bert_squad.py`` <./examples/run_bert_squad.py>`_\ : gradient-accumulation, multi-gpu training, distributed training and 16-bits training . For more details on how to use these techniques you can read `the tips on training large batches in PyTorch <https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_ that I published earlier this month.
|
||||
To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ : gradient-accumulation, multi-gpu training, distributed training and 16-bits training . For more details on how to use these techniques you can read `the tips on training large batches in PyTorch <https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_ that I published earlier this year.
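As a reminder of what gradient accumulation does in these scripts, here is a generic PyTorch sketch (not the repository's exact training loop): the loss of each small batch is scaled down and gradients are accumulated over several batches before a single optimizer update.

.. code-block:: python

    import torch

    def train_with_accumulation(model, optimizer, dataloader, accumulation_steps=4):
        """Generic gradient-accumulation loop (a sketch, not the repository's code)."""
        model.train()
        optimizer.zero_grad()
        for step, (inputs, labels) in enumerate(dataloader):
            loss = torch.nn.functional.cross_entropy(model(inputs), labels)
            # Scale the loss so accumulated gradients average over the large batch.
            (loss / accumulation_steps).backward()
            if (step + 1) % accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()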
|
||||
|
||||
Here is how to use these techniques in our scripts:
|
||||
|
||||
@ -33,7 +35,7 @@ Here is how to use these techniques in our scripts:
|
||||
|
||||
To use 16-bits training and distributed training, you need to install NVIDIA's apex extension `as detailed here <https://github.com/nvidia/apex>`__. You will find more information regarding the internals of ``apex`` and how to use ``apex`` in `the doc and the associated repository <https://github.com/nvidia/apex>`_. The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in `the relevant PR of the present repository <https://github.com/huggingface/pytorch-pretrained-BERT/pull/116>`_.
|
||||
|
||||
Note: To use *Distributed Training*\ , you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see `the above mentioned blog post <(https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_\ ) for more details):
|
||||
Note: To use *Distributed Training*\ , you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see `the above mentioned blog post <https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_\ ) for more details):
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
@ -41,6 +43,8 @@ Note: To use *Distributed Training*\ , you will need to run one training script
|
||||
|
||||
Where ``$THIS_MACHINE_INDEX`` is a sequential index assigned to each of your machines (0, 1, 2...) and the machine with rank 0 has an IP address ``192.168.1.1`` and an open port ``1234``.
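For reference, a rough sketch of the per-process setup such scripts typically perform once launched (the real scripts parse many more arguments; this is an assumption-laden outline, not their exact code):

.. code-block:: python

    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="local rank passed by the distributed launcher")
    args = parser.parse_args()

    if args.local_rank == -1:
        # Single-machine, non-distributed fallback.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    else:
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)
        # NCCL is the usual backend for multi-GPU training on NVIDIA hardware.
        torch.distributed.init_process_group(backend="nccl")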
|
||||
|
||||
.. _fine-tuning-bert-examples:
|
||||
|
||||
Fine-tuning with BERT: running the examples
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
@ -333,10 +337,12 @@ LM Fine-tuning
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
The data should be a text file in the same format as `sample_text.txt <./samples/sample_text.txt>`_ (one sentence per line, docs separated by empty line).
|
||||
You can download an `exemplary training corpus <https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt>`_ generated from wikipedia articles and splitted into ~500k sentences with spaCy.
|
||||
You can download an `exemplary training corpus <https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt>`_ generated from wikipedia articles and split into ~500k sentences with spaCy.
|
||||
Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with ``train_batch_size=200`` and ``max_seq_length=128``\ :
|
||||
|
||||
Thanks to the work of @Rocketknight1 and @tholor there are now **several scripts** that can be used to fine-tune BERT using the pretraining objective (combination of masked-language modeling and next sentence prediction loss). These scripts are detailed in the `\ ``README`` <./examples/lm_finetuning/README.md>`_ of the `\ ``examples/lm_finetuning/`` <./examples/lm_finetuning/>`_ folder.
Thanks to the work of @Rocketknight1 and @tholor there are now **several scripts** that can be used to fine-tune BERT using the pretraining objective (combination of masked-language modeling and next sentence prediction loss). These scripts are detailed in the `README <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning/README.md>`_ of the `examples/lm_finetuning/ <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning/>`_ folder.
|
||||
|
||||
.. _fine-tuning:
|
||||
|
||||
OpenAI GPT, Transformer-XL and GPT-2: running the examples
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
@ -402,6 +408,8 @@ Unconditional generation:
|
||||
|
||||
The same options as in the original scripts are provided; please refer to the code of the example and the original repository of OpenAI.
|
||||
|
||||
.. _fine-tuning-BERT-large:
|
||||
|
||||
Fine-tuning BERT-large on GPUs
|
||||
------------------------------
|
||||
|
||||
@ -571,23 +579,4 @@ Here is an example on MNLI:
|
||||
global_step = 18408
|
||||
loss = 0.04755385363816904
|
||||
|
||||
This is the example of the ``bert-large-uncased-whole-word-masking-finetuned-mnli`` model
|
||||
|
||||
BERTology
|
||||
---------
|
||||
|
||||
There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:
|
||||
|
||||
|
||||
* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
|
||||
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
|
||||
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
|
||||
|
||||
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
|
||||
|
||||
|
||||
* accessing all the hidden-states of BERT/GPT/GPT-2,
|
||||
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
|
||||
* retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650.
|
||||
|
||||
To help you understand and use these features, we have added a specific example script: `\ ``bertology.py`` <./examples/bertology.py>`_ while extract information and prune a model pre-trained on MRPC.
|
||||
This is the example of the ``bert-large-uncased-whole-word-masking-finetuned-mnli`` model.
|
@ -11,8 +11,7 @@ Pytorch-Transformers
usage
examples
notebooks
tpu
cli
converting_tensorflow_models
migration
bertology
torchscript
@ -44,7 +43,7 @@ This repository contains op-for-op PyTorch reimplementations, pre-trained models
|
||||
* `Google/CMU's Transformer-XL model <https://github.com/kimiyoung/transformer-xl>`_\ , and
|
||||
* `OpenAI's GPT-2 model <https://blog.openai.com/better-language-models/>`_.
|
||||
|
||||
These implementations have been tested on several datasets (see the examples) and should match the performances of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the `Examples <#examples>`_ section below.
|
||||
These implementations have been tested on several datasets (see the examples) and should match the performances of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the `Examples <./examples.html>`_ section.
|
||||
|
||||
Here is some information on these models:
|
||||
|
||||
@ -54,12 +53,19 @@ This PyTorch implementation of BERT is provided with `Google's pre-trained model
|
||||
**OpenAI GPT** was released together with the paper `Improving Language Understanding by Generative Pre-Training <https://blog.openai.com/language-unsupervised/>`_ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
|
||||
This PyTorch implementation of OpenAI GPT is an adaptation of the `PyTorch implementation by HuggingFace <https://github.com/huggingface/pytorch-openai-transformer-lm>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/finetune-transformer-lm>`__ and a command-line interface that was used to convert the pre-trained NumPy checkpoint in PyTorch.
|
||||
|
||||
**Google/CMU's Transformer-XL** was released together with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <http://arxiv.org/abs/1901.02860>`_ by Zihang Dai\ *, Zhilin Yang*\ , Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
|
||||
**Google/CMU's Transformer-XL** was released together with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <http://arxiv.org/abs/1901.02860>`_ by Zihang Dai\*, Zhilin Yang\* , Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
|
||||
This PyTorch implementation of Transformer-XL is an adaptation of the original `PyTorch implementation <https://github.com/kimiyoung/transformer-xl>`_ which has been slightly modified to match the performances of the TensorFlow implementation and allow to re-use the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints in PyTorch models.
|
||||
|
||||
**OpenAI GPT-2** was released together with the paper `Language Models are Unsupervised Multitask Learners <https://blog.openai.com/better-language-models/>`_ by Alec Radford\ *, Jeffrey Wu*\ , Rewon Child, David Luan, Dario Amodei\ ** and Ilya Sutskever**.
|
||||
**OpenAI GPT-2** was released together with the paper `Language Models are Unsupervised Multitask Learners <https://blog.openai.com/better-language-models/>`_ by Alec Radford\*, Jeffrey Wu\* , Rewon Child, David Luan, Dario Amodei\*\* and Ilya Sutskever\*\*.
|
||||
This PyTorch implementation of OpenAI GPT-2 is an adaptation of the `OpenAI's implementation <https://github.com/openai/gpt-2>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`__ and a command-line interface that was used to convert the TensorFlow checkpoint in PyTorch.
|
||||
|
||||
**Facebook Research's XLM** was released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
|
||||
This PyTorch implementation of XLM is an adaptation of the original `PyTorch implementation <https://github.com/facebookresearch/XLM>`_. TODO Lysandre filled
|
||||
|
||||
**Google's XLNet** was released together with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang\*, Zihang Dai\*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
|
||||
This PyTorch implementation of XLNet is an adaptation of the `Tensorflow implementation <https://github.com/zihangdai/xlnet>`_. TODO Lysandre filled
|
||||
|
||||
|
||||
Content
|
||||
-------
|
||||
|
||||
@ -68,111 +74,180 @@ Content
|
||||
|
||||
* - Section
|
||||
- Description
|
||||
* - `Installation <#installation>`_
|
||||
* - `Installation <./installation.html>`_
|
||||
- How to install the package
|
||||
* - `Overview <#overview>`_
|
||||
- Overview of the package
|
||||
* - `Usage <#usage>`_
|
||||
* - `Philosophy <./philosophy.html>`_
|
||||
- The philosophy behind this package
|
||||
* - `Usage <./usage.html>`_
|
||||
- Quickstart examples
|
||||
* - `Doc <#doc>`_
|
||||
- Detailed documentation
|
||||
* - `Examples <#examples>`_
|
||||
* - `Examples <./examples.html>`_
|
||||
- Detailed examples on how to fine-tune Bert
|
||||
* - `Notebooks <#notebooks>`_
|
||||
* - `Notebooks <./notebooks.html>`_
|
||||
- Introduction on the provided Jupyter Notebooks
|
||||
* - `TPU <#tpu>`_
|
||||
* - `TPU <./tpu.html>`_
|
||||
- Notes on TPU support and pretraining scripts
|
||||
* - `Command-line interface <#Command-line-interface>`_
|
||||
* - `Command-line interface <./cli.html>`_
|
||||
- Convert a TensorFlow checkpoint in a PyTorch dump
|
||||
* - `Migration <./migration.html>`_
|
||||
- Migrating from ``pytorch_pretrained_BERT`` (v0.6) to ``pytorch_transformers`` (v1.0)
|
||||
* - `Bertology <./bertology.html>`_
|
||||
- TODO Lysandre didn't know how to fill
|
||||
* - `TorchScript <./torchscript.html>`_
|
||||
- Convert a model to TorchScript for use in other programming languages
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Section
|
||||
- Description
|
||||
* - `Overview <./model_doc/overview.html>`_
|
||||
- Overview of the package
|
||||
* - `BERT <./model_doc/bert.html>`_
|
||||
- BERT Models, Tokenizers and optimizers
|
||||
* - `OpenAI GPT <./model_doc/gpt.html>`_
|
||||
- GPT Models, Tokenizers and optimizers
|
||||
* - `TransformerXL <./model_doc/transformerxl.html>`_
|
||||
- TransformerXL Models, Tokenizers and optimizers
|
||||
* - `OpenAI GPT2 <./model_doc/gpt2.html>`_
|
||||
- GPT2 Models, Tokenizers and optimizers
|
||||
* - `XLM <./model_doc/xlm.html>`_
|
||||
- XLM Models, Tokenizers and optimizers
|
||||
* - `XLNet <./model_doc/xlnet.html>`_
|
||||
- XLNet Models, Tokenizers and optimizers
|
||||
|
||||
TODO Lysandre filled: might need an introduction for both parts. Is it even necessary, since there is a summary? Up to you Thom.
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
This package comprises the following classes that can be imported in Python and are detailed in the `Doc <#doc>`_ section of this readme:
|
||||
This package comprises the following classes that can be imported in Python and are detailed in the `documentation <./model_doc/overview.html>`_ section of this package:
|
||||
|
||||
|
||||
*
|
||||
Eight **Bert** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py>`_ file):
|
||||
Eight **Bert** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`_ file):
|
||||
|
||||
|
||||
* `BertModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L639>`_ - raw BERT Transformer model (\ **fully pre-trained**\ ),
|
||||
* `BertForMaskedLM <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L793>`_ - BERT Transformer with the pre-trained masked language modeling head on top (\ **fully pre-trained**\ ),
|
||||
* `BertForNextSentencePrediction <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L854>`_ - BERT Transformer with the pre-trained next sentence prediction classifier on top (\ **fully pre-trained**\ ),
|
||||
* `BertForPreTraining <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L722>`_ - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (\ **fully pre-trained**\ ),
|
||||
* `BertForSequenceClassification <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L916>`_ - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
|
||||
* `BertForMultipleChoice <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L982>`_ - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
|
||||
* `BertForTokenClassification <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L1051>`_ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ),
|
||||
* `BertForQuestionAnswering <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L1124>`_ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ).
|
||||
* `BertModel <./model_doc/bert.html#pytorch_transformers.BertModel>`_ - raw BERT Transformer model (\ **fully pre-trained**\ ),
|
||||
* `BertForMaskedLM <./model_doc/bert.html#pytorch_transformers.BertForMaskedLM>`_ - BERT Transformer with the pre-trained masked language modeling head on top (\ **fully pre-trained**\ ),
|
||||
* `BertForNextSentencePrediction <./model_doc/bert.html#pytorch_transformers.BertForNextSentencePrediction>`_ - BERT Transformer with the pre-trained next sentence prediction classifier on top (\ **fully pre-trained**\ ),
|
||||
* `BertForPreTraining <./model_doc/bert.html#pytorch_transformers.BertForPreTraining>`_ - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (\ **fully pre-trained**\ ),
|
||||
* `BertForSequenceClassification <./model_doc/bert.html#pytorch_transformers.BertForSequenceClassification>`_ - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
|
||||
* `BertForMultipleChoice <./model_doc/bert.html#pytorch_transformers.BertForMultipleChoice>`_ - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
|
||||
* `BertForTokenClassification <./model_doc/bert.html#pytorch_transformers.BertForTokenClassification>`_ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ),
|
||||
* `BertForQuestionAnswering <./model_doc/bert.html#pytorch_transformers.BertForQuestionAnswering>`_ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ).
|
||||
|
||||
*
|
||||
Three **OpenAI GPT** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py>`_ file):
|
||||
Three **OpenAI GPT** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`_ file):
|
||||
|
||||
|
||||
* `OpenAIGPTModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py#L536>`_ - raw OpenAI GPT Transformer model (\ **fully pre-trained**\ ),
|
||||
* `OpenAIGPTLMHeadModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py#L643>`_ - OpenAI GPT Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
|
||||
* `OpenAIGPTDoubleHeadsModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py#L722>`_ - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
|
||||
* `OpenAIGPTModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTModel>`_ - raw OpenAI GPT Transformer model (\ **fully pre-trained**\ ),
|
||||
* `OpenAIGPTLMHeadModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTLMHeadModel>`_ - OpenAI GPT Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
|
||||
* `OpenAIGPTDoubleHeadsModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTDoubleHeadsModel>`_ - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
|
||||
|
||||
*
|
||||
Two **Transformer-XL** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py>`_ file):
|
||||
Two **Transformer-XL** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_transfo_xl.py <./_modules/pytorch_transformers/modeling_transfo_xl.html>`_ file):
|
||||
|
||||
|
||||
* `TransfoXLModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py#L983>`_ - Transformer-XL model which outputs the last hidden state and memory cells (\ **fully pre-trained**\ ),
|
||||
* `TransfoXLLMHeadModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py#L1260>`_ - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (\ **fully pre-trained**\ ),
|
||||
* `TransfoXLModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLModel>`_ - Transformer-XL model which outputs the last hidden state and memory cells (\ **fully pre-trained**\ ),
|
||||
* `TransfoXLLMHeadModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLLMHeadModel>`_ - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (\ **fully pre-trained**\ ),
|
||||
|
||||
*
|
||||
Three **OpenAI GPT-2** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py>`_ file):
|
||||
Three **OpenAI GPT-2** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_gpt2.py <./_modules/pytorch_transformers/modeling_gpt2.html>`_ file):
|
||||
|
||||
|
||||
* `GPT2Model <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py#L479>`_ - raw OpenAI GPT-2 Transformer model (\ **fully pre-trained**\ ),
|
||||
* `GPT2LMHeadModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py#L559>`_ - OpenAI GPT-2 Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
|
||||
* `GPT2DoubleHeadsModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py#L624>`_ - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
|
||||
* `GPT2Model <./model_doc/gpt2.html#pytorch_transformers.GPT2Model>`_ - raw OpenAI GPT-2 Transformer model (\ **fully pre-trained**\ ),
|
||||
* `GPT2LMHeadModel <./model_doc/gpt2.html#pytorch_transformers.GPT2LMHeadModel>`_ - OpenAI GPT-2 Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
|
||||
* `GPT2DoubleHeadsModel <./model_doc/gpt2.html#pytorch_transformers.GPT2DoubleHeadsModel>`_ - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
|
||||
|
||||
*
|
||||
Tokenizers for **BERT** (using word-piece) (in the `tokenization.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization.py>`_ file):
|
||||
Four **XLM** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`_ file):
|
||||
|
||||
|
||||
* `XLMModel <./model_doc/xlm.html#pytorch_transformers.XLMModel>`_ - raw XLM Transformer model (\ **fully pre-trained**\ ),
|
||||
* `XLMWithLMHeadModel <./model_doc/xlm.html#pytorch_transformers.XLMWithLMHeadModel>`_ - XLM Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
|
||||
* `XLMForSequenceClassification <./model_doc/xlm.html#pytorch_transformers.XLMForSequenceClassification>`_ - XLM Transformer with a sequence classification head on top (XLM Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
|
||||
* `XLMForQuestionAnswering <./model_doc/xlm.html#pytorch_transformers.XLMForQuestionAnswering>`_ - XLM Transformer with a token classification head on top (XLM Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ )
|
||||
|
||||
*
|
||||
Four **XLNet** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`_ file):
|
||||
|
||||
|
||||
* `XLNetModel <./model_doc/xlnet.html#pytorch_transformers.XLNetModel>`_ - raw XLNet Transformer model (\ **fully pre-trained**\ ),
|
||||
* `XLNetLMHeadModel <./model_doc/xlnet.html#pytorch_transformers.XLNetLMHeadModel>`_ - XLNet Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
|
||||
* `XLNetForSequenceClassification <./model_doc/xlnet.html#pytorch_transformers.XLNetForSequenceClassification>`_ - XLNet Transformer with a sequence classification head on top (XLM Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
|
||||
* `XLNetForQuestionAnswering <./model_doc/xlnet.html#pytorch_transformers.XLNetForQuestionAnswering>`_ - XLNet Transformer with a token classification head on top (XLNet Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ )
|
||||
|
||||
|
||||
TODO Lysandre filled: I filled in XLM and XLNet. I didn't do the Tokenizers because I don't know the current philosophy behind them.
|
||||
|
||||
*
|
||||
Tokenizers for **BERT** (using word-piece) (in the `tokenization_bert.py <./_modules/pytorch_transformers/tokenization_bert.html>`_ file):
|
||||
|
||||
* ``BasicTokenizer`` - basic tokenization (punctuation splitting, lower casing, etc.),
|
||||
* ``WordpieceTokenizer`` - WordPiece tokenization,
|
||||
* ``BertTokenizer`` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
|
||||
|
||||
*
|
||||
Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the `tokenization_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization_openai.py>`_ file):
|
||||
|
||||
*
|
||||
Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the `tokenization_openai.py <./_modules/pytorch_transformers/tokenization_openai.html>`_ file):
|
||||
|
||||
* ``OpenAIGPTTokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization.
|
||||
|
||||
*
|
||||
Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the `tokenization_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization_transfo_xl.py>`_ file):
|
||||
|
||||
|
||||
* ``TransfoXLTokenizer`` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
|
||||
|
||||
*
|
||||
Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the `tokenization_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization_gpt2.py>`_ file):
|
||||
|
||||
Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the `tokenization_gpt2.py <./_modules/pytorch_transformers/tokenization_gpt2.html>`_ file):
|
||||
|
||||
* ``GPT2Tokenizer`` - perform byte-level Byte-Pair-Encoding (BPE) tokenization.
|
||||
|
||||
|
||||
*
|
||||
Optimizer for **BERT** (in the `optimization.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/pytorch_pretrained_bert/optimization.py>`_ file):
|
||||
Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the `tokenization_transfo_xl.py <./_modules/pytorch_transformers/tokenization_transfo_xl.html>`_ file):
|
||||
|
||||
* ``TransfoXLTokenizer`` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
|
||||
|
||||
|
||||
*
|
||||
Tokenizer for **XLNet** (SentencePiece based tokenizer) (in the `tokenization_xlnet.py <./_modules/pytorch_transformers/tokenization_xlnet.html>`_ file):
|
||||
|
||||
* ``XLNetTokenizer`` - perform SentencePiece tokenization.
|
||||
|
||||
|
||||
*
|
||||
Tokenizer for **XLM** (using Byte-Pair-Encoding) (in the `tokenization_xlm.py <./_modules/pytorch_transformers/tokenization_xlm.html>`_ file):
|
||||
|
||||
* ``XLMTokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization.
|
||||
|
||||
|
||||
*
|
||||
Optimizer for **BERT** (in the `optimization.py <./_modules/pytorch_transformers/optimization.html>`_ file):
|
||||
|
||||
|
||||
* ``BertAdam`` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
|
||||
|
||||
|
||||
*
|
||||
Optimizer for **OpenAI GPT** (in the `optimization_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/optimization_openai.py>`_ file):
|
||||
Optimizer for **OpenAI GPT** (in the `optimization_openai.py <./_modules/pytorch_transformers/optimization_openai.html>`_ file):
|
||||
|
||||
|
||||
* ``OpenAIAdam`` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
|
||||
|
||||
|
||||
*
|
||||
Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective `modeling.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py>`_\ , `modeling_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py>`_\ , `modeling_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py>`_ files):
|
||||
Configuration classes for BERT, OpenAI GPT, Transformer-XL, XLM and XLNet (in the respective \
|
||||
`modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`_\ , \
|
||||
`modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`_\ , \
|
||||
`modeling_transfo_xl.py <./_modules/pytorch_transformers/modeling_transfo_xl.html>`_, \
|
||||
`modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`_, \
|
||||
`modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`_ \
|
||||
files):
|
||||
|
||||
|
||||
* ``BertConfig`` - Configuration class to store the configuration of a ``BertModel`` with utilities to read and write from JSON configuration files.
|
||||
* ``OpenAIGPTConfig`` - Configuration class to store the configuration of a ``OpenAIGPTModel`` with utilities to read and write from JSON configuration files.
|
||||
* ``GPT2Config`` - Configuration class to store the configuration of a ``GPT2Model`` with utilities to read and write from JSON configuration files.
|
||||
* ``TransfoXLConfig`` - Configuration class to store the configuration of a ``TransfoXLModel`` with utilities to read and write from JSON configuration files.
|
||||
* ``XLMConfig`` - Configuration class to store the configuration of a ``XLMModel`` with utilities to read and write from JSON configuration files.
|
||||
* ``XLNetConfig`` - Configuration class to store the configuration of a ``XLNetModel`` with utilities to read and write from JSON configuration files.
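For instance, reading and writing such a configuration as JSON might look like the following sketch (the file name and the extra attribute are placeholders added for illustration):

.. code-block:: python

    from pytorch_transformers import BertConfig

    # Read a configuration from a JSON file and tweak a field.
    config = BertConfig.from_json_file("bert_config.json")
    config.num_labels = 2  # hypothetical attribute set for illustration

    # Serialize it back to JSON.
    with open("bert_config_copy.json", "w") as f:
        f.write(config.to_json_string())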
|
||||
|
||||
The repository further comprises:
|
||||
|
||||
|
@ -21,56 +21,56 @@ BERT
|
||||
.. autoclass:: pytorch_transformers.BertAdam
|
||||
:members:
|
||||
|
||||
1. ``BertModel``
|
||||
``BertModel``
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.BertModel
|
||||
:members:
|
||||
|
||||
|
||||
2. ``BertForPreTraining``
|
||||
``BertForPreTraining``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.BertForPreTraining
|
||||
:members:
|
||||
|
||||
|
||||
3. ``BertForMaskedLM``
|
||||
``BertForMaskedLM``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.BertForMaskedLM
|
||||
:members:
|
||||
|
||||
|
||||
4. ``BertForNextSentencePrediction``
|
||||
``BertForNextSentencePrediction``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.BertForNextSentencePrediction
|
||||
:members:
|
||||
|
||||
|
||||
5. ``BertForSequenceClassification``
|
||||
``BertForSequenceClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.BertForSequenceClassification
|
||||
:members:
|
||||
|
||||
|
||||
6. ``BertForMultipleChoice``
|
||||
``BertForMultipleChoice``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.BertForMultipleChoice
|
||||
:members:
|
||||
|
||||
|
||||
7. ``BertForTokenClassification``
|
||||
``BertForTokenClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.BertForTokenClassification
|
||||
:members:
|
||||
|
||||
|
||||
8. ``BertForQuestionAnswering``
|
||||
``BertForQuestionAnswering``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.BertForQuestionAnswering
|
||||
|
@ -22,21 +22,21 @@ OpenAI GPT
|
||||
:members:
|
||||
|
||||
|
||||
9. ``OpenAIGPTModel``
|
||||
``OpenAIGPTModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.OpenAIGPTModel
|
||||
:members:
|
||||
|
||||
|
||||
10. ``OpenAIGPTLMHeadModel``
|
||||
``OpenAIGPTLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.OpenAIGPTLMHeadModel
|
||||
:members:
|
||||
|
||||
|
||||
11. ``OpenAIGPTDoubleHeadsModel``
|
||||
``OpenAIGPTDoubleHeadsModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.OpenAIGPTDoubleHeadsModel
|
||||
|
@ -15,21 +15,21 @@ OpenAI GPT2
|
||||
:members:
|
||||
|
||||
|
||||
14. ``GPT2Model``
|
||||
``GPT2Model``
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.GPT2Model
|
||||
:members:
|
||||
|
||||
|
||||
15. ``GPT2LMHeadModel``
|
||||
``GPT2LMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.GPT2LMHeadModel
|
||||
:members:
|
||||
|
||||
|
||||
16. ``GPT2DoubleHeadsModel``
|
||||
``GPT2DoubleHeadsModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.GPT2DoubleHeadsModel
|
||||
|
@ -14,13 +14,8 @@ Here is a detailed documentation of the classes in the package and how to use th
|
||||
* - `Serialization best-practices <#serialization-best-practices>`__
|
||||
- How to save and reload a fine-tuned model
|
||||
* - `Configurations <#configurations>`__
|
||||
- API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL
|
||||
* - `Models <#models>`__
|
||||
- API of the PyTorch model classes for BERT, GPT, GPT-2 and Transformer-XL
|
||||
* - `Tokenizers <#tokenizers>`__
|
||||
- API of the tokenizers class for BERT, GPT, GPT-2 and Transformer-XL
|
||||
* - `Optimizers <#optimizers>`__
|
||||
- API of the optimizers
|
||||
|
||||
TODO Lysandre filled: Removed Models/Tokenizers/Optimizers as no single link can be made.
|
||||
|
||||
|
||||
Configurations
|
||||
@ -245,7 +240,7 @@ An overview of the implemented schedules:
|
||||
|
||||
|
||||
* ``ConstantLR``\ : always returns learning rate 1.
|
||||
* ``WarmupConstantSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
|
||||
* ``WarmupConstantSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
|
||||
Keeps learning rate equal to 1. after warmup.
|
||||
|
||||
.. image:: /imgs/warmup_constant_schedule.png
|
||||
@ -253,7 +248,7 @@ An overview of the implemented schedules:
|
||||
:alt:
|
||||
|
||||
|
||||
* ``WarmupLinearSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
|
||||
* ``WarmupLinearSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
|
||||
Linearly decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps.
|
||||
|
||||
.. image:: /imgs/warmup_linear_schedule.png
|
||||
@ -261,9 +256,9 @@ An overview of the implemented schedules:
|
||||
:alt:
|
||||
|
||||
|
||||
* ``WarmupCosineSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
|
||||
Decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps following a cosine curve.
|
||||
If ``cycles`` (default=0.5) is different from default, learning rate follows cosine function after warmup.
|
||||
* ``WarmupCosineSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
|
||||
Decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps following a cosine curve. \
|
||||
If ``cycles`` (default=0.5) is different from default, learning rate follows cosine function after warmup.
|
||||
|
||||
.. image:: /imgs/warmup_cosine_schedule.png
|
||||
:target: /imgs/warmup_cosine_schedule.png
|
||||
@ -271,7 +266,7 @@ An overview of the implemented schedules:
|
||||
|
||||
|
||||
* ``WarmupCosineWithHardRestartsSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
|
||||
If ``cycles`` (default=1.) is different from default, learning rate follows ``cycles`` times a cosine decaying learning rate (with hard restarts).
|
||||
If ``cycles`` (default=1.) is different from default, learning rate follows ``cycles`` times a cosine decaying learning rate (with hard restarts).
|
||||
|
||||
.. image:: /imgs/warmup_cosine_hard_restarts_schedule.png
|
||||
:target: /imgs/warmup_cosine_hard_restarts_schedule.png
|
||||
@ -279,9 +274,9 @@ An overview of the implemented schedules:
|
||||
|
||||
|
||||
* ``WarmupCosineWithWarmupRestartsSchedule`` : All training progress is divided in ``cycles`` (default=1.) parts of equal length.
|
||||
Every part follows a schedule with the first ``warmup`` fraction of the training steps linearly increasing from 0. to 1.,
|
||||
followed by a learning rate decreasing from 1. to 0. following a cosine curve.
|
||||
Note that the total number of all warmup steps over all cycles together is equal to ``warmup`` * ``cycles``
|
||||
Every part follows a schedule with the first ``warmup`` fraction of the training steps linearly increasing from 0. to 1.,
|
||||
followed by a learning rate decreasing from 1. to 0. following a cosine curve.
|
||||
Note that the total number of all warmup steps over all cycles together is equal to ``warmup`` * ``cycles``
|
||||
|
||||
.. image:: /imgs/warmup_cosine_warm_restarts_schedule.png
|
||||
:target: /imgs/warmup_cosine_warm_restarts_schedule.png
|
||||
|
@ -16,14 +16,14 @@ Transformer XL
|
||||
:members:
|
||||
|
||||
|
||||
12. ``TransfoXLModel``
|
||||
``TransfoXLModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TransfoXLModel
|
||||
:members:
|
||||
|
||||
|
||||
13. ``TransfoXLLMHeadModel``
|
||||
``TransfoXLLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TransfoXLLMHeadModel
|
||||
|
@ -7,29 +7,34 @@ XLM
|
||||
.. autoclass:: pytorch_transformers.XLMConfig
|
||||
:members:
|
||||
|
||||
``XLMTokenizer``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
17. ``XLMModel``
|
||||
.. autoclass:: pytorch_transformers.XLMTokenizer
|
||||
:members:
|
||||
|
||||
``XLMModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.XLMModel
|
||||
:members:
|
||||
|
||||
|
||||
18. ``XLMWithLMHeadModel``
|
||||
``XLMWithLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.XLMWithLMHeadModel
|
||||
:members:
|
||||
|
||||
|
||||
19. ``XLMForSequenceClassification``
|
||||
``XLMForSequenceClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.XLMForSequenceClassification
|
||||
:members:
|
||||
|
||||
|
||||
20. ``XLMForQuestionAnswering``
|
||||
``XLMForQuestionAnswering``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.XLMForQuestionAnswering
|
||||
|
@ -8,28 +8,35 @@ XLNet
|
||||
:members:
|
||||
|
||||
|
||||
21. ``XLNetModel``
|
||||
``XLNetTokenizer``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.XLNetTokenizer
|
||||
:members:
|
||||
|
||||
|
||||
``XLNetModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.XLNetModel
|
||||
:members:
|
||||
|
||||
|
||||
22. ``XLNetLMHeadModel``
|
||||
``XLNetLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.XLNetLMHeadModel
|
||||
:members:
|
||||
|
||||
|
||||
23. ``XLNetForSequenceClassification``
|
||||
``XLNetForSequenceClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.XLNetForSequenceClassification
|
||||
:members:
|
||||
|
||||
|
||||
24. ``XLNetForQuestionAnswering``
|
||||
``XLNetForQuestionAnswering``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.XLNetForQuestionAnswering
|
||||
|
@ -5,12 +5,12 @@ We include `three Jupyter Notebooks <https://github.com/huggingface/pytorch-pret
|
||||
|
||||
|
||||
*
|
||||
The first NoteBook (\ `Comparing-TF-and-PT-models.ipynb <./notebooks/Comparing-TF-and-PT-models.ipynb>`_\ ) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.
The first NoteBook (\ `Comparing-TF-and-PT-models.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks/Comparing-TF-and-PT-models.ipynb>`_\ ) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.
|
||||
|
||||
*
|
||||
The second NoteBook (\ `Comparing-TF-and-PT-models-SQuAD.ipynb <./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_\ ) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the ``BertForQuestionAnswering`` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
|
||||
The second NoteBook (\ `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_\ ) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the ``BertForQuestionAnswering`` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
|
||||
|
||||
*
|
||||
The third NoteBook (\ `Comparing-TF-and-PT-models-MLM-NSP.ipynb <./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_\ ) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
|
||||
The third NoteBook (\ `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_\ ) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
|
||||
|
||||
Please follow the instructions given in the notebooks to run and modify them.
|
||||
|
@ -1,13 +0,0 @@
|
||||
TPU
|
||||
================================================
|
||||
|
||||
TPU support and pretraining scripts
|
||||
------------------------------------------------
|
||||
|
||||
TPU are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPU and is expected to be released soon (see the recent `official announcement <https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud>`_\ ).
|
||||
|
||||
We will add TPU support when this next release is published.
|
||||
|
||||
The original TensorFlow code further comprises two scripts for pre-training BERT: `create_pretraining_data.py <https://github.com/google-research/bert/blob/master/create_pretraining_data.py>`_ and `run_pretraining.py <https://github.com/google-research/bert/blob/master/run_pretraining.py>`_.
|
||||
|
||||
Since, pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amout of time (see details `here <https://github.com/google-research/bert#pre-training-with-bert>`_\ ) we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.
|
@ -4,7 +4,7 @@ Usage
|
||||
BERT
|
||||
^^^^
|
||||
|
||||
Here is a quick-start example using ``BertTokenizer``\ , ``BertModel`` and ``BertForMaskedLM`` class with Google AI's pre-trained ``Bert base uncased`` model. See the `doc section <#doc>`_ below for all the details on these classes.
|
||||
Here is a quick-start example using ``BertTokenizer``\ , ``BertModel`` and ``BertForMaskedLM`` class with Google AI's pre-trained ``Bert base uncased`` model. See the `doc section <./model_doc/overview.html>`_ below for all the details on these classes.
|
||||
|
||||
First let's prepare a tokenized input with ``BertTokenizer``
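For orientation, a minimal sketch of such an input preparation (the full example in this documentation may differ in its exact sentences and tensors):

.. code-block:: python

    import torch
    from pytorch_transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Tokenize a sentence pair with the [CLS]/[SEP] markers BERT expects.
    text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    # Sentence A (up to and including the first [SEP]) gets segment id 0, the rest 1.
    sep_index = tokenized_text.index('[SEP]')
    segments_ids = [0] * (sep_index + 1) + [1] * (len(tokenized_text) - sep_index - 1)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])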
|
||||
|
||||
@ -82,7 +82,7 @@ And how to use ``BertForMaskedLM``
|
||||
OpenAI GPT
|
||||
^^^^^^^^^^
|
||||
|
||||
Here is a quick-start example using ``OpenAIGPTTokenizer``\ , ``OpenAIGPTModel`` and ``OpenAIGPTLMHeadModel`` class with OpenAI's pre-trained model. See the `doc section <#doc>`_ below for all the details on these classes.
|
||||
Here is a quick-start example using ``OpenAIGPTTokenizer``\ , ``OpenAIGPTModel`` and ``OpenAIGPTLMHeadModel`` class with OpenAI's pre-trained model. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes.
|
||||
|
||||
First let's prepare a tokenized input with ``OpenAIGPTTokenizer``
|
||||
|
||||
@ -170,7 +170,7 @@ And how to use ``OpenAIGPTDoubleHeadsModel``
|
||||
Transformer-XL
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
Here is a quick-start example using ``TransfoXLTokenizer``\ , ``TransfoXLModel`` and ``TransfoXLLMHeadModel`` class with the Transformer-XL model pre-trained on WikiText-103. See the `doc section <#doc>`_ below for all the details on these classes.
Here is a quick-start example using ``TransfoXLTokenizer``\ , ``TransfoXLModel`` and ``TransfoXLLMHeadModel`` class with the Transformer-XL model pre-trained on WikiText-103. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes.
|
||||
|
||||
First let's prepare a tokenized input with ``TransfoXLTokenizer``
|
||||
|
||||
@ -246,7 +246,7 @@ And how to use ``TransfoXLLMHeadModel``
|
||||
OpenAI GPT-2
|
||||
^^^^^^^^^^^^
|
||||
|
||||
Here is a quick-start example using ``GPT2Tokenizer``\ , ``GPT2Model`` and ``GPT2LMHeadModel`` class with OpenAI's pre-trained model. See the `doc section <#doc>`_ below for all the details on these classes.
|
||||
Here is a quick-start example using ``GPT2Tokenizer``\ , ``GPT2Model`` and ``GPT2LMHeadModel`` class with OpenAI's pre-trained model. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes.
|
||||
|
||||
First let's prepare a tokenized input with ``GPT2Tokenizer``
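And as a similar orientation-only sketch for GPT-2 (the tuple ordering of the model output is assumed to put the language-modeling logits first):

.. code-block:: python

    import torch
    from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    model.eval()

    # Encode a prompt and predict the most likely next token.
    indexed_tokens = tokenizer.encode("Who was Jim Henson ? Jim Henson was a")
    tokens_tensor = torch.tensor([indexed_tokens])

    with torch.no_grad():
        outputs = model(tokens_tensor)

    predictions = outputs[0]                      # language-modeling logits
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    print(tokenizer.decode([predicted_index]))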
|
||||
|
||||
|
@ -1274,20 +1274,20 @@ class BertForQuestionAnswering(BertPreTrainedModel):
|
||||
Performs a model forward pass. **Invoke it by calling the instantiated class directly.**
|
||||
|
||||
Parameters:
|
||||
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
|
||||
`input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
|
||||
with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts
|
||||
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
|
||||
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
|
||||
`token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
|
||||
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
|
||||
a `sentence B` token (see BERT paper for more details).
|
||||
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
|
||||
`attention_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
|
||||
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
|
||||
input sequence length in the current batch. It's the mask that we typically use for attention when
|
||||
a batch has varying length sentences.
|
||||
`start_positions`: position of the first token for the labeled span: torch.LongTensor of shape [batch_size].
|
||||
`start_positions`: position of the first token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
|
||||
Positions are clamped to the length of the sequence and position outside of the sequence are not taken
|
||||
into account for computing the loss.
|
||||
`end_positions`: position of the last token for the labeled span: torch.LongTensor of shape [batch_size].
|
||||
`end_positions`: position of the last token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
|
||||
Positions are clamped to the length of the sequence and position outside of the sequence are not taken
|
||||
into account for computing the loss.
|
||||
`head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
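# Illustrative usage only (an editorial sketch, not part of the original file):
# with the parameters documented above, a forward pass could look like the
# following, assuming `model` is an instantiated BertForQuestionAnswering and
# the tensors follow the documented shapes.
#
#   input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])      # [batch_size, sequence_length]
#   token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
#   attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
#   start_positions = torch.LongTensor([1, 0])                    # [batch_size]
#   end_positions = torch.LongTensor([2, 1])
#
#   loss = model(input_ids, token_type_ids=token_type_ids,
#                attention_mask=attention_mask,
#                start_positions=start_positions, end_positions=end_positions)
#   # Without start/end positions the call typically returns the start and end
#   # logits instead of the loss.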
|
||||
|
@ -77,10 +77,15 @@ def text_standardize(text):
|
||||
class XLMTokenizer(PreTrainedTokenizer):
|
||||
"""
|
||||
BPE tokenizer for XLM, adapted from OpenAI BPE tokenizer. Peculiarities:
|
||||
|
||||
- lower case all inputs
|
||||
- uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.
|
||||
- argument special_tokens and function set_special_tokens:
|
||||
can be used to add additional symbols (ex: "__classify__") to a vocabulary.
|
||||
|
||||
- uses `SpaCy tokenizer <https://spacy.io/api/tokenizer/>`_ and \
|
||||
`ftfy <https://ftfy.readthedocs.io/en/latest/>`_ for pre-BPE tokenization if they are installed, \
|
||||
fallback to BERT's BasicTokenizer if not.
|
||||
|
||||
- argument ``special_tokens`` and function ``set_special_tokens``, can be used to add additional symbols \
|
||||
(ex: "__classify__") to a vocabulary.
|
||||
"""
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||
|
@ -52,7 +52,8 @@ SEG_ID_PAD = 4
|
||||
class XLNetTokenizer(PreTrainedTokenizer):
|
||||
"""
|
||||
SentencePiece based tokenizer. Peculiarities:
|
||||
- requires SentencePiece: https://github.com/google/sentencepiece
|
||||
|
||||
- requires `SentencePiece <https://github.com/google/sentencepiece>`_
|
||||
"""
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||
|