Patched warnings + Refactored XLNet's Docstrings

2025-07-31 02:02:21 +06:00 · 2019-07-09 16:38:30 -04:00 · 2019-07-09 16:38:30 -04:00 · 83fb311ef7
commit 83fb311ef7
parent 8fe2c9d98e
7 changed files with 268 additions and 229 deletions
--- a/docs/source/examples.rst
+++ b/docs/source/examples.rst
@ -29,9 +29,9 @@ Here is how to use these techniques in our scripts:
 * **Gradient Accumulation**\ : Gradient accumulation can be used by supplying a integer greater than 1 to the ``--gradient_accumulation_steps`` argument. The batch at each step will be divided by this integer and gradient will be accumulated over ``gradient_accumulation_steps`` steps.
 * **Multi-GPU**\ : Multi-GPU is automatically activated when several GPUs are detected and the batches are splitted over the GPUs.
 * **Distributed training**\ : Distributed training can be activated by supplying an integer greater or equal to 0 to the ``--local_rank`` argument (see below).
-* **16-bits training**\ : 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing to double the batch size. If you have a recent GPU (starting from NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to Mixed precision training can be found `here <https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/>`_ and a full documentation is `here <https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`_. In our scripts, this option can be activated by setting the ``--fp16`` flag and you can play with loss scaling using the ``--loss_scale`` flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero in which case the scale is dynamically adjusted or a positive power of two in which case the scaling is static.
+* **16-bits training**\ : 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing to double the batch size. If you have a recent GPU (starting from NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to Mixed precision training can be found `here <https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/>`__ and a full documentation is `here <https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__. In our scripts, this option can be activated by setting the ``--fp16`` flag and you can play with loss scaling using the ``--loss_scale`` flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero in which case the scale is dynamically adjusted or a positive power of two in which case the scaling is static.

-To use 16-bits training and distributed training, you need to install NVIDIA's apex extension `as detailed here <https://github.com/nvidia/apex>`_. You will find more information regarding the internals of ``apex`` and how to use ``apex`` in `the doc and the associated repository <https://github.com/nvidia/apex>`_. The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in `the relevant PR of the present repository <https://github.com/huggingface/pytorch-pretrained-BERT/pull/116>`_.
+To use 16-bits training and distributed training, you need to install NVIDIA's apex extension `as detailed here <https://github.com/nvidia/apex>`__. You will find more information regarding the internals of ``apex`` and how to use ``apex`` in `the doc and the associated repository <https://github.com/nvidia/apex>`_. The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in `the relevant PR of the present repository <https://github.com/huggingface/pytorch-pretrained-BERT/pull/116>`_.

 Note: To use *Distributed Training*\ , you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see `the above mentioned blog post <(https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_\ ) for more details):

@ -153,10 +153,10 @@ and unpack it to some directory ``$GLUE_DIR``.
     --num_train_epochs 3.0 \
     --output_dir /tmp/mrpc_output/

-Our test ran on a few seeds with `the original implementation hyper-parameters <https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks>`_ gave evaluation results between 84% and 88%.
+Our test ran on a few seeds with `the original implementation hyper-parameters <https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks>`__ gave evaluation results between 84% and 88%.

 **Fast run with apex and 16 bit precision: fine-tuning on MRPC in 27 seconds!**
-First install apex as indicated `here <https://github.com/NVIDIA/apex>`_.
+First install apex as indicated `here <https://github.com/NVIDIA/apex>`__.
 Then run

 .. code-block:: shell
@ -520,7 +520,7 @@ and unpack it to some directory ``$GLUE_DIR``.
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/

-Our test ran on a few seeds with `the original implementation hyper-parameters <https://github.com/zihangdai/xlnet#1-sts-b-sentence-pair-relevance-regression-with-gpus>`_ gave evaluation results between 84% and 88%.
+Our test ran on a few seeds with `the original implementation hyper-parameters <https://github.com/zihangdai/xlnet#1-sts-b-sentence-pair-relevance-regression-with-gpus>`__ gave evaluation results between 84% and 88%.

 **Distributed training**
 Here is an example using distributed training on 8 V100 GPUs to reach XXXX:
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@ -52,13 +52,13 @@ Here are some information on these models:
 This PyTorch implementation of BERT is provided with `Google's pre-trained models <https://github.com/google-research/bert>`_\ , examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT is also provided.

 **OpenAI GPT** was released together with the paper `Improving Language Understanding by Generative Pre-Training <https://blog.openai.com/language-unsupervised/>`_ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
-This PyTorch implementation of OpenAI GPT is an adaptation of the `PyTorch implementation by HuggingFace <https://github.com/huggingface/pytorch-openai-transformer-lm>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/finetune-transformer-lm>`_ and a command-line interface that was used to convert the pre-trained NumPy checkpoint in PyTorch.
+This PyTorch implementation of OpenAI GPT is an adaptation of the `PyTorch implementation by HuggingFace <https://github.com/huggingface/pytorch-openai-transformer-lm>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/finetune-transformer-lm>`__ and a command-line interface that was used to convert the pre-trained NumPy checkpoint in PyTorch.

 **Google/CMU's Transformer-XL** was released together with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <http://arxiv.org/abs/1901.02860>`_ by Zihang Dai\ *, Zhilin Yang*\ , Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
 This PyTorch implementation of Transformer-XL is an adaptation of the original `PyTorch implementation <https://github.com/kimiyoung/transformer-xl>`_ which has been slightly modified to match the performances of the TensorFlow implementation and allow to re-use the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints in PyTorch models.

 **OpenAI GPT-2** was released together with the paper `Language Models are Unsupervised Multitask Learners <https://blog.openai.com/better-language-models/>`_ by Alec Radford\ *, Jeffrey Wu*\ , Rewon Child, David Luan, Dario Amodei\ ** and Ilya Sutskever**.
-This PyTorch implementation of OpenAI GPT-2 is an adaptation of the `OpenAI's implementation <https://github.com/openai/gpt-2>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`_ and a command-line interface that was used to convert the TensorFlow checkpoint in PyTorch.
+This PyTorch implementation of OpenAI GPT-2 is an adaptation of the `OpenAI's implementation <https://github.com/openai/gpt-2>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`__ and a command-line interface that was used to convert the TensorFlow checkpoint in PyTorch.

 Content
 -------
--- a/docs/source/model_doc/overview.rst
+++ b/docs/source/model_doc/overview.rst
@ -9,17 +9,17 @@ Here is a detailed documentation of the classes in the package and how to use th

   * - Sub-section
     - Description
-   * - `Loading pre-trained weights <#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump>`_
+   * - `Loading pre-trained weights <#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump>`__
     - How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance
-   * - `Serialization best-practices <#serialization-best-practices>`_
+   * - `Serialization best-practices <#serialization-best-practices>`__
     - How to save and reload a fine-tuned model
-   * - `Configurations <#configurations>`_
+   * - `Configurations <#configurations>`__
     - API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL
-   * - `Models <#models>`_
+   * - `Models <#models>`__
     - API of the PyTorch model classes for BERT, GPT, GPT-2 and Transformer-XL
-   * - `Tokenizers <#tokenizers>`_
+   * - `Tokenizers <#tokenizers>`__
     - API of the tokenizers class for BERT, GPT, GPT-2 and Transformer-XL
-   * - `Optimizers <#optimizers>`_
+   * - `Optimizers <#optimizers>`__
     - API of the optimizers


@ -77,7 +77,7 @@ where
    * ``bert-base-multilingual-uncased``: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    * ``bert-base-multilingual-cased``: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    * ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`_
+    * ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`__
    * ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
    * ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
    * ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
@ -93,7 +93,7 @@ where
    * ``bert_config.json`` or ``openai_gpt_config.json`` a configuration file for the model, and
    * ``pytorch_model.bin`` a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ )

-  If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/pytorch_pretrained_bert/modeling.py>`_\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).
+  If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/pytorch_pretrained_bert/modeling.py>`__\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).

 *
  ``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).
@ -102,7 +102,7 @@ where
 * ``state_dict``\ : an optional state dictionnary (collections.OrderedDict object) to use instead of Google pre-trained models
 * ``*inputs``\ , `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)

-``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`_ or the original TensorFlow repository.
+``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.

 When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).

@ -152,7 +152,7 @@ This section explain how you can save and re-load a fine-tuned model (BERT, GPT,
 There are three types of files you need to save to be able to reload a fine-tuned model:


-* the model it-self which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`_\ ,
+* the model it-self which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
 * the configuration file of the model which is saved as a JSON file, and
 * the vocabulary (and the merges for the BPE-based models GPT and GPT-2).

--- a/docs/source/model_doc/xlm.rst
+++ b/docs/source/model_doc/xlm.rst
@ -2,35 +2,35 @@ XLM
 ----------------------------------------------------

 ``XLMConfig``
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TransfoXLConfig
+.. autoclass:: pytorch_transformers.XLMConfig
    :members:


 17. ``XLMModel``
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: pytorch_transformers.XLMModel
    :members:


 18. ``XLMWithLMHeadModel``
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: pytorch_transformers.XLMWithLMHeadModel
    :members:


 19. ``XLMForSequenceClassification``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: pytorch_transformers.XLMForSequenceClassification
    :members:


 20. ``XLMForQuestionAnswering``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: pytorch_transformers.XLMForQuestionAnswering
    :members:
--- a/docs/source/model_doc/xlnet.rst
+++ b/docs/source/model_doc/xlnet.rst
@ -1,4 +1,36 @@
 XLNet
 ----------------------------------------------------

-I don't really know what to put here, I'll leave it up to you to decide @Thom
+``XLNetConfig``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.XLNetConfig
+    :members:
+
+
+21. ``XLNetModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.XLNetModel
+    :members:
+
+
+22. ``XLNetLMHeadModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.XLNetLMHeadModel
+    :members:
+
+
+23. ``XLNetForSequenceClassification``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.XLNetForSequenceClassification
+    :members:
+
+
+24. ``XLNetForQuestionAnswering``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.XLNetForQuestionAnswering
+    :members:
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@ -729,7 +729,9 @@ class BertModel(BertPreTrainedModel):
 class BertForPreTraining(BertPreTrainedModel):
    """BERT model with pre-training heads.
    This module comprises the BERT model followed by the two pre-training heads:
+
        - the masked language modeling head, and
+
        - the next sentence classification head.

    Args:
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@ -192,7 +192,48 @@ ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish}


 class XLNetConfig(PretrainedConfig):
-    """Configuration class to store the configuration of a `XLNetModel`.
+    """Configuration class to store the configuration of a ``XLNetModel``.
+
+    Args:
+        vocab_size_or_config_json_file: Vocabulary size of ``inputs_ids`` in ``XLNetModel``.
+        d_model: Size of the encoder layers and the pooler layer.
+        n_layer: Number of hidden layers in the Transformer encoder.
+        n_head: Number of attention heads for each attention layer in
+            the Transformer encoder.
+        d_inner: The size of the "intermediate" (i.e., feed-forward)
+            layer in the Transformer encoder.
+        ff_activation: The non-linear activation function (function or string) in the
+            encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
+        untie_r: untie relative position biases
+        attn_type: 'bi' for XLNet, 'uni' for Transformer-XL
+
+        dropout: The dropout probabilitiy for all fully connected
+            layers in the embeddings, encoder, and pooler.
+        dropatt: The dropout ratio for the attention
+            probabilities.
+        max_position_embeddings: The maximum sequence length that this model might
+            ever be used with. Typically set this to something large just in case
+            (e.g., 512 or 1024 or 2048).
+        initializer_range: The sttdev of the truncated_normal_initializer for
+            initializing all weight matrices.
+        layer_norm_eps: The epsilon used by LayerNorm.
+
+        dropout: float, dropout rate.
+        dropatt: float, dropout rate on attention probabilities.
+        init: str, the initialization scheme, either "normal" or "uniform".
+        init_range: float, initialize the parameters with a uniform distribution
+            in [-init_range, init_range]. Only effective when init="uniform".
+        init_std: float, initialize the parameters with a normal distribution
+            with mean 0 and stddev init_std. Only effective when init="normal".
+        mem_len: int, the number of tokens to cache.
+        reuse_len: int, the number of tokens in the currect batch to be cached
+            and reused in the future.
+        bi_data: bool, whether to use bidirectional input pipeline.
+            Usually set to True during pretraining and False during finetuning.
+        clamp_len: int, clamp all relative distances larger than clamp_len.
+            -1 means no clamping.
+        same_length: bool, whether to use the same attention length for each token.
+        finetuning_task: name of the glue task on which the model was fine-tuned if any
    """
    pretrained_config_archive_map = XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP

@ -231,47 +272,6 @@ class XLNetConfig(PretrainedConfig):
                 end_n_top=5,
                 **kwargs):
        """Constructs XLNetConfig.
-
-        Args:
-            vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `XLNetModel`.
-            d_model: Size of the encoder layers and the pooler layer.
-            n_layer: Number of hidden layers in the Transformer encoder.
-            n_head: Number of attention heads for each attention layer in
-                the Transformer encoder.
-            d_inner: The size of the "intermediate" (i.e., feed-forward)
-                layer in the Transformer encoder.
-            ff_activation: The non-linear activation function (function or string) in the
-                encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
-            untie_r: untie relative position biases
-            attn_type: 'bi' for XLNet, 'uni' for Transformer-XL
-
-            dropout: The dropout probabilitiy for all fully connected
-                layers in the embeddings, encoder, and pooler.
-            dropatt: The dropout ratio for the attention
-                probabilities.
-            max_position_embeddings: The maximum sequence length that this model might
-                ever be used with. Typically set this to something large just in case
-                (e.g., 512 or 1024 or 2048).
-            initializer_range: The sttdev of the truncated_normal_initializer for
-                initializing all weight matrices.
-            layer_norm_eps: The epsilon used by LayerNorm.
-
-            dropout: float, dropout rate.
-            dropatt: float, dropout rate on attention probabilities.
-            init: str, the initialization scheme, either "normal" or "uniform".
-            init_range: float, initialize the parameters with a uniform distribution
-                in [-init_range, init_range]. Only effective when init="uniform".
-            init_std: float, initialize the parameters with a normal distribution
-                with mean 0 and stddev init_std. Only effective when init="normal".
-            mem_len: int, the number of tokens to cache.
-            reuse_len: int, the number of tokens in the currect batch to be cached
-                and reused in the future.
-            bi_data: bool, whether to use bidirectional input pipeline.
-                Usually set to True during pretraining and False during finetuning.
-            clamp_len: int, clamp all relative distances larger than clamp_len.
-                -1 means no clamping.
-            same_length: bool, whether to use the same attention length for each token.
-            finetuning_task: name of the glue task on which the model was fine-tuned if any
        """
        super(XLNetConfig, self).__init__(**kwargs)

@ -621,6 +621,18 @@ class XLNetPreTrainedModel(PreTrainedModel):


 class XLNetModel(XLNetPreTrainedModel):
+    """XLNet model ("XLNet: Generalized Autoregressive Pretraining for Language Understanding").
+
+    TODO: this was copied from the XLNetLMHeadModel, check that it's ok.
+
+    Args:
+        `config`: a XLNetConfig class instance with the configuration to build a new model
+        `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
+        `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient.
+            This can be used to compute head importance metrics. Default: False
+
+    TODO: Add usage
+    """
    def __init__(self, config):
        super(XLNetModel, self).__init__(config)
        self.output_attentions = config.output_attentions
@ -647,15 +659,23 @@ class XLNetModel(XLNetPreTrainedModel):
        pass

    def create_mask(self, qlen, mlen):
-        """ create causal attention mask.
-            float mask where 1.0 indicate masked, 0.0 indicated not-masked.
-             same_length=False:      same_length=True:
-             <mlen > <  qlen >       <mlen > <  qlen >
-          ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]
-            [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]
-       qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]
-            [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]
-          v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]
+        """
+        Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked.
+
+        Args:
+            qlen: TODO
+            mlen: TODO
+
+        ::
+
+                  same_length=False:      same_length=True:
+                  <mlen > <  qlen >       <mlen > <  qlen >
+               ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]
+                 [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]
+            qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]
+                 [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]
+               v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]
+
        """
        attn_mask = torch.ones([qlen, qlen])
        mask_up = torch.triu(attn_mask, diagonal=1)
@ -736,6 +756,8 @@ class XLNetModel(XLNetPreTrainedModel):
    def forward(self, input_ids, token_type_ids=None, input_mask=None, attention_mask=None,
                mems=None, perm_mask=None, target_mapping=None, inp_q=None, head_mask=None):
        """
+        Performs a model forward pass. **Can be called by calling the class directly, once it has been instantiated.**
+
        Args:
            input_ids: int32 Tensor in shape [bsz, len], the input token IDs.
            token_type_ids: int32 Tensor in shape [bsz, len], the input segment IDs.
@ -772,6 +794,8 @@ class XLNetModel(XLNetPreTrainedModel):
            same_length: bool, whether to use the same attention length for each token.
            summary_type: str, "last", "first", "mean", or "attn". The method
                to pool the input to get a vector representation.
+
+        TODO: Add usage
        """
        # the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end
        # but we want a unified interface in the library with the batch size on the first dimension
@ -921,63 +945,20 @@ class XLNetModel(XLNetPreTrainedModel):
 class XLNetLMHeadModel(XLNetPreTrainedModel):
    """XLNet model ("XLNet: Generalized Autoregressive Pretraining for Language Understanding").

-    Params:
+    Args:
        `config`: a XLNetConfig class instance with the configuration to build a new model
        `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
        `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient.
            This can be used to compute head importance metrics. Default: False

-    Inputs:
-        input_ids: int32 Tensor in shape [bsz, len], the input token IDs.
-        token_type_ids: int32 Tensor in shape [bsz, len], the input segment IDs.
-        input_mask: [optional] float32 Tensor in shape [bsz, len], the input mask.
-            0 for real tokens and 1 for padding.
-        attention_mask: [optional] float32 Tensor, SAME FUNCTION as `input_mask`
-            but with 1 for real tokens and 0 for padding.
-            Added for easy compatibility with the BERT model (which uses this negative masking).
-            You can only uses one among `input_mask` and `attention_mask`
-        mems: [optional] a list of float32 Tensors in shape [mem_len, bsz, d_model], memory
-            from previous batches. The length of the list equals n_layer.
-            If None, no memory is used.
-        perm_mask: [optional] float32 Tensor in shape [bsz, len, len].
-            If perm_mask[k, i, j] = 0, i attend to j in batch k;
-            if perm_mask[k, i, j] = 1, i does not attend to j in batch k.
-            If None, each position attends to all the others.
-        target_mapping: [optional] float32 Tensor in shape [bsz, num_predict, len].
-            If target_mapping[k, i, j] = 1, the i-th predict in batch k is
-            on the j-th token.
-            Only used during pretraining for partial prediction.
-            Set to None during finetuning.
-        inp_q: [optional] float32 Tensor in shape [bsz, len].
-            1 for tokens with losses and 0 for tokens without losses.
-            Only used during pretraining for two-stream attention.
-            Set to None during finetuning.


-    Outputs: Tuple of (encoded_layers, pooled_output)
-        `encoded_layers`: controled by `output_all_encoded_layers` argument:
-            - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
-                of each attention block (i.e. 12 full sequences for XLNet-base, 24 for XLNet-large), each
-                encoded-hidden-state is a ``torch.FloatTensor`` of size [batch_size, sequence_length, d_model],
-            - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
-                to the last attention block of shape [batch_size, sequence_length, d_model],
-        `pooled_output`: a ``torch.FloatTensor`` of size [batch_size, d_model] which is the output of a
-            classifier pretrained on top of the hidden state associated to the first character of the
-            input (`CLS`) to train on the Next-Sentence task (see XLNet's paper).
+    Example::

-    Example usage:
-    ```python
-    # Already been converted into WordPiece token ids
-    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
-    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
+        config = modeling.XLNetConfig(vocab_size_or_config_json_file=32000, d_model=768,
+            n_layer=12, num_attention_heads=12, intermediate_size=3072)

-    config = modeling.XLNetConfig(vocab_size_or_config_json_file=32000, d_model=768,
-        n_layer=12, num_attention_heads=12, intermediate_size=3072)
-
-    model = modeling.XLNetModel(config=config)
-    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
-    ```
+        model = modeling.XLNetModel(config=config)
    """
    def __init__(self, config):
        super(XLNetLMHeadModel, self).__init__(config)
@ -1005,34 +986,61 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
                mems=None, perm_mask=None, target_mapping=None, inp_q=None,
                labels=None, head_mask=None):
        """
+         all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
+
        Args:
            input_ids: int32 Tensor in shape [bsz, len], the input token IDs.
            token_type_ids: int32 Tensor in shape [bsz, len], the input segment IDs.
-            input_mask: float32 Tensor in shape [bsz, len], the input mask.
+            input_mask: [optional] float32 Tensor in shape [bsz, len], the input mask.
                0 for real tokens and 1 for padding.
            attention_mask: [optional] float32 Tensor, SAME FUNCTION as `input_mask`
                but with 1 for real tokens and 0 for padding.
                Added for easy compatibility with the BERT model (which uses this negative masking).
                You can only uses one among `input_mask` and `attention_mask`
-            mems: a list of float32 Tensors in shape [mem_len, bsz, d_model], memory
+            mems: [optional] a list of float32 Tensors in shape [mem_len, bsz, d_model], memory
                from previous batches. The length of the list equals n_layer.
                If None, no memory is used.
-            perm_mask: float32 Tensor in shape [bsz, len, len].
+            perm_mask: [optional] float32 Tensor in shape [bsz, len, len].
                If perm_mask[k, i, j] = 0, i attend to j in batch k;
                if perm_mask[k, i, j] = 1, i does not attend to j in batch k.
                If None, each position attends to all the others.
-            target_mapping: float32 Tensor in shape [bsz, num_predict, len].
+            target_mapping: [optional] float32 Tensor in shape [bsz, num_predict, len].
                If target_mapping[k, i, j] = 1, the i-th predict in batch k is
                on the j-th token.
                Only used during pretraining for partial prediction.
                Set to None during finetuning.
-            inp_q: float32 Tensor in shape [bsz, len].
+            inp_q: [optional] float32 Tensor in shape [bsz, len].
                1 for tokens with losses and 0 for tokens without losses.
                Only used during pretraining for two-stream attention.
                Set to None during finetuning.

-            summary_type: str, "last", "first", "mean", or "attn". The method
-                to pool the input to get a vector representation.
+
+        Returns:
+            A ``tuple(encoded_layers, pooled_output)``, with
+
+                ``encoded_layers``: controlled by ``output_all_encoded_layers`` argument:
+
+                    - ``output_all_encoded_layers=True``: outputs a list of the full sequences of encoded-hidden-states \
+                    at the end of each attention block (i.e. 12 full sequences for XLNet-base, 24 for XLNet-large), \
+                    each encoded-hidden-state is a ``torch.FloatTensor`` of size [batch_size, sequence_length, d_model],
+
+                    - ``output_all_encoded_layers=False``: outputs only the full sequence of hidden-states corresponding \
+                    to the last attention block of shape [batch_size, sequence_length, d_model],
+
+                ``pooled_output``: a ``torch.FloatTensor`` of size [batch_size, d_model] which is the output of a \
+                classifier pretrained on top of the hidden state associated to the first character of the \
+                input (`CLS`) to train on the Next-Sentence task (see XLNet's paper).
+
+        Example::
+
+            # Already been converted into WordPiece token ids
+            input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+            input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
+            token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
+
+            all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
+            # or
+            all_encoder_layers, pooled_output = model.forward(input_ids, token_type_ids, input_mask)
        """
        transformer_outputs = self.transformer(input_ids, token_type_ids, input_mask, attention_mask,
                                               mems, perm_mask, target_mapping, inp_q, head_mask)
@ -1054,7 +1062,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
 class XLNetForSequenceClassification(XLNetPreTrainedModel):
    """XLNet model ("XLNet: Generalized Autoregressive Pretraining for Language Understanding").

-    Params:
+    Args:
        `config`: a XLNetConfig class instance with the configuration to build a new model
        `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
        `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient.
@ -1062,58 +1070,16 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
        `summary_type`: str, "last", "first", "mean", or "attn". The method
            to pool the input to get a vector representation. Default: last

-    Inputs:
-        input_ids: int32 Tensor in shape [bsz, len], the input token IDs.
-        token_type_ids: int32 Tensor in shape [bsz, len], the input segment IDs.
-        input_mask: float32 Tensor in shape [bsz, len], the input mask.
-            0 for real tokens and 1 for padding.
-        attention_mask: [optional] float32 Tensor, SAME FUNCTION as `input_mask`
-            but with 1 for real tokens and 0 for padding.
-            Added for easy compatibility with the BERT model (which uses this negative masking).
-            You can only uses one among `input_mask` and `attention_mask`
-        mems: a list of float32 Tensors in shape [mem_len, bsz, d_model], memory
-            from previous batches. The length of the list equals n_layer.
-            If None, no memory is used.
-        perm_mask: float32 Tensor in shape [bsz, len, len].
-            If perm_mask[k, i, j] = 0, i attend to j in batch k;
-            if perm_mask[k, i, j] = 1, i does not attend to j in batch k.
-            If None, each position attends to all the others.
-        target_mapping: float32 Tensor in shape [bsz, num_predict, len].
-            If target_mapping[k, i, j] = 1, the i-th predict in batch k is
-            on the j-th token.
-            Only used during pretraining for partial prediction.
-            Set to None during finetuning.
-        inp_q: float32 Tensor in shape [bsz, len].
-            1 for tokens with losses and 0 for tokens without losses.
-            Only used during pretraining for two-stream attention.
-            Set to None during finetuning.
-        `head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
-            It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.


-    Outputs: Tuple of (logits or loss, mems)
-        `logits or loss`:
-            if labels is None:
-                Token logits with shape [batch_size, sequence_length] 
-            else:
-                CrossEntropy loss with the targets
-        `new_mems`: list (num layers) of updated mem states at the entry of each layer
-            each mem state is a ``torch.FloatTensor`` of size [self.config.mem_len, batch_size, self.config.d_model]
-            Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
+    Example::

-    Example usage:
-    ```python
-    # Already been converted into WordPiece token ids
-    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
-    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
+        # Already been converted into WordPiece token ids
+        input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+        input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
+        token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

-    config = modeling.XLNetConfig(vocab_size_or_config_json_file=32000, d_model=768,
-        n_layer=12, num_attention_heads=12, intermediate_size=3072)
-
-    model = modeling.XLNetModel(config=config)
-    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
-    ```
+        all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
    """
    def __init__(self, config):
        super(XLNetForSequenceClassification, self).__init__(config)
@ -1129,6 +1095,8 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
                mems=None, perm_mask=None, target_mapping=None, inp_q=None,
                labels=None, head_mask=None):
        """
+        Performs a model forward pass. **Can be called by calling the class directly, once it has been instantiated.**
+
        Args:
            input_ids: int32 Tensor in shape [bsz, len], the input token IDs.
            token_type_ids: int32 Tensor in shape [bsz, len], the input segment IDs.
@ -1148,12 +1116,38 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
            target_mapping: float32 Tensor in shape [bsz, num_predict, len].
                If target_mapping[k, i, j] = 1, the i-th predict in batch k is
                on the j-th token.
-                Only used during pretraining for partial prediction.
-                Set to None during finetuning.
+                Only used during pre-training for partial prediction.
+                Set to None during fine-tuning.
            inp_q: float32 Tensor in shape [bsz, len].
                1 for tokens with losses and 0 for tokens without losses.
-                Only used during pretraining for two-stream attention.
-                Set to None during finetuning.
+                Only used during pre-training for two-stream attention.
+                Set to None during fine-tuning.
+            labels: TODO
+            head_mask: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
+                It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
+
+
+        Returns:
+            A ``tuple(logits_or_loss, mems)``
+
+                ``logits_or_loss``: if ``labels`` is ``None``, ``logits_or_loss`` corresponds to token logits with shape \
+                [batch_size, sequence_length]. If it is not ``None``, it corresponds to the ``CrossEntropy`` loss \
+                with the targets.
+
+                ``new_mems``: list (num layers) of updated mem states at the entry of each layer \
+                each mem state is a ``torch.FloatTensor`` of size [self.config.mem_len, batch_size, self.config.d_model] \
+                Note that the first two dimensions are transposed in ``mems`` with regards to ``input_ids`` and ``labels``
+
+        Example::
+
+            # Already been converted into WordPiece token ids
+            input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+            input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
+            token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
+
+            all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
+            # or
+            all_encoder_layers, pooled_output = model.forward(input_ids, token_type_ids, input_mask)
        """
        transformer_outputs = self.transformer(input_ids, token_type_ids, input_mask, attention_mask,
                                               mems, perm_mask, target_mapping, inp_q, head_mask)
@ -1178,60 +1172,24 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):


 class XLNetForQuestionAnswering(XLNetPreTrainedModel):
-    """ XLNet model for Question Answering (span extraction).
-    This module is composed of the XLNet model with a linear layer on top of
-    the sequence output that computes start_logits and end_logits
+    """
+    XLNet model for Question Answering (span extraction).

-    Params:
+    This module is composed of the XLNet model with a linear layer on top of
+    the sequence output that computes ``start_logits`` and ``end_logits``
+
+    Args:
        `config`: a XLNetConfig class instance with the configuration to build a new model
        `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
        `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient.
            This can be used to compute head importance metrics. Default: False

-    Inputs:
-        `input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
-            with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
-            `run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
-        `token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
-            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
-            a `sentence B` token (see XLNet paper for more details).
-        `attention_mask`: [optional] float32 Tensor, SAME FUNCTION as `input_mask`
-            but with 1 for real tokens and 0 for padding.
-            Added for easy compatibility with the BERT model (which uses this negative masking).
-            You can only uses one among `input_mask` and `attention_mask`
-        `input_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
-            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
-            input sequence length in the current batch. It's the mask that we typically use for attention when
-            a batch has varying length sentences.
-        `start_positions`: position of the first token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
-            Positions are clamped to the length of the sequence and position outside of the sequence are not taken
-            into account for computing the loss.
-        `end_positions`: position of the last token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
-            Positions are clamped to the length of the sequence and position outside of the sequence are not taken
-            into account for computing the loss.
-        `head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
-            It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
+    Example::

-    Outputs:
-        if `start_positions` and `end_positions` are not `None`:
-            Outputs the total_loss which is the sum of the CrossEntropy loss for the start and end token positions.
-        if `start_positions` or `end_positions` is `None`:
-            Outputs a tuple of start_logits, end_logits which are the logits respectively for the start and end
-            position tokens of shape [batch_size, sequence_length].
+        config = XLNetConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
+            num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

-    Example usage:
-    ```python
-    # Already been converted into WordPiece token ids
-    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
-    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
-
-    config = XLNetConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
-        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
-
-    model = XLNetForQuestionAnswering(config)
-    start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
-    ```
+        model = XLNetForQuestionAnswering(config)
    """
    def __init__(self, config):
        super(XLNetForQuestionAnswering, self).__init__(config)
@ -1249,6 +1207,53 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
                mems=None, perm_mask=None, target_mapping=None, inp_q=None,
                start_positions=None, end_positions=None, cls_index=None, is_impossible=None, p_mask=None,
                head_mask=None):
+
+        """
+        Performs a model forward pass. **Can be called by calling the class directly, once it has been instantiated.**
+
+        Args:
+            `input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
+                with the word token indices in the vocabulary(see the tokens pre-processing logic in the scripts
+                `run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
+            `token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
+                types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
+                a `sentence B` token (see XLNet paper for more details).
+            `attention_mask`: [optional] float32 Tensor, SAME FUNCTION as `input_mask`
+                but with 1 for real tokens and 0 for padding.
+                Added for easy compatibility with the BERT model (which uses this negative masking).
+                You can only uses one among ``input_mask`` and ``attention_mask``
+            `input_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
+                selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
+                input sequence length in the current batch. It's the mask that we typically use for attention when
+                a batch has varying length sentences.
+            `start_positions`: position of the first token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
+                Positions are clamped to the length of the sequence and position outside of the sequence are not taken
+                into account for computing the loss.
+            `end_positions`: position of the last token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
+                Positions are clamped to the length of the sequence and position outside of the sequence are not taken
+                into account for computing the loss.
+            `head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
+                It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
+
+        Returns:
+            if ``start_positions`` and ``end_positions`` are not ``None``, outputs the total_loss which is the sum of the \
+            ``CrossEntropy`` loss for the start and end token positions.
+
+            if ``start_positions`` or ``end_positions`` is ``None``, outputs a tuple of ``start_logits``, ``end_logits``
+            which are the logits respectively for the start and end position tokens of shape \
+            [batch_size, sequence_length].
+
+        Example::
+
+            # Already been converted into WordPiece token ids
+            input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+            input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
+            token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
+
+            start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
+            # or
+            start_logits, end_logits = model.forward(input_ids, token_type_ids, input_mask)
+        """
        transformer_outputs = self.transformer(input_ids, token_type_ids, input_mask, attention_mask,
                                               mems, perm_mask, target_mapping, inp_q, head_mask)
        hidden_states = transformer_outputs[0]