<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# OpenAI GPT2

## Overview

The OpenAI GPT-2 model was proposed in [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by Alec
Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.

The abstract from the paper is the following:

*GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million
web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some
text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks
across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than
10X the amount of data.*

Tips:

- GPT-2 is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather than
  the left.
- GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text, as can be
  observed in the *run_generation.py* example script.
- The model can take the *past_key_values* (for PyTorch) or *past* (for TF) as input, which are the previously computed
  key/value attention pairs. Using this value prevents the model from re-computing values it has already computed during
  text generation. For PyTorch, see the *past_key_values* argument of the [`GPT2Model.forward`] method, or for TF the
  *past* argument of the [`TFGPT2Model.call`] method, for more information on its usage. A minimal generation sketch
  using this cache is shown after these tips.
- Enabling the *scale_attn_by_inverse_layer_idx* and *reorder_and_upcast_attn* flags will apply the training stability
  improvements from [Mistral](https://github.com/stanford-crfm/mistral/) (for PyTorch only).
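
A minimal generation sketch in PyTorch, assuming the standard `gpt2` checkpoint: `generate` keeps the cached key/value
pairs (`use_cache=True` is the default), and the last block drives the same cache by hand through *past_key_values*, so
that only the newly generated token is fed back in:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("GPT-2 is a model developed by OpenAI", return_tensors="pt")

# Sampling-based generation; the key/value cache avoids re-encoding the prefix at every step.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_length=40,
        do_sample=True,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS to silence the warning
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# The same cache driven manually: pass only the new token together with past_key_values.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    out = model(next_token, past_key_values=out.past_key_values)
```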
[Write With Transformer](https://transformer.huggingface.co/doc/gpt2-large) is a webapp created and hosted by
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
different sizes: small, medium, large, xl and a distilled version of the small checkpoint: *distilgpt-2*.

This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://openai.com/blog/better-language-models/).
## GPT2Config

[[autodoc]] GPT2Config
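
To illustrate the configuration API (including the two stability flags mentioned in the tips above), the sketch below
builds a deliberately small, randomly initialized model from a `GPT2Config`; the sizes and flag values are illustrative,
not recommended settings:

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    n_positions=512,
    n_embd=256,
    n_layer=4,
    n_head=4,
    scale_attn_by_inverse_layer_idx=True,  # training stability improvement from Mistral (PyTorch only)
    reorder_and_upcast_attn=True,          # training stability improvement from Mistral (PyTorch only)
)

model = GPT2LMHeadModel(config)  # randomly initialized, not pretrained
print(f"{model.num_parameters():,} parameters")
```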
## GPT2Tokenizer

[[autodoc]] GPT2Tokenizer
    - save_vocabulary
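
A short tokenizer sketch: GPT-2 uses a byte-level BPE, so arbitrary text can be encoded without ever falling back to an
unknown token (the local path mentioned in the final comment is hypothetical):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

encoded = tokenizer("Hello world, GPT-2 uses a byte-level BPE.")
print(encoded["input_ids"])
print(tokenizer.decode(encoded["input_ids"]))  # recovers the original text

# To write out vocab.json and merges.txt to an existing directory:
# tokenizer.save_vocabulary("./my-gpt2-tokenizer")
```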
## GPT2TokenizerFast

[[autodoc]] GPT2TokenizerFast

## GPT2 specific outputs

[[autodoc]] models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput

[[autodoc]] models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
## GPT2Model

[[autodoc]] GPT2Model
    - forward
    - parallelize
    - deparallelize

## GPT2LMHeadModel

[[autodoc]] GPT2LMHeadModel
    - forward
    - parallelize
    - deparallelize
## GPT2DoubleHeadsModel

[[autodoc]] GPT2DoubleHeadsModel
    - forward
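
Because the double-heads model (a language-modeling head plus a multiple-choice classification head) is the least
self-explanatory class on this page, here is a small sketch following the pattern used in the docstrings: each choice
ends with a newly added `[CLS]` token whose hidden state feeds the multiple-choice head. The new token's embedding is
randomly initialized, so the choice scores are only meaningful after fine-tuning.

```python
import torch
from transformers import GPT2DoubleHeadsModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2DoubleHeadsModel.from_pretrained("gpt2")

# Add a [CLS] token used to pool each choice, and resize the embeddings accordingly.
tokenizer.add_special_tokens({"cls_token": "[CLS]"})
model.resize_token_embeddings(len(tokenizer))

choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
encoded = [tokenizer.encode(choice) for choice in choices]
cls_positions = [ids.index(tokenizer.cls_token_id) for ids in encoded]

input_ids = torch.tensor(encoded).unsqueeze(0)  # (batch size 1, 2 choices, sequence length)
mc_token_ids = torch.tensor([cls_positions])    # position of [CLS] within each choice

outputs = model(input_ids, mc_token_ids=mc_token_ids)
lm_logits = outputs.logits     # next-token logits for every position
mc_logits = outputs.mc_logits  # one score per choice
```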
## GPT2ForSequenceClassification

[[autodoc]] GPT2ForSequenceClassification
    - forward
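
GPT-2 defines no padding token, and `GPT2ForSequenceClassification` pools the hidden state of the last non-padding
token, so a pad token has to be declared before classifying padded batches. A minimal sketch with an untrained
classification head and illustrative inputs:

```python
import torch
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)

# Reuse the EOS token for padding, since GPT-2 has no dedicated pad token.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer(
    ["I really enjoyed this movie.", "Terrible, would not recommend."],
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits

# The classification head is freshly initialized, so these predictions are meaningless
# until the model is fine-tuned on labeled data.
print(logits.argmax(dim=-1))
```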
## GPT2ForTokenClassification

[[autodoc]] GPT2ForTokenClassification
    - forward

## TFGPT2Model

[[autodoc]] TFGPT2Model
    - call

## TFGPT2LMHeadModel

[[autodoc]] TFGPT2LMHeadModel
    - call

## TFGPT2DoubleHeadsModel

[[autodoc]] TFGPT2DoubleHeadsModel
    - call

## TFGPT2ForSequenceClassification

[[autodoc]] TFGPT2ForSequenceClassification
    - call

## TFSequenceClassifierOutputWithPast

[[autodoc]] modeling_tf_outputs.TFSequenceClassifierOutputWithPast

## FlaxGPT2Model

[[autodoc]] FlaxGPT2Model
    - __call__

## FlaxGPT2LMHeadModel

[[autodoc]] FlaxGPT2LMHeadModel
    - __call__