
GPT-2
GPT-2 is a scaled-up version of GPT, a causal transformer language model, with 10x more parameters and training data. The model was pretrained on a 40GB dataset to predict the next word in a sequence based on all the previous words. This approach enabled the model to perform many downstream tasks in a zero-shot setting.
The model architecture uses a unidirectional (causal) attention mechanism where each token can only attend to previous tokens, making it particularly effective for text generation tasks.
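The snippet below is a minimal, illustrative sketch (not code from the GPT-2 implementation) of what such a causal mask looks like: each position may only attend to itself and earlier positions.
import torch
seq_len = 5
# lower-triangular mask: row i is True only for columns 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# attention scores at False positions are set to -inf before the softmax,
# so a token's representation never depends on tokens to its right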
You can find all the original GPT-2 checkpoints under the OpenAI community organization.
Tip
Click on the GPT-2 models in the right sidebar for more examples of how to apply GPT-2 to different language tasks.
The example below demonstrates how to generate text with [Pipeline] or [AutoModel], and from the command line.
import torch
from transformers import pipeline
pipeline = pipeline(task="text-generation", model="openai-community/gpt2", torch_dtype=torch.float16, device=0)
pipeline("Hello, I'm a language model")
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to("cuda")
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
echo -e "Hello, I'm a language model" | transformers run --task text-generation --model openai-community/gpt2 --device 0
The model can also be served with vLLM using the transformers backend.
vllm serve openai-community/gpt2 --model-impl transformers
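Once the server is running, you can query it through vLLM's OpenAI-compatible API. The snippet below is a sketch that assumes the default address (http://localhost:8000); adjust it if you launched the server with a different host or port.
import requests
# query the OpenAI-compatible completions endpoint exposed by vLLM
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "openai-community/gpt2",
        "prompt": "Hello, I'm a language model",
        "max_tokens": 50,
    },
)
print(response.json()["choices"][0]["text"])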
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to quantize only the weights to 4 bits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2-xl",
    quantization_config=quantization_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-xl")
inputs = tokenizer("Once upon a time, there was a magical forest", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
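To confirm the memory savings, you can inspect the quantized model's footprint with get_memory_footprint, which reports the size of the model's parameters and buffers.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")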
Notes
- Pad inputs on the right because GPT-2 uses absolute position embeddings (see the sketch after this list).
- GPT-2 can reuse previously computed key-value attention pairs. Access this feature with the past_key_values parameter in [GPT2Model.forward].
- Enable the scale_attn_by_inverse_layer_idx and reorder_and_upcast_attn parameters to apply the training stability improvements from Mistral (also shown in the sketch below).
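The snippet below is a minimal sketch of how these notes translate into code, assuming the openai-community/gpt2 checkpoint: right-padding with the EOS token as the padding token (GPT-2 does not define a dedicated one), and enabling the two training stability flags through the model config.
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
# GPT-2 has no padding token, so reuse the EOS token and pad on the right
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# enable the Mistral training stability improvements via the config
model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2",
    scale_attn_by_inverse_layer_idx=True,
    reorder_and_upcast_attn=True,
)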
GPT2Config
autodoc GPT2Config
GPT2Tokenizer
autodoc GPT2Tokenizer - save_vocabulary
GPT2TokenizerFast
autodoc GPT2TokenizerFast
GPT2 specific outputs
autodoc models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
autodoc models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
GPT2Model
autodoc GPT2Model - forward
GPT2LMHeadModel
autodoc GPT2LMHeadModel - forward
GPT2DoubleHeadsModel
autodoc GPT2DoubleHeadsModel - forward
GPT2ForQuestionAnswering
autodoc GPT2ForQuestionAnswering - forward
GPT2ForSequenceClassification
autodoc GPT2ForSequenceClassification - forward
GPT2ForTokenClassification
autodoc GPT2ForTokenClassification - forward
TFGPT2Model
autodoc TFGPT2Model - call
TFGPT2LMHeadModel
autodoc TFGPT2LMHeadModel - call
TFGPT2DoubleHeadsModel
autodoc TFGPT2DoubleHeadsModel - call
TFGPT2ForSequenceClassification
autodoc TFGPT2ForSequenceClassification - call
TFSequenceClassifierOutputWithPast
autodoc modeling_tf_outputs.TFSequenceClassifierOutputWithPast
TFGPT2Tokenizer
autodoc TFGPT2Tokenizer
FlaxGPT2Model
autodoc FlaxGPT2Model - call
FlaxGPT2LMHeadModel
autodoc FlaxGPT2LMHeadModel - call