PyTorch | Flax | FlashAttention
## GPT-Neo

[GPT-Neo](https://zenodo.org/records/5297715) is an open-source alternative to the GPT-2 and GPT-3 models, built with Mesh TensorFlow for TPUs. GPT-Neo uses local attention in every other layer for more efficiency. It is trained on the [Pile](https://huggingface.co/datasets/EleutherAI/pile), a diverse dataset consisting of 22 smaller high-quality datasets.

You can find all the original GPT-Neo checkpoints under the [EleutherAI](https://huggingface.co/EleutherAI?search_models=gpt-neo) organization.

> [!TIP]
> Click on the GPT-Neo models in the right sidebar for more examples of how to apply GPT-Neo to different language tasks.

The examples below demonstrate how to generate text with [`Pipeline`], with [`AutoModel`], and from the command line.

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="EleutherAI/gpt-neo-1.3B", torch_dtype=torch.float16, device=0)
pipeline("Hello, I'm a language model")
```

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to("cuda")

output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

```bash
echo -e "Hello, I'm a language model" | transformers-cli run --task text-generation --model EleutherAI/gpt-neo-1.3B --device 0
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to 4-bits.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-2.7B",
    quantization_config=quantization_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Notes

- Pad inputs on the right because GPT-Neo uses absolute position embeddings (a short right-padding sketch appears at the end of this page).

## GPTNeoConfig

[[autodoc]] GPTNeoConfig

## GPTNeoModel

[[autodoc]] GPTNeoModel
    - forward

## GPTNeoForCausalLM

[[autodoc]] GPTNeoForCausalLM
    - forward

## GPTNeoForQuestionAnswering

[[autodoc]] GPTNeoForQuestionAnswering
    - forward

## GPTNeoForSequenceClassification

[[autodoc]] GPTNeoForSequenceClassification
    - forward

## GPTNeoForTokenClassification

[[autodoc]] GPTNeoForTokenClassification
    - forward

## FlaxGPTNeoModel

[[autodoc]] FlaxGPTNeoModel
    - __call__

## FlaxGPTNeoForCausalLM

[[autodoc]] FlaxGPTNeoForCausalLM
    - __call__
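
As a complement to the right-padding note in the Notes section above, here is a minimal sketch (not taken from the original examples) of right-padded batched inference. It assumes the `EleutherAI/gpt-neo-1.3B` checkpoint and reuses the EOS token as the pad token, since the GPT-2-style tokenizer GPT-Neo ships with does not define one.

```py
# Minimal sketch: right-padded batching for GPT-Neo, which uses absolute position embeddings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B", padding_side="right")
tokenizer.pad_token = tokenizer.eos_token  # no pad token is defined by default

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Right padding keeps the real tokens at their original positions; the attention mask
# tells the model to ignore the padded tail of the shorter sequence.
batch = tokenizer(
    ["Hello, I'm a language model", "The Pile is a diverse dataset"],
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # (batch_size, padded_sequence_length, vocab_size)
```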