# Llama 2

[Llama 2](https://huggingface.co/papers/2307.09288) is a family of large language models, Llama 2 and Llama 2-Chat, available in 7B, 13B, and 70B parameters. The Llama 2 model mostly keeps the same architecture as [Llama](./llama), but it is pretrained on more tokens, doubles the context length, and uses grouped-query attention (GQA) in the 70B model to improve inference. Llama 2-Chat is trained with supervised fine-tuning (SFT), and reinforcement learning with human feedback (RLHF) - rejection sampling and proximal policy optimization (PPO) - is applied to the fine-tuned model to align the chat model with human preferences.

You can find all the original Llama 2 checkpoints under the [Llama 2 Family](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b) collection.

> [!TIP]
> Click on the Llama 2 models in the right sidebar for more examples of how to apply Llama 2 to different language tasks.

The examples below demonstrate how to generate text with [`Pipeline`] and [`AutoModel`], and how to chat with Llama 2-Chat from the command line.

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="text-generation",
    model="meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device=0
)
pipeline("Plants create energy through a process known as")
```

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")

output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

```bash
transformers chat meta-llama/Llama-2-7b-chat-hf --torch_dtype auto --attn_implementation flash_attention_2
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [torchao](../quantization/torchao) to quantize only the weights to int4.

```py
# pip install torchao
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")

output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.

```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

visualizer = AttentionMaskVisualizer("meta-llama/Llama-2-7b-hf")
visualizer("Plants create energy through a process known as")
```
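The `transformers chat` CLI shown above talks to the instruction-tuned checkpoint. The same conversation format can be reproduced in Python with the tokenizer's chat template. This is a minimal sketch; the system prompt, user message, and generation settings are illustrative.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

# build the Llama 2 chat prompt from a list of messages (contents are illustrative)
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "How do plants create energy?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
# decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```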
## Notes

- Setting `config.pretraining_tp` to a value other than `1` activates a more accurate but slower computation of the linear layers, which better matches the original logits.
- The original model uses `pad_id = -1` to indicate a padding token. The Transformers implementation requires adding a padding token and resizing the token embedding accordingly.

    ```py
    # add a padding token, register it in the config, and resize the embeddings to match
    tokenizer.add_special_tokens({"pad_token": "<pad>"})
    model.config.pad_token_id = tokenizer.pad_token_id
    model.resize_token_embeddings(len(tokenizer))
    ```

- It is recommended to initialize the `embed_tokens` layer with the following code to ensure encoding the padding token outputs zeros.

    ```py
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx)
    ```

- The tokenizer is a byte-pair encoding model based on [SentencePiece](https://github.com/google/sentencepiece). During decoding, if the first token is the start of a word (for example, "Banana"), the tokenizer doesn't prepend the prefix space to the string.
- Don't use the `torch_dtype` parameter in [`~AutoModel.from_pretrained`] with FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to `True` if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).

## LlamaConfig

[[autodoc]] LlamaConfig

## LlamaTokenizer

[[autodoc]] LlamaTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## LlamaTokenizerFast

[[autodoc]] LlamaTokenizerFast
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - update_post_processor
    - save_vocabulary

## LlamaModel

[[autodoc]] LlamaModel
    - forward

## LlamaForCausalLM

[[autodoc]] LlamaForCausalLM
    - forward

## LlamaForSequenceClassification

[[autodoc]] LlamaForSequenceClassification
    - forward
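A minimal usage sketch for sequence classification with Llama 2. The checkpoint, label count, and example texts are illustrative, the classification head is randomly initialized until it is fine-tuned, and the padding setup reuses the EOS token because Llama 2 has no padding token by default (see the Notes above).

```py
import torch
from transformers import AutoTokenizer, LlamaForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = LlamaForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=2,  # illustrative label count; the head is untrained until fine-tuned
    torch_dtype=torch.float16,
    device_map="auto",
)

# reuse the EOS token as the padding token so batched inputs can be padded
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer(
    ["I loved this movie.", "This was a waste of time."],
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))
```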