From 3b3ebcec4077f124f2cd0ec3cd5d028dc352a3e5 Mon Sep 17 00:00:00 2001
From: Andy Vu
Date: Tue, 27 May 2025 16:24:36 -0700
Subject: [PATCH] Updated model card for OLMo2 (#38394)

* Updated OLMo2 model card

* added command line

* Add suggestions

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Added suggestions

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Indented code block as per suggestions

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/olmo2.md | 122 ++++++++++++++++++++++++++----
 1 file changed, 107 insertions(+), 15 deletions(-)

diff --git a/docs/source/en/model_doc/olmo2.md b/docs/source/en/model_doc/olmo2.md
index 24030b85524..1ed21b660f1 100644
--- a/docs/source/en/model_doc/olmo2.md
+++ b/docs/source/en/model_doc/olmo2.md
@@ -14,27 +14,119 @@ rendered properly in your Markdown viewer.
 -->
 
-# OLMo2
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
-</div>
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>
+
-## Overview
+# OLMo2
+
+[OLMo2](https://huggingface.co/papers/2501.00656) improves on [OLMo](./olmo) by changing the architecture and training recipes of the original models. This includes excluding all biases to improve training stability, non-parametric layer norm, SwiGLU activation function, rotary positional embeddings, and a modified BPE-based tokenizer that masks personally identifiable information. It is pretrained on [Dolma](https://huggingface.co/datasets/allenai/dolma), a dataset of 3T tokens.
-The OLMo2 model is the successor of the OLMo model, which was proposed in
-[OLMo: Accelerating the Science of Language Models](https://arxiv.org/abs/2402.00838).
+
+You can find all the original OLMo2 checkpoints under the [OLMo2](https://huggingface.co/collections/allenai/olmo-2-674117b93ab84e98afc72edc) collection.
-The architectural changes from the original OLMo model to this model are:
+
+> [!TIP]
+> Click on the OLMo2 models in the right sidebar for more examples of how to apply OLMo2 to different language tasks.
-- RMSNorm is used instead of standard layer norm.
-- Norm is applied to attention queries and keys.
-- Norm is applied after attention/feedforward layers rather than before.
+
+The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
-This model was contributed by [shanearora](https://huggingface.co/shanearora).
-The original code can be found [here](https://github.com/allenai/OLMo/tree/main/olmo).
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```py
+import torch
+from transformers import pipeline
+
+pipe = pipeline(
+    task="text-generation",
+    model="allenai/OLMo-2-0425-1B",
+    torch_dtype=torch.float16,
+    device=0,
+)
+
+result = pipe("Plants create energy through a process known as")
+print(result)
+```
+
+</hfoption>
+<hfoption id="AutoModel">
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(
+    "allenai/OLMo-2-0425-1B"
+)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "allenai/OLMo-2-0425-1B",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    attn_implementation="sdpa"
+)
+input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
+
+output = model.generate(**input_ids, max_length=50, cache_implementation="static")
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+</hfoption>
+<hfoption id="transformers CLI">
+
+```bash
+echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model allenai/OLMo-2-0425-1B --device 0
+```
+
+</hfoption>
+</hfoptions>
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to quantize only the weights to 4-bits.
+
+```py
+# pip install torchao
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+
+torchao_config = TorchAoConfig(
+    "int4_weight_only",
+    group_size=128
+)
+
+tokenizer = AutoTokenizer.from_pretrained(
+    "allenai/OLMo-2-0425-1B"
+)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "allenai/OLMo-2-0425-1B",
+    quantization_config=torchao_config,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    attn_implementation="sdpa"
+)
+input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
+
+output = model.generate(**input_ids, max_length=50, cache_implementation="static")
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+## Notes
+
+- OLMo2 uses RMSNorm instead of standard layer norm. The RMSNorm is applied to attention queries and keys, and it is applied after the attention and feedforward layers rather than before.
+- OLMo2 requires Transformers v4.48 or higher.
+- Load specific intermediate checkpoints by adding the `revision` parameter to [`~PreTrainedModel.from_pretrained`].
+
+    ```py
+    from transformers import AutoModelForCausalLM
+
+    model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0425-1B", revision="stage1-step140000-tokens294B")
+    ```
 
 ## Olmo2Config
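For readers who want to see the norm placement described in the Notes reflected in code, the snippet below is a small illustrative sketch, not part of the patch above. It builds the model skeleton from its config and prints one decoder layer; the module names it mentions (`q_norm`, `k_norm`, `post_attention_layernorm`, `post_feedforward_layernorm`) are taken from the current Transformers OLMo2 implementation and may differ in other versions.

```py
# Illustrative sketch, not part of the patch: inspect how OLMo2 places its
# RMSNorm layers without downloading the pretrained weights.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("allenai/OLMo-2-0425-1B")
model = AutoModelForCausalLM.from_config(config)  # randomly initialized, structure only

# Expect RMSNorm modules on the attention queries and keys (q_norm, k_norm) and
# after the attention/feedforward blocks (post_attention_layernorm,
# post_feedforward_layernorm) rather than the usual pre-attention layer norm.
print(model.model.layers[0])
```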