PyTorch | FlashAttention | SDPA
# Granite

[Granite](https://huggingface.co/papers/2408.13359) is a 3B parameter language model trained with the Power scheduler. Discovering a good learning rate for pretraining large language models is difficult because it depends on so many variables (batch size, number of training tokens, etc.) and it is expensive to perform a hyperparameter search. The Power scheduler is based on a power-law relationship between the variables and their transferability to larger models. Combining the Power scheduler with Maximum Update Parameterization (MUP) allows a model to be pretrained with one set of hyperparameters regardless of all the variables.

You can find all the original Granite checkpoints under the [IBM-Granite](https://huggingface.co/ibm-granite) organization.

> [!TIP]
> Click on the Granite models in the right sidebar for more examples of how to apply Granite to different language tasks.

The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.

```python
import torch
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="ibm-granite/granite-3.3-2b-base",
    torch_dtype=torch.bfloat16,
    device=0
)
pipe("Explain quantum computing in simple terms", max_new_tokens=50)
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-2b-base")
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-2b-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)

inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

```bash
echo -e "Explain quantum computing simply." | transformers-cli run --task text-generation --model ibm-granite/granite-3.3-8b-instruct --device 0
```
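As the badges above indicate, Granite also works with FlashAttention. The snippet below is a minimal sketch, assuming the `flash-attn` package is installed and the GPU supports it; otherwise keep `attn_implementation="sdpa"`.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumes flash-attn is installed and the GPU supports it;
# otherwise fall back to attn_implementation="sdpa".
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-2b-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
```

Generation then works exactly as in the SDPA example above.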
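The command line example above uses the instruct checkpoint. The sketch below shows chat-style generation with `apply_chat_template`, assuming the instruct checkpoint ships a chat template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: reuses the instruct checkpoint and prompt from the command line
# example; assumes the checkpoint provides a chat template.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-8b-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain quantum computing simply."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=50)
# Strip the prompt tokens before decoding the reply.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```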
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to int4.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-8b-base")
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
    quantization_config=quantization_config,
)

inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## GraniteConfig

[[autodoc]] GraniteConfig

## GraniteModel

[[autodoc]] GraniteModel
    - forward

## GraniteForCausalLM

[[autodoc]] GraniteForCausalLM
    - forward