# Granite
[Granite](https://huggingface.co/papers/2408.13359) is a 3B parameter language model trained with the Power scheduler. Discovering a good learning rate for pretraining large language models is difficult because it depends on so many variables (batch size, number of training tokens, etc.) and it is expensive to perform a hyperparameter search. The Power scheduler is based on a power-law relationship between the variables and their transferability to larger models. Combining the Power scheduler with Maximum Update Parameterization (MUP) allows a model to be pretrained with one set of hyperparameters regardless of all the variables.
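As a rough intuition only (this is a generic power-law decay sketch, not the exact formula or constants from the paper), a learning rate that decays as a power of the training step looks like this:
```python
# Illustrative sketch of a generic power-law learning rate decay, the kind of
# relationship the Power scheduler builds on. The constants `a` and `b` are
# placeholders, not values from the paper.
def power_law_lr(step, a=0.01, b=0.5, min_lr=1e-5):
    return max(a * (step + 1) ** (-b), min_lr)

print([power_law_lr(s) for s in (0, 100, 10_000, 1_000_000)])
```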
You can find all the original Granite checkpoints under the [IBM-Granite](https://huggingface.co/ibm-granite) organization.
> [!TIP]
> Click on the Granite models in the right sidebar for more examples of how to apply Granite to different language tasks.
The examples below demonstrate how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
```python
import torch
from transformers import pipeline
pipe = pipeline(
    task="text-generation",
    model="ibm-granite/granite-3.3-2b-base",
    torch_dtype=torch.bfloat16,
    device=0
)
pipe("Explain quantum computing in simple terms ", max_new_tokens=50)
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-2b-base")
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-2b-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)
inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
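The base checkpoint completes raw text. If you are using an instruct checkpoint such as `ibm-granite/granite-3.3-8b-instruct` instead, format the prompt with the tokenizer's chat template. The sketch below assumes the checkpoint ships a chat template.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-8b-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build the prompt from a chat message using the tokenizer's chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```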
```bash
echo -e "Explain quantum computing simply." | transformers-cli run --task text-generation --model ibm-granite/granite-3.3-8b-instruct --device 0
```
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to int4.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-8b-base")
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.3-8b-base", torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="sdpa", quantization_config=quantization_config)
inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
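As a quick sanity check (not part of the original example), you can print the quantized model's memory footprint to confirm that the int4 weights fit on your GPU:
```python
# Reports the model's parameter memory usage in GB after 4-bit quantization
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")
```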
## GraniteConfig
[[autodoc]] GraniteConfig
## GraniteModel
[[autodoc]] GraniteModel
- forward
## GraniteForCausalLM
[[autodoc]] GraniteForCausalLM
- forward