# GraniteMoeHybrid

## Overview

The `GraniteMoeHybrid` model builds on top of `GraniteMoeSharedModel` and `Bamba`. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-4.0-tiny-preview"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

# change input text as desired
prompt = "Write a code to find the maximum value in a list of numbers."

# tokenize the text and move it to the model's device
input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)

# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)

# decode output tokens into text
output = tokenizer.batch_decode(output)

# loop over the batch to print, in this example the batch size is 1
for i in output:
    print(i)
```

This HF implementation was contributed by [Sukriti Sharma](https://huggingface.co/SukritiSharma) and [Alexander Brooks](https://huggingface.co/abrooks9944).

## GraniteMoeHybridConfig

[[autodoc]] GraniteMoeHybridConfig

## GraniteMoeHybridModel

[[autodoc]] GraniteMoeHybridModel
    - forward

## GraniteMoeHybridForCausalLM

[[autodoc]] GraniteMoeHybridForCausalLM
    - forward
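
To see how a particular checkpoint mixes state space and attention layers, or how positional encoding is set up, you can inspect its `GraniteMoeHybridConfig` without downloading the model weights. The snippet below is a minimal sketch; it only assumes the checkpoint path used in the example above, and since the exact names of the per-layer fields may vary between config versions, it simply prints the full config so you can discover them.

```python
from transformers import AutoConfig

model_path = "ibm-granite/granite-4.0-tiny-preview"

# load only the configuration for this checkpoint (no weights are downloaded)
config = AutoConfig.from_pretrained(model_path)

# the resolved class name should be GraniteMoeHybridConfig for hybrid checkpoints
print(type(config).__name__)

# printing the config lists every architecture field, including the per-layer
# choice of block type and the positional-encoding related settings
print(config)
```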