# Mamba

[Mamba](https://huggingface.co/papers/2312.00752) is a selective structured state space model (SSM) designed to work around Transformers' computational inefficiency on long sequences. It is a completely attention-free architecture, composed of a combination of H3 and gated MLP blocks (the Mamba block). Mamba's "content-based reasoning" allows it to focus on or ignore specific parts of an input depending on the current token. Mamba also uses a new hardware-aware parallel algorithm to compensate for the lack of convolutional operations. As a result, Mamba has fast inference and can scale to very long sequences.

You can find all the original Mamba checkpoints under the [State Space Models](https://huggingface.co/state-spaces) organization.

> [!TIP]
> Click on the Mamba models in the right sidebar for more examples of how to apply Mamba to different language tasks.

The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="text-generation",
    model="state-spaces/mamba-130m-hf",
    torch_dtype=torch.float16,
    device=0
)
pipeline("Plants create energy through a process known as")
```

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
model = AutoModelForCausalLM.from_pretrained(
    "state-spaces/mamba-130m-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")

output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

```bash
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model state-spaces/mamba-130m-hf --device 0
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [torchao](../quantization/torchao) to quantize only the weights to 4-bit integers.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

quant_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "state-spaces/mamba-2.8b-hf",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    device_map="auto",
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")

output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
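To build intuition for the selective state space recurrence described above, the sketch below implements it naively in PyTorch. This is a simplified reference only (simplified discretization, a plain Python loop, illustrative tensor shapes), not the fused hardware-aware kernel the library actually runs.

```py
import torch

def selective_scan(x, A, B, C, D, delta):
    """Naive reference for the selective SSM recurrence (not the fused kernel).

    x:     (batch, seq_len, d_model)  input sequence
    A:     (d_model, d_state)         state transition, input-independent
    B:     (batch, seq_len, d_state)  input projection, input-dependent ("selective")
    C:     (batch, seq_len, d_state)  output projection, input-dependent
    D:     (d_model,)                 skip connection
    delta: (batch, seq_len, d_model)  input-dependent step size
    """
    batch, seq_len, d_model = x.shape
    d_state = A.shape[-1]
    h = torch.zeros(batch, d_model, d_state, dtype=x.dtype, device=x.device)
    outputs = []
    for t in range(seq_len):
        # Discretize per token: A_bar = exp(delta * A), B_bar ≈ delta * B (simplified)
        dt = delta[:, t].unsqueeze(-1)                         # (batch, d_model, 1)
        A_bar = torch.exp(dt * A)                              # (batch, d_model, d_state)
        B_bar = dt * B[:, t].unsqueeze(1)                      # (batch, d_model, d_state)
        h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)          # recurrent state update
        y = (h * C[:, t].unsqueeze(1)).sum(-1) + D * x[:, t]   # project state to output
        outputs.append(y)
    return torch.stack(outputs, dim=1)                         # (batch, seq_len, d_model)
```

Because `delta`, `B`, and `C` are computed from the input itself, the state update can keep or discard information on a per-token basis; the library replaces the Python loop with a fused parallel scan.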
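Since the architecture is attention-free, each decoder block wraps a mixer module rather than an attention layer (see the Notes below). The snippet below prints one block of a loaded checkpoint to show this; the `backbone.layers[0].mixer` attribute path reflects the current `transformers` module layout and may differ across versions.

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")

# Each block in the backbone wraps a MambaMixer instead of an attention layer.
block = model.backbone.layers[0]
print(type(block.mixer).__name__)
print(block.mixer)
```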
## Notes

- The current implementation uses the original CUDA kernels. The Mamba equivalent of FlashAttention is hosted in the [mamba-ssm](https://github.com/state-spaces/mamba) and [causal_conv1d](https://github.com/Dao-AILab/causal-conv1d) repositories. Make sure to install them if your hardware supports them!
- Mamba stacks `mixer` layers which are the equivalent of `Attention` layers. You can find the main logic of Mamba in the `MambaMixer` class.
- The example below demonstrates how to fine-tune Mamba with [PEFT](https://huggingface.co/docs/peft).

  ```py
  from datasets import load_dataset
  from trl import SFTConfig, SFTTrainer
  from peft import LoraConfig

  model_id = "state-spaces/mamba-130m-hf"
  dataset = load_dataset("Abirate/english_quotes", split="train")
  training_args = SFTConfig(dataset_text_field="quote")
  lora_config = LoraConfig(target_modules=["x_proj", "embeddings", "in_proj", "out_proj"])
  trainer = SFTTrainer(
      model=model_id,
      args=training_args,
      train_dataset=dataset,
      peft_config=lora_config,
  )
  trainer.train()
  ```

## MambaConfig

[[autodoc]] MambaConfig

## MambaModel

[[autodoc]] MambaModel
    - forward

## MambaForCausalLM

[[autodoc]] MambaForCausalLM
    - forward