diff --git a/docs/source/en/model_doc/bamba.md b/docs/source/en/model_doc/bamba.md
index b776a5732a0..81f8f79a58a 100644
--- a/docs/source/en/model_doc/bamba.md
+++ b/docs/source/en/model_doc/bamba.md
@@ -14,84 +14,127 @@ rendered properly in your Markdown viewer.
 -->
-# Bamba
-
-PyTorch
-FlashAttention
-SDPA
+
+    PyTorch
+    FlashAttention
+    SDPA
+
-## Overview
+# Bamba
 
-Bamba-9B is a decoder-only language model based on the [Mamba-2](https://github.com/state-spaces/mamba) architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
+[Bamba](https://huggingface.co/blog/bamba) is a 9B parameter decoder-only language model built on the [Mamba-2](./mamba2) architecture. It is pretrained in two stages: first on 2T tokens from the [Dolma v1.7](https://huggingface.co/datasets/allenai/dolma) dataset, and then on an additional 200B tokens from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia).
 
-Checkout all Bamba-9B model checkpoints [here](https://github.com/foundation-model-stack/bamba).
+You can find all the original Bamba checkpoints under the [Bamba](https://huggingface.co/collections/ibm-ai-platform/bamba-674f1388b9bbc98b413c7bab) collection.
+
+> [!TIP]
+> This model was contributed by [ani300](https://github.com/ani300) and [fabianlim](https://github.com/fabianlim).
+>
+> Click on the Bamba models in the right sidebar for more examples of how to apply Bamba to different text generation tasks.
+
+The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
+
+```python
+import torch
+from transformers import pipeline
+
+pipeline = pipeline(
+    task="text-generation",
+    model="ibm-ai-platform/Bamba-9B-v2",
+    torch_dtype=torch.bfloat16,
+    device=0
+)
+pipeline("Plants create energy through a process known as")
+```
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
+model = AutoModelForCausalLM.from_pretrained("ibm-ai-platform/Bamba-9B-v2", torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="sdpa")
+input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
+
+output = model.generate(**input_ids)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+```bash
+echo "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ibm-ai-platform/Bamba-9B-v2 --device 0
+```
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to quantize only the weights to int4.
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
+model = AutoModelForCausalLM.from_pretrained(
+    "ibm-ai-platform/Bamba-9B-v2",
+    torch_dtype=torch.bfloat16,
+    quantization_config=quantization_config,
+    device_map="auto",
+    attn_implementation="sdpa"
+)
+
+inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
+output = model.generate(**inputs)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+## Notes
+
+- Bamba supports padding-free training which concatenates distinct training examples while still processing inputs as separate batches. It can significantly improve training throughput by [~2x](https://github.com/huggingface/transformers/pull/35861#issue-2807873129) (depending on the model and data distribution) and reduce memory usage when examples have varying lengths, since the unnecessary compute and memory overhead from padding tokens is avoided entirely.
+
+  Padding-free training requires the `flash-attn`, `mamba-ssm`, and `causal-conv1d` packages, and the following arguments must be passed to the model in addition to `input_ids` and `labels`.
+
+  - `position_ids: torch.LongTensor`: the position index of each token in each sequence.
+  - `seq_idx: torch.IntTensor`: the index of each sequence in the batch.
+  - Each of the [`FlashAttentionKwargs`]
+    - `cu_seq_lens_q: torch.LongTensor`: the cumulative sequence lengths of all queries.
+    - `cu_seq_lens_k: torch.LongTensor`: the cumulative sequence lengths of all keys.
+    - `max_length_q: int`: the longest query length in the batch.
+    - `max_length_k: int`: the longest key length in the batch.
+
+  The `attention_mask` inputs should not be provided. The [`DataCollatorWithFlattening`] programmatically generates the set of additional arguments above using `return_seq_idx=True` and `return_flash_attn_kwargs=True`. See the [Improving Hugging Face Training Efficiency Through Packing with Flash Attention](https://huggingface.co/blog/packing-with-FA2) blog post for additional information.
+
+  ```python
+  from transformers import DataCollatorWithFlattening
+
+  # Example of using padding-free training.
+  # The collator expects already tokenized examples, so no tokenizer is passed.
+  data_collator = DataCollatorWithFlattening(
+      return_seq_idx=True,
+      return_flash_attn_kwargs=True
+  )
+  ```
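+
+  The snippet below is a minimal sketch of what the collator produces for a batch of already tokenized examples. The checkpoint and the two sentences are illustrative only.
+
+  ```python
+  from transformers import AutoTokenizer, DataCollatorWithFlattening
+
+  # Illustrative checkpoint; use the tokenizer that matches your model.
+  tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
+
+  # Two examples of different lengths, tokenized without padding.
+  features = [
+      {"input_ids": tokenizer("Plants create energy through photosynthesis.")["input_ids"]},
+      {"input_ids": tokenizer("Mamba layers scale linearly with sequence length.")["input_ids"]},
+  ]
+
+  data_collator = DataCollatorWithFlattening(
+      return_seq_idx=True,
+      return_flash_attn_kwargs=True
+  )
+  batch = data_collator(features)
+
+  # The flattened batch carries the extra arguments listed above instead of an
+  # attention_mask, e.g. position_ids, seq_idx, cu_seq_lens_q/k and max_length_q/k.
+  print(batch.keys())
+  ```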
 
 ## BambaConfig
 
-| Model             | Params       | # Layers | Hidden Dim. | Attention Heads | GQA | KV Heads | Context Length | Tied Embeddings  |
-|-------------------|--------------|----------|-------------|-----------------|-----|----------|----------------|------------------|
-| Bamba             | 9B (9.78B)   | 32       | 4096        | 32              | Yes | 8        | 4096           | True             |
-
 [[autodoc]] BambaConfig
 
-
 ## BambaForCausalLM
 
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B")
-tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")
-
-message = ["Mamba is a snake with following properties "]
-inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
-response = model.generate(**inputs, max_new_tokens=64)
-print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
-```
-
-
-## Padding-Free Training
-
-Bamba supports padding-free training in which distinct training examples can be concatenated
-together while nevertheless processing the inputs as though they belonged to separate batches. When
-the examples are of varying lengths, padding-free training can provide significant speed ups and
-memory savings compared to batching the examples together and using padding, as the unnecessary
-compute and memory due to padding is avoided entirely. The performance gains depend on factors such
-as the model and the data distribution, but throughput gains up to [~2x are commonly
-seen](https://github.com/huggingface/transformers/pull/35861#issue-2807873129).
-
-Using padding-free training with Bamba requires the `flash-attn`, `mamba-ssm`, and `causal-conv1d`
-packages, and the following arguments must be passed to the model in addition to `input_ids` and
-`labels`:
-* `position_ids: torch.LongTensor`: the position index of each token in each sequence.
-* `seq_idx: torch.IntTensor`: the index of each sequence in the batch.
-* Each of the [`FlashAttentionKwargs`]
-    * `cu_seq_lens_q: torch.LongTensor`: The cumulative sequence lengths of all queries.
-    * `cu_seq_lens_k: torch.LongTensor`: The cumulative sequence lengths of all keys.
-    * `max_length_q: int`: the longest query length in the batch.
-    * `max_length_k: int`: the longest key length in the batch.
-
-The `attention_mask` inputs should not be provided. The [`DataCollatorWithFlattening`] can be used
-to programmatically generate the above set of additional arguments using `return_seq_idx=True` and
-`return_flash_attn_kwargs=True`. See [this blog post](https://huggingface.co/blog/packing-with-FA2)
-for additional information.
-
 [[autodoc]] BambaForCausalLM
     - forward
 
-This HF implementation is contributed by [ani300](https://github.com/ani300) and [fabianlim](https://github.com/fabianlim).