diff --git a/docs/source/en/model_doc/bamba.md b/docs/source/en/model_doc/bamba.md
index b776a5732a0..81f8f79a58a 100644
--- a/docs/source/en/model_doc/bamba.md
+++ b/docs/source/en/model_doc/bamba.md
@@ -14,84 +14,127 @@ rendered properly in your Markdown viewer.
-->
-# Bamba
-
-
-

-

-

+
-## Overview
+# Bamba
-Bamba-9B is a decoder-only language model based on the [Mamba-2](https://github.com/state-spaces/mamba) architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
+[Bamba](https://huggingface.co/blog/bamba) is a 9B parameter decoder-only language model built on the [Mamba-2](./mamba2) architecture. It is pretrained in two stages: first on 2T tokens from the [Dolma v1.7](https://huggingface.co/datasets/allenai/dolma) dataset, then on an additional 200B tokens from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia).
-Checkout all Bamba-9B model checkpoints [here](https://github.com/foundation-model-stack/bamba).
+You can find all the original Bamba checkpoints under the [Bamba](https://huggingface.co/collections/ibm-ai-platform/bamba-674f1388b9bbc98b413c7bab) collection.
+
+> [!TIP]
+> This model was contributed by [ani300](https://github.com/ani300) and [fabianlim](https://github.com/fabianlim).
+>
+> Click on the Bamba models in the right sidebar for more examples of how to apply Bamba to different text generation tasks.
+
+The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```python
+import torch
+from transformers import pipeline
+
+pipeline = pipeline(
+    task="text-generation",
+    model="ibm-ai-platform/Bamba-9B-v2",
+    torch_dtype=torch.bfloat16,
+    device=0
+)
+pipeline("Plants create energy through a process known as")
+```
+
+</hfoption>
+<hfoption id="AutoModel">
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
+model = AutoModelForCausalLM.from_pretrained("ibm-ai-platform/Bamba-9B-v2", torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="sdpa")
+input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
+
+output = model.generate(**input_ids)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+</hfoption>
+<hfoption id="transformers CLI">
+
+```bash
+echo "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ibm-ai-platform/Bamba-9B-v2 --device 0
+```
+
+</hfoption>
+</hfoptions>
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to quantize only the weights to int4.
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
+model = AutoModelForCausalLM.from_pretrained(
+ "ibm-ai-platform/Bamba-9B-v2",
+ quantization_config=quantization_config,
+ device_map="auto",
+ attn_implementation="sdpa"
+)
+
+inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
+output = model.generate(**inputs)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
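+
+As a quick optional check of the memory savings, you can print the quantized model's footprint with [`~PreTrainedModel.get_memory_footprint`].
+
+```python
+# approximate memory occupied by the quantized weights and buffers, in GB
+print(f"{model.get_memory_footprint() / 1e9:.2f} GB")
+```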
+
+## Notes
+
+- Bamba supports padding-free training, which concatenates distinct training examples while still processing inputs as separate batches. It can speed up training throughput by [~2x](https://github.com/huggingface/transformers/pull/35861#issue-2807873129) (depending on the model and data distribution) and reduce memory usage when examples have varying lengths, because the unnecessary compute and memory overhead from padding tokens is avoided entirely.
+
+ Padding-free training requires the `flash-attn`, `mamba-ssm`, and `causal-conv1d` packages, and the following arguments must be passed to the model in addition to `input_ids` and `labels`.
+
+ - `position_ids: torch.LongTensor`: the position index of each token in each sequence.
+ - `seq_idx: torch.IntTensor`: the index of each sequence in the batch.
+ - Each of the [`FlashAttentionKwargs`]:
+     - `cu_seq_lens_q: torch.LongTensor`: the cumulative sequence lengths of all queries.
+     - `cu_seq_lens_k: torch.LongTensor`: the cumulative sequence lengths of all keys.
+     - `max_length_q: int`: the longest query length in the batch.
+     - `max_length_k: int`: the longest key length in the batch.
+
+ The `attention_mask` inputs should not be provided. The [`DataCollatorWithFlattening`] programmatically generates the set of additional arguments above using `return_seq_idx=True` and `return_flash_attn_kwargs=True`. See the [Improving Hugging Face Training Efficiency Through Packing with Flash Attention](https://huggingface.co/blog/packing-with-FA2) blog post for additional information.
+
+ ```python
+ from transformers import DataCollatorWithFlattening
+
+ # the collator flattens the examples and generates position_ids, seq_idx, and the FlashAttention kwargs
+ data_collator = DataCollatorWithFlattening(
+     return_seq_idx=True,
+     return_flash_attn_kwargs=True
+ )
+ ```
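+
+ For reference, the sketch below shows what the collator produces from two tokenized examples of different lengths (the example sentences are arbitrary placeholders).
+
+ ```python
+ from transformers import AutoTokenizer, DataCollatorWithFlattening
+
+ tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
+ data_collator = DataCollatorWithFlattening(return_seq_idx=True, return_flash_attn_kwargs=True)
+
+ # two independent examples, tokenized without padding
+ examples = [
+     {"input_ids": tokenizer("Plants create energy through photosynthesis.")["input_ids"]},
+     {"input_ids": tokenizer("Bamba pairs Mamba-2 layers with attention layers.")["input_ids"]},
+ ]
+
+ # both sequences are concatenated into a single row; expect input_ids, position_ids,
+ # labels, seq_idx, and the cu_seq_lens_*/max_length_* keys described above
+ batch = data_collator(examples)
+ print(list(batch.keys()))
+ ```
+
+ During training, the collator is typically passed to [`Trainer`] through its `data_collator` argument instead of being called manually.
+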
## BambaConfig
-| Model | Params | # Layers | Hidden Dim. | Attention Heads | GQA | KV Heads | Context Length | Tied Embeddings |
-|-------------------|--------------|----------|-------------|-----------------|-----|----------|----------------|------------------|
-| Bamba | 9B (9.78B) | 32 | 4096 | 32 | Yes | 8 | 4096 | True |
-
[[autodoc]] BambaConfig
-
## BambaForCausalLM
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B")
-tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")
-
-message = ["Mamba is a snake with following properties "]
-inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
-response = model.generate(**inputs, max_new_tokens=64)
-print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
-```
-
-
-## Padding-Free Training
-
-Bamba supports padding-free training in which distinct training examples can be concatenated
-together while nevertheless processing the inputs as though they belonged to separate batches. When
-the examples are of varying lengths, padding-free training can provide significant speed ups and
-memory savings compared to batching the examples together and using padding, as the unnecessary
-compute and memory due to padding is avoided entirely. The performance gains depend on factors such
-as the model and the data distribution, but throughput gains up to [~2x are commonly
-seen](https://github.com/huggingface/transformers/pull/35861#issue-2807873129).
-
-Using padding-free training with Bamba requires the `flash-attn`, `mamba-ssm`, and `causal-conv1d`
-packages, and the following arguments must be passed to the model in addition to `input_ids` and
-`labels`:
-* `position_ids: torch.LongTensor`: the position index of each token in each sequence.
-* `seq_idx: torch.IntTensor`: the index of each sequence in the batch.
-* Each of the [`FlashAttentionKwargs`]
- * `cu_seq_lens_q: torch.LongTensor`: The cumulative sequence lengths of all queries.
- * `cu_seq_lens_k: torch.LongTensor`: The cumulative sequence lengths of all keys.
- * `max_length_q: int`: the longest query length in the batch.
- * `max_length_k: int`: the longest key length in the batch.
-
-The `attention_mask` inputs should not be provided. The [`DataCollatorWithFlattening`] can be used
-to programmatically generate the above set of additional arguments using `return_seq_idx=True` and
-`return_flash_attn_kwargs=True`. See [this blog post](https://huggingface.co/blog/packing-with-FA2)
-for additional information.
-
-
[[autodoc]] BambaForCausalLM
- forward
-
-This HF implementation is contributed by [ani300](https://github.com/ani300) and [fabianlim](https://github.com/fabianlim).