PyTorch FlashAttention SDPA

# SmolLM3

SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a Transformer decoder architecture with Grouped Query Attention (GQA) to reduce the KV cache, and NoPE (rotary position embeddings removed in a subset of layers) to improve performance on long-context tasks. It is trained with a multi-stage approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very long context lengths. The instruct variant is optimized for reasoning and tool use.
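
To see the GQA setup concretely, you can inspect the model configuration. The snippet below is a minimal sketch that assumes the `HuggingFaceTB/SmolLM3-3B` checkpoint used in the examples later on this page; the attribute names follow the common decoder-style configs and may differ slightly between releases.

```py
from transformers import AutoConfig

# Only the small config file is downloaded, no weights.
config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# With GQA, several query heads share each key/value head,
# so num_key_value_heads is smaller than num_attention_heads.
print("query heads:    ", config.num_attention_heads)
print("key/value heads:", config.num_key_value_heads)
print("context length: ", config.max_position_embeddings)
```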

> [!TIP]
> Click on the SmolLM3 models in the right sidebar for more examples of how to apply SmolLM3 to different language tasks.

The examples below demonstrate how to generate text with [`Pipeline`], [`AutoModel`], and from the command line using the instruction-tuned model.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,
    device_map=0
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about yourself."},
]
outputs = pipe(messages, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"][-1]["content"])
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")

generated_ids = model.generate(
    model_inputs.input_ids,
    cache_implementation="static",
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

</hfoption>
<hfoption id="transformers CLI">

```bash
# pip install -U flash-attn --no-build-isolation
transformers chat HuggingFaceTB/SmolLM3-3B --torch_dtype auto --attn_implementation flash_attention_2 --device 0
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.

The example below uses bitsandbytes to quantize the weights to 4 bits.

```py
# pip install -U flash-attn --no-build-isolation
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2"
)

inputs = tokenizer("Gravity is the force", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
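
As a quick sanity check on the savings, you can print the loaded model's memory footprint. This continues directly from the snippet above and uses `get_memory_footprint`, which reports the size of the model's parameters and buffers in bytes.

```py
# Continues from the quantized model loaded above.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```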

## Notes

- Ensure your Transformers library is up to date. SmolLM3 requires Transformers>=4.53.0 for full support; a quick version check is shown below.
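
A minimal way to confirm the installed version before loading the model (only the standard library is used here):

```py
from importlib.metadata import version

# SmolLM3 support requires Transformers 4.53.0 or newer.
print(version("transformers"))
```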

## SmolLM3Config

[[autodoc]] SmolLM3Config

## SmolLM3Model

[[autodoc]] SmolLM3Model
    - forward

## SmolLM3ForCausalLM

[[autodoc]] SmolLM3ForCausalLM
    - forward

## SmolLM3ForSequenceClassification

[[autodoc]] SmolLM3ForSequenceClassification
    - forward

## SmolLM3ForTokenClassification

[[autodoc]] SmolLM3ForTokenClassification
    - forward

## SmolLM3ForQuestionAnswering

[[autodoc]] SmolLM3ForQuestionAnswering
    - forward