<!--Copyright 2025 The ZhipuAI Inc. and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# GLM-4.1V

The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="THUDM/GLM-4.1V-9B-Thinking",
    device=0,
    torch_dtype=torch.bfloat16
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ]
    }
]
pipe(text=messages, max_new_tokens=20, return_full_text=False)
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import Glm4vForConditionalGeneration, AutoProcessor

model = Glm4vForConditionalGeneration.from_pretrained(
    "THUDM/GLM-4.1V-9B-Thinking",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
            },
            {
                "type": "text",
                "text": "Describe this image."
            }
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

</hfoption>
</hfoptions>

Using GLM-4.1V with video input is similar to using it with image input. The model can process video data and generate text based on the content of the video.

```python
import torch
from transformers import AutoProcessor, Glm4vForConditionalGeneration

processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
model = Glm4vForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path="THUDM/GLM-4.1V-9B-Thinking",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4",
            },
            {
                "type": "text",
                "text": "Describe this video.",
            },
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True
).to("cuda:0")
generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=1.0)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1] :], skip_special_tokens=True)
print(output_text)
```

## Glm4vConfig

[[autodoc]] Glm4vConfig
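
As a quick illustration, a minimal sketch of the usual configuration workflow; the default `Glm4vConfig` values are assumed here and do not necessarily match the released checkpoint:

```py
from transformers import Glm4vConfig, Glm4vForConditionalGeneration

# Initialize a configuration with default values (assumed defaults; override fields as needed)
configuration = Glm4vConfig()

# Build a randomly initialized model from that configuration
model = Glm4vForConditionalGeneration(configuration)

# The live configuration is available on the model
print(model.config)
```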

## Glm4vTextConfig

[[autodoc]] Glm4vTextConfig

## Glm4vImageProcessor

[[autodoc]] Glm4vImageProcessor
    - preprocess
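
As a rough usage sketch, the image processor can be loaded through [`AutoImageProcessor`] and called on a PIL image; the exact output keys and shapes depend on the processor and are an assumption here:

```py
import requests
from PIL import Image
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Resize, rescale, normalize, and patchify the image into model-ready tensors
inputs = image_processor(images=image, return_tensors="pt")
print({k: v.shape for k, v in inputs.items()})
```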

## Glm4vVideoProcessor

[[autodoc]] Glm4vVideoProcessor
    - preprocess
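
Similarly, a small sketch of calling the video processor directly on already-decoded frames; the dummy clip, frame layout, and the `AutoVideoProcessor` entry point are assumptions for illustration:

```py
import numpy as np
from transformers import AutoVideoProcessor

video_processor = AutoVideoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")

# A dummy 8-frame RGB clip (frames, height, width, channels) standing in for real decoded video
video = np.random.randint(0, 256, size=(8, 224, 224, 3), dtype=np.uint8)

inputs = video_processor(videos=[video], return_tensors="pt")
print({k: v.shape for k, v in inputs.items()})
```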

## Glm4vImageProcessorFast

[[autodoc]] Glm4vImageProcessorFast
    - preprocess

## Glm4vProcessor

[[autodoc]] Glm4vProcessor
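
The processor wraps the image processor, video processor, and tokenizer behind one interface. A minimal sketch, assuming the common two-step pattern of rendering the chat template to a prompt string and then tokenizing it together with the image:

```py
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}
]

# Render the chat template to a prompt string containing an image placeholder...
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# ...then tokenize the prompt and preprocess the image in one call
inputs = processor(text=prompt, images=image, return_tensors="pt")
print(inputs.keys())
```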

## Glm4vTextModel

[[autodoc]] Glm4vTextModel
    - forward

## Glm4vModel

[[autodoc]] Glm4vModel
    - forward

## Glm4vForConditionalGeneration

[[autodoc]] Glm4vForConditionalGeneration
    - forward
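
Where `generate` is shown above, `forward` can also be called directly to inspect logits. A minimal sketch, reusing the `model` and `inputs` objects from the image example earlier on this page:

```py
import torch

# Direct forward pass: returns logits instead of sampled tokens
with torch.no_grad():
    outputs = model(**inputs)

# Distribution over the vocabulary for the next token, shape (batch_size, vocab_size)
next_token_logits = outputs.logits[:, -1, :]
print(next_token_logits.argmax(dim=-1))
```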