
# MLCD

## Overview
The MLCD models were released by the DeepGlint-AI team in the unicom repository, which focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used as vision towers for multimodal large language models, such as LLaVA.

The 🔥MLCD-ViT-bigG🔥 series is a state-of-the-art vision transformer enhanced with 2D Rotary Position Embedding (RoPE2D), achieving superior performance on document understanding and visual question answering tasks. Developed by DeepGlint AI, this model demonstrates exceptional capabilities in processing complex visual-language interactions.
Tips:

- We adopted the official LLaVA-NeXT and the official training dataset LLaVA-NeXT-Data for evaluating the foundational visual models.
- The language model is Qwen2.5-7B.
Results:
| Vision Tower              | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU  |
|---------------------------|:------:|:-------:|:------:|:-------:|:--------:|:-----:|
| CLIP (ViT-L-14-336px)     |   ×    |  66.52  | 75.21  |  38.88  |  525.00  | 44.20 |
| SigLIP (ViT-SO400M-384px) |   ×    |  69.28  | 76.71  |  41.38  |  554.00  | 46.78 |
| DFN5B (ViT-H-14-378px)    |   ×    |  64.36  | 70.87  |  38.59  |  473.00  | 48.00 |
| MLCD (ViT-L-14-336px)     |   ×    |  67.84  | 76.46  |  43.48  |  531.00  | 44.30 |
| MLCD (ViT-bigG-14-336px)  |   √    |  71.07  | 79.63  |  44.38  |  572.00  | 46.78 |
| MLCD (ViT-bigG-14-448px)  |   √    |  73.80  | 83.34  |  46.59  |  582.00  | 46.00 |
## Usage
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MLCDVisionModel

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")

# Process a single image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Run the forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Get visual features
features = outputs.last_hidden_state
print(f"Extracted features shape: {features.shape}")
```
## MLCDVisionConfig

[[autodoc]] MLCDVisionConfig
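As with other vision backbones in the library, the configuration class can be used to instantiate a randomly initialized model. A minimal sketch, using default hyperparameters rather than those of the released checkpoint:

```python
from transformers import MLCDVisionConfig, MLCDVisionModel

# Initialize a configuration with default hyperparameters (these differ from the
# pretrained mlcd-vit-bigG-patch14-448 checkpoint unless overridden explicitly).
configuration = MLCDVisionConfig()

# Build a model with randomly initialized weights from that configuration
model = MLCDVisionModel(configuration)

# Access the configuration of the instantiated model
configuration = model.config
```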
## MLCDVisionModel

[[autodoc]] MLCDVisionModel
    - forward