# Phi4 Multimodal

## Overview

Phi4 Multimodal is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs and generates text outputs, and supports a 128K token context length. The model underwent an enhancement process, incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages supported by each modality are the following:

- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

This model was contributed by [Cyril Vallez](https://huggingface.co/cyrilvallez). The most recent code can be found [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py).

## Usage tips

`Phi4-multimodal-instruct` can be found on the [Hugging Face Hub](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

In the following, we demonstrate how to use it for inference depending on the input modalities (text, image, audio).

```python
import requests
import torch
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor
from urllib.request import urlopen

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
device = "cuda:0"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)

# Optional: load the adapters (note that without them, the base model will very likely not work well)
model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": 'speech-lora'})
model.load_adapter(model_path, adapter_name="vision", device_map=device, adapter_kwargs={"subfolder": 'vision-lora'})

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
model.set_adapter("vision")  # if loaded, activate the vision adapter
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to(device)

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
model.set_adapter("speech")  # if loaded, activate the speech adapter
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open audio file
audio, sample_rate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=audio, sample_rate=sample_rate, return_tensors='pt').to(device)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
```
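The model also accepts plain text prompts. The snippet below is a minimal text-only sketch that reuses the same model path and chat markers as the example above; it assumes the processor accepts a prompt with no image or audio input, does not activate either adapter, and uses an illustrative question.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "microsoft/Phi-4-multimodal-instruct"
device = "cuda:0"

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)

# Same chat markers as in the multimodal example above
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Text-only prompt (illustrative question); no <|image_1|> or <|audio_1|> placeholder is needed
prompt = f'{user_prompt}Give me a short introduction to large language models.{prompt_suffix}{assistant_prompt}'
inputs = processor(text=prompt, return_tensors='pt').to(device)

generate_ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)
```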
## Phi4MultimodalFeatureExtractor

[[autodoc]] Phi4MultimodalFeatureExtractor

## Phi4MultimodalImageProcessorFast

[[autodoc]] Phi4MultimodalImageProcessorFast

## Phi4MultimodalProcessor

[[autodoc]] Phi4MultimodalProcessor

## Phi4MultimodalAudioConfig

[[autodoc]] Phi4MultimodalAudioConfig

## Phi4MultimodalVisionConfig

[[autodoc]] Phi4MultimodalVisionConfig

## Phi4MultimodalConfig

[[autodoc]] Phi4MultimodalConfig

## Phi4MultimodalAudioModel

[[autodoc]] Phi4MultimodalAudioModel

## Phi4MultimodalVisionModel

[[autodoc]] Phi4MultimodalVisionModel

## Phi4MultimodalModel

[[autodoc]] Phi4MultimodalModel
    - forward

## Phi4MultimodalForCausalLM

[[autodoc]] Phi4MultimodalForCausalLM
    - forward