Microsoft's 5.6B-parameter open multimodal foundation model that jointly processes text, vision, and audio in a single neural network; its speech recognition (ASR) performance ranked #1 on the Hugging Face Open ASR Leaderboard at launch.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Phi-4-multimodal-instruct is a lightweight open multimodal model from Microsoft that unifies text, image, and audio inputs through a single transformer backbone with mixture-of-LoRAs modality adapters (dedicated speech-LoRA and vision-LoRA). It builds on research and data from Phi-3.5 and Phi-4.0 and supports a 128K token context length.
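A minimal inference sketch of that unified backbone, assuming the Hugging Face `transformers` API and the `microsoft/Phi-4-multimodal-instruct` checkpoint; the chat tags (`<|user|>`, `<|image_1|>`, `<|end|>`, `<|assistant|>`) follow the format documented on the model card, the image URL is a placeholder, and exact processor arguments can vary by `transformers` version.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"

# The model's custom code ships with the checkpoint, so trust_remote_code is required.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# One prompt mixing text and an image; the <|image_1|> tag marks where the
# image embeddings are spliced into the token stream.
prompt = "<|user|><|image_1|>Describe this image in one sentence.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)  # placeholder URL

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the echoed prompt.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```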
Transformer architecture (Hugging Face class Phi4MMForCausalLM) with 5.6B parameters. Trained on ~512 A100-80G GPUs for 28 days over ~5T text tokens, ~2.3M hours of speech, and ~1.1T image-text tokens. Data cutoff: June 2024.
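The architecture class and context window can be read straight from the checkpoint's config; a quick sketch, assuming the field names follow standard `transformers` config conventions:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)

print(config.architectures)            # expected: ['Phi4MMForCausalLM']
print(config.max_position_embeddings)  # expected: 131072 (the 128K-token context)
```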
On-device multimodal assistants, transcription + reasoning pipelines, voice-controlled apps, document + audio understanding, accessibility, and edge/IoT deployments.
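For the transcription + reasoning use case, a minimal sketch: the `<|audio_1|>` tag and the `audios=[(samples, sampling_rate)]` argument follow the model card's audio convention, while the instruction wording and the local file name `meeting_clip.wav` are illustrative.

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Read a local clip; the processor takes audio as (samples, sampling_rate) pairs.
audio, sample_rate = sf.read("meeting_clip.wav")  # hypothetical local file

prompt = (
    "<|user|><|audio_1|>Transcribe the audio, then summarize it in one sentence."
    "<|end|><|assistant|>"
)
inputs = processor(
    text=prompt, audios=[(audio, sample_rate)], return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```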