Microsoft's 5.6B-parameter open multimodal foundation model that jointly processes text, vision, and audio in a single neural network, with strong ASR performance that ranked #1 on the Hugging Face Open ASR Leaderboard at launch.
A solid 5.6B-parameter dense multimodal model from Microsoft. Treat the per-modality benchmarks above as the leading indicator of fit; composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Phi-4-multimodal-instruct is a lightweight open multimodal model from Microsoft that unifies text, image, and audio inputs through a single transformer backbone with mixture-of-LoRAs modality adapters (dedicated speech-LoRA and vision-LoRA). It builds on research and data from Phi-3.5 and Phi-4.0 and supports a 128K token context length.
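Below is a minimal sketch of loading the model with Hugging Face transformers and running a vision + text prompt. The model id, the `<|user|>`/`<|image_1|>`/`<|end|>`/`<|assistant|>` prompt tags, and the `trust_remote_code` requirement follow the official model card; the image URL is a placeholder, and exact arguments should be verified against the card for your transformers version.

```python
# Sketch: load Phi-4-multimodal-instruct and run an image + text prompt.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"

# The repo ships custom modeling code (Phi4MMForCausalLM), so trust_remote_code is required.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    _attn_implementation="eager",  # the card recommends flash_attention_2 on supported GPUs
)

# Chat-style prompt; the <|image_1|> placeholder routes the image through the vision-LoRA adapter.
prompt = "<|user|><|image_1|>Describe this image.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)  # placeholder URL

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```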
Decoder-only transformer architecture (Phi4MMForCausalLM) with 5.6B parameters. Trained on ~512 A100-80GB GPUs for 28 days over ~5T text tokens, ~2.3M hours of speech, and ~1.1T image-text tokens. Data cutoff: June 2024.
On-device multimodal assistants, transcription + reasoning pipelines, voice-controlled apps, document + audio understanding, accessibility, and edge/IoT deployments.
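For the transcription + reasoning case, here is a minimal sketch of a speech prompt, assuming the same `processor` and `model` objects as above. The `<|audio_1|>` tag and the `audios=[(samples, sampling_rate)]` argument follow the model card's speech example; the file path is a placeholder.

```python
# Sketch: transcribe an audio clip and reason over it in one call.
import soundfile as sf

# Load a local WAV file as (samples, sampling_rate); the path is a placeholder.
audio, sampling_rate = sf.read("meeting_clip.wav")

prompt = (
    "<|user|><|audio_1|>"
    "Transcribe the audio, then summarize the key action items.<|end|><|assistant|>"
)

inputs = processor(
    text=prompt, audios=[(audio, sampling_rate)], return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)

print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```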