Microsoft's 5.6B-parameter open multimodal foundation model that jointly processes text, vision, and audio in a single neural network; its speech recognition (ASR) performance ranked #1 on the Hugging Face Open ASR Leaderboard at launch.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Phi-4-multimodal-instruct is a lightweight open multimodal model from Microsoft that unifies text, image, and audio inputs through a single transformer backbone with mixture-of-LoRAs modality adapters (dedicated speech-LoRA and vision-LoRA). It builds on research and data from Phi-3.5 and Phi-4.0 and supports a 128K token context length.
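A minimal inference sketch of that unified backbone, assuming the Hugging Face `transformers` API and the `microsoft/Phi-4-multimodal-instruct` checkpoint; the chat tags (`<|user|>`, `<|image_1|>`, `<|end|>`, `<|assistant|>`) follow the format documented on the model card, the image URL is a placeholder, and exact processor arguments can vary by `transformers` version.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"

# The model's custom code ships with the checkpoint, so trust_remote_code is required.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# One prompt mixing text and an image; the <|image_1|> tag marks where the
# image embeddings are spliced into the token stream.
prompt = "<|user|><|image_1|>Describe this image in one sentence.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)  # placeholder URL

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the echoed prompt.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```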
Transformer architecture (Hugging Face class Phi4MMForCausalLM) with 5.6B parameters. Trained on ~512 A100-80G GPUs for 28 days over ~5T text tokens, ~2.3M hours of speech, and ~1.1T image-text tokens. Data cutoff: June 2024.
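The architecture class and context window can be read straight from the checkpoint's config; a quick sketch, assuming the field names follow standard `transformers` config conventions:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)

print(config.architectures)            # expected: ['Phi4MMForCausalLM']
print(config.max_position_embeddings)  # expected: 131072 (the 128K-token context)
```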
On-device multimodal assistants, transcription + reasoning pipelines, voice-controlled apps, document + audio understanding, accessibility, and edge/IoT deployments.
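For the transcription + reasoning use case, a minimal sketch: the `<|audio_1|>` tag and the `audios=[(samples, sampling_rate)]` argument follow the model card's audio convention, while the instruction wording and the local file name `meeting_clip.wav` are illustrative.

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Read a local clip; the processor takes audio as (samples, sampling_rate) pairs.
audio, sample_rate = sf.read("meeting_clip.wav")  # hypothetical local file

prompt = (
    "<|user|><|audio_1|>Transcribe the audio, then summarize it in one sentence."
    "<|end|><|assistant|>"
)
inputs = processor(
    text=prompt, audios=[(audio, sample_rate)], return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```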