Mistral AI's compact 3B-parameter open-weights audio-language model built on Ministral-3B with a Whisper-derived encoder, designed for transcription, audio Q&A, summarization, and function-calling from voice across 8+ languages.
A solid 3B-parameter dense audio model from Mistral AI. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Architecture: Multimodal audio-language model combining a Whisper-style audio encoder with the Ministral-3B decoder LLM through a multi-modal projector (multi_modal_projector.linear_1 / linear_2). Implemented in Hugging Face Transformers as VoxtralForConditionalGeneration (≥ 4.54.0).
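The projector named above (multi_modal_projector.linear_1 / linear_2) can be pictured as two stacked linear maps that lift audio-encoder features into the decoder's embedding space. A minimal NumPy sketch, with illustrative dimensions only (the hidden sizes, frame count, and ReLU nonlinearity here are assumptions, not the model's actual configuration):

```python
import numpy as np

# Illustrative dimensions only -- not the model's actual sizes.
ENC_DIM = 1280      # Whisper-style encoder hidden size (assumption)
LLM_DIM = 3072      # Ministral-3B decoder hidden size (assumption)

rng = np.random.default_rng(0)

# Stand-ins for multi_modal_projector.linear_1 / linear_2: two linear
# layers mapping encoder features into the decoder's embedding space.
W1 = rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.02
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

def project(audio_features: np.ndarray) -> np.ndarray:
    """Map (num_audio_tokens, ENC_DIM) encoder output to LLM embeddings."""
    h = np.maximum(audio_features @ W1, 0.0)  # ReLU here for brevity; the real nonlinearity may differ
    return h @ W2

frames = rng.standard_normal((375, ENC_DIM))  # a short audio clip's worth of frames (count is illustrative)
embeds = project(frames)
print(embeds.shape)  # (375, 3072)
```

The projected embeddings are then consumed by the decoder exactly like text-token embeddings, which is what lets one decoder handle both modalities.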
Modality: Audio + text in, text out. 32k-token context window; handles audio up to ~30 minutes for transcription and ~40 minutes for understanding.
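A quick sanity check on how those audio lengths relate to the 32k context. The tokens-per-second rate below is an illustrative assumption (not stated in this card), chosen only to show that both durations plausibly fit the window:

```python
# Assumed audio-token rate for illustration; the model's real rate may differ.
TOKENS_PER_SEC = 12.5
CONTEXT = 32_000

for label, minutes in [("transcription", 30), ("understanding", 40)]:
    audio_tokens = int(minutes * 60 * TOKENS_PER_SEC)
    # Remaining budget is what's left for the prompt and generated text.
    print(f"{label}: {audio_tokens} audio tokens, {CONTEXT - audio_tokens} tokens left")
```

Under this assumed rate, 40 minutes of audio consumes most but not all of the window, which is consistent with understanding tasks (short answers) tolerating longer inputs than full transcription (long outputs).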
Capabilities: Transcription, audio Q&A, summarization, and function calling directly from voice; multilingual across 8+ languages.
Training & deployment: Released July 15, 2025 (paper: arXiv 2507.13264). Requires ~9.5 GB of GPU RAM in bf16/fp16 and runs on a single GPU. Distributed under Apache 2.0; optimized hosted inference is available on Mistral's La Plateforme (the Voxtral Mini Transcribe variant starts at $0.001/audio-minute).
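A back-of-envelope check of the ~9.5 GB figure. The split between weights and runtime overhead below is a rough attribution, not an official breakdown:

```python
# ~3B parameters at 2 bytes each (bf16/fp16) gives the weight footprint;
# the rest of the quoted ~9.5 GB goes to the audio encoder, KV cache,
# and activations at inference time (rough attribution, assumption).
params = 3.0e9
bytes_per_param = 2
weights_gb = params * bytes_per_param / 1024**3
print(round(weights_gb, 1))  # ~5.6 GB of weights alone
```

This is why the model clears a single 16 GB consumer GPU with room to spare, matching the single-GPU claim above.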
Use cases: Cost-sensitive transcription, voice agents, multilingual meeting/call transcripts, edge and local deployments where Whisper-class quality is needed at sub-Whisper cost.