Mistral AI's compact 3B-parameter open-weights audio-language model built on Ministral-3B with a Whisper-derived encoder, designed for transcription, audio Q&A, summarization, and function-calling from voice across 8+ languages.
Architecture: Multimodal audio-language model combining a Whisper-style audio encoder with the Ministral-3B decoder LLM through a multi-modal projector (multi_modal_projector.linear_1 / linear_2). Implemented in Hugging Face Transformers as VoxtralForConditionalGeneration (≥ 4.54.0).
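A minimal loading sketch in Transformers is shown below; the checkpoint id mistralai/Voxtral-Mini-3B-2507 and the module inspection loop are illustrative assumptions rather than details stated in this section.

```python
import torch
from transformers import AutoProcessor, VoxtralForConditionalGeneration

repo_id = "mistralai/Voxtral-Mini-3B-2507"  # assumed checkpoint id; substitute the actual repo

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the weights near the ~9.5 GB figure quoted below
    device_map="cuda",
)

# List the projector layers that bridge the Whisper-style encoder and the Ministral-3B decoder.
for name, module in model.named_modules():
    if "multi_modal_projector" in name:
        print(name, type(module).__name__)
```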
Modality: Audio + text in / text out. 32k token context window; supports audio up to ~30 minutes for transcription and ~40 minutes for understanding.
Capabilities: Long-form speech transcription, audio question answering and summarization, and function-calling triggered directly from speech, with multilingual coverage across 8+ languages.
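A hedged sketch of the audio Q&A / summarization flow, continuing from the loading snippet above; the file name and prompt are placeholders, and the apply_chat_template call pattern follows the published Transformers integration as I understand it, so verify it against the model card.

```python
# Ask a question about a recording: audio + text in, text out.
# "meeting.mp3" and the prompt are placeholders.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "meeting.mp3"},
            {"type": "text", "text": "Summarize the key decisions in this recording."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(model.device, dtype=torch.bfloat16)

# The 32k-token context window is what allows long recordings (~40 min for understanding).
outputs = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```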
Training & deployment: Released July 15, 2025 (paper arXiv 2507.13264). ~9.5 GB GPU RAM in bf16/fp16; runs on a single GPU. Distributed under Apache 2.0 with optimized hosted inference on Mistral's La Plateforme (Voxtral Mini Transcribe variant from $0.001/audio-minute).
Use cases: Cost-sensitive transcription, voice agents, multilingual meeting/call transcripts, edge and local deployments where Whisper-class quality is needed at sub-Whisper cost.