Voxtral Small is Mistral AI's flagship 24B-parameter open-weights audio-language model, built on Mistral Small 3.1 with a Whisper-derived audio encoder; it delivers state-of-the-art transcription, translation, and audio understanding across 8+ languages.
Architecture: Multimodal audio-language model pairing a Whisper-style audio encoder with the Mistral Small 3.1 (24B) decoder LLM via a multi-modal projector. Implemented as VoxtralForConditionalGeneration in Hugging Face Transformers; recommended deployment via vLLM with --tensor-parallel-size 2 (≈55 GB GPU RAM in bf16/fp16).
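The recommended vLLM deployment above can be sketched as follows. This is a minimal sketch, not an official recipe: the model ID `mistralai/Voxtral-Small-24B-2507` and the `[audio]` install extra are assumptions, and only `--tensor-parallel-size 2` comes directly from this page.

```shell
# Install vLLM with audio support (exact extras may vary by vLLM version)
pip install -U "vllm[audio]"

# Serve Voxtral Small across 2 GPUs (~55 GB total in bf16/fp16);
# the mistral tokenizer/config/load modes match Mistral's published weights format
vllm serve mistralai/Voxtral-Small-24B-2507 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tensor-parallel-size 2
```

The server then exposes an OpenAI-compatible endpoint (`/v1/chat/completions` and `/v1/audio/transcriptions`) that clients can target.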
Modality: Audio + text in / text out. 32k-token context window, handling audio up to ~30 min for transcription and ~40 min for audio understanding.
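The context budget can be sanity-checked with back-of-the-envelope arithmetic. The ~12.5 audio tokens per second rate below is an assumption (the approximate audio embedding frequency reported for Voxtral), and the sketch ignores special and text tokens:

```python
def audio_tokens(seconds: float, tokens_per_second: float = 12.5) -> int:
    """Estimate audio tokens consumed, assuming ~12.5 tokens/sec (assumed rate)."""
    return int(seconds * tokens_per_second)

CONTEXT_WINDOW = 32_000  # 32k-token context window

# A 40-minute recording for audio understanding:
tokens = audio_tokens(40 * 60)
print(tokens, tokens <= CONTEXT_WINDOW)  # 30000 True
```

At this assumed rate, ~40 minutes of audio lands just under the 32k window, consistent with the stated limit.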
Capabilities: Dedicated transcription mode, long-form audio understanding, built-in Q&A and summarization over audio, automatic language detection, and function calling triggered directly from voice; it also retains the text capabilities of its Mistral Small 3.1 backbone. Function calling is enabled in vLLM with --tool-call-parser mistral --enable-auto-tool-choice.
Training & release: Released July 15, 2025 alongside Voxtral Mini (paper: arXiv 2507.13264). Distributed under Apache 2.0; available on Mistral's La Plateforme, in Le Chat voice mode, and via private deployment.
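Voice-triggered function calling can be sketched as a request body for an OpenAI-compatible vLLM endpoint. This is an illustrative payload only: `get_weather` is a hypothetical tool, the model ID is assumed, and the base64 audio placeholder is left unfilled; the `tools`/`tool_choice` fields follow the OpenAI chat-completions format that vLLM's Mistral tool parser accepts.

```python
import json

# Hypothetical tool a voice agent could trigger (OpenAI "tools" schema)
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # assumed example function, not a real API
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body sketch: one user turn carrying base64-encoded audio
payload = {
    "model": "mistralai/Voxtral-Small-24B-2507",  # assumed served model ID
    "messages": [{
        "role": "user",
        "content": [{
            "type": "input_audio",
            "input_audio": {"data": "<base64-encoded wav>", "format": "wav"},
        }],
    }],
    "tools": [weather_tool],
    "tool_choice": "auto",  # pairs with --enable-auto-tool-choice on the server
}

body = json.dumps(payload)  # serializable, ready to POST to /v1/chat/completions
```

The model hears the request in the audio turn and, when appropriate, responds with a `tool_calls` entry instead of plain text, which the client then executes.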
Use cases: Production-scale transcription, multilingual voice agents, meeting/call analysis, podcast and media indexing, voice-driven workflow automation in regulated environments.