Mistral AI's flagship 24B-parameter open-weights audio-language model built on Mistral Small 3.1 with a Whisper-derived encoder, delivering state-of-the-art transcription, translation, and audio understanding in 8+ languages.
A solid 24B-parameter dense audio model from Mistral AI. Treat the modality benchmarks above as the leading indicator of fit; composite scoring across modalities is still maturing.
Architecture: Multimodal audio-language model pairing a Whisper-style audio encoder with the Mistral Small 3.1 (24B) decoder LLM via a multi-modal projector. Implemented as VoxtralForConditionalGeneration in Hugging Face Transformers; recommended deployment via vLLM with --tensor-parallel-size 2 (≈55 GB GPU RAM in bf16/fp16).
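For a concrete sense of the Transformers interface named above, here is a minimal inference sketch using VoxtralForConditionalGeneration. The audio path and prompt are placeholders, and the dtype/device settings assume a multi-GPU bf16 setup as described; treat this as illustrative rather than a definitive recipe.

```python
import torch
from transformers import AutoProcessor, VoxtralForConditionalGeneration

repo_id = "mistralai/Voxtral-Small-24B-2507"

# The processor bundles the Whisper-style audio feature extractor and the tokenizer.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # ~55 GB in bf16; device_map shards across available GPUs
    device_map="auto",
)

# Audio and text are interleaved in a single user turn; the file path is a placeholder.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "meeting.mp3"},
            {"type": "text", "text": "Summarize the key decisions in this recording."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation, return_tensors="pt")
inputs = inputs.to(model.device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding so only the generated answer is printed.
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```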
Modality: Audio + text in / text out. The 32k-token context window handles audio up to ~30 minutes for transcription and ~40 minutes for audio understanding.
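Audio beyond these limits must be segmented client-side before being sent to the model. A minimal sketch using only the Python standard library; the 25-minute chunk size and file names are illustrative choices, not values from the source:

```python
import wave

CHUNK_MINUTES = 25  # illustrative: stays under the ~30 min transcription limit

def split_wav(path: str, prefix: str) -> list[str]:
    """Split a WAV file into fixed-length chunks short enough for the model."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * 60 * CHUNK_MINUTES
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # header frame count is fixed up on close
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths

chunks = split_wav("long_meeting.wav", "chunk")
```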
Capabilities:
- Dedicated transcription mode with automatic language detection.
- Built-in Q&A and summarization directly over audio, with no separate ASR-then-LLM pipeline.
- Natively multilingual across widely used languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.
- Function calling straight from voice, enabled in vLLM with --tool-call-parser mistral --enable-auto-tool-choice (see the client sketch below).
- Retains the full text capabilities of its Mistral Small 3.1 backbone.
Training & release: Released July 15, 2025 alongside Voxtral Mini (paper: arXiv:2507.13264). Distributed under Apache 2.0; available on Mistral's La Plateforme, in Le Chat voice mode, and via private deployment.
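To illustrate the function-calling capability noted above: a hedged sketch of querying a vLLM server started with those flags, through its OpenAI-compatible API. The endpoint URL, tool schema, and audio URL are assumptions for illustration, and the audio_url content type reflects vLLM's multimodal chat format rather than anything specified in this card.

```python
from openai import OpenAI

# Assumes a server launched roughly as:
#   vllm serve mistralai/Voxtral-Small-24B-2507 --tensor-parallel-size 2 \
#     --tool-call-parser mistral --enable-auto-tool-choice
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool schema, defined here purely for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "create_ticket",
        "description": "Open a support ticket from a caller's request.",
        "parameters": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            },
            "required": ["summary"],
        },
    },
}]

response = client.chat.completions.create(
    model="mistralai/Voxtral-Small-24B-2507",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder URL; vLLM accepts audio via the audio_url content type.
            {"type": "audio_url", "audio_url": {"url": "https://example.com/call.wav"}},
            {"type": "text", "text": "File a ticket for the issue described in this call."},
        ],
    }],
    tools=tools,
)

print(response.choices[0].message.tool_calls)
```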
Use cases: Production-scale transcription, multilingual voice agents, meeting/call analysis, podcast and media indexing, voice-driven workflow automation in regulated environments.