
Mistral AI's flagship 24B-parameter open-weights audio-language model built on Mistral Small 3.1 with a Whisper-derived encoder, delivering state-of-the-art transcription, translation, and audio understanding in 8+ languages.
A solid 24B-parameter dense audio model from Mistral AI. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 15 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · On-Demand · 24 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3090Vast.ai · On-Demand · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Voxtral Small 24B (2507) is Mistral AI's open-weights speech understanding model, built on the Mistral Small 3.1 text backbone with a Whisper-derived audio encoder. At 24B parameters, it occupies a specific niche: a dense model that handles both text and audio input without routing through a separate ASR pipeline. Released under Apache 2.0, it targets developers who need production-grade transcription, translation, and audio comprehension on their own hardware.
This isn't a general-purpose chatbot with audio tacked on. Voxtral Small is purpose-built for speech tasks—transcription, translation, question-answering from audio, and summarization—while retaining the text capabilities of its Mistral Small 3.1 foundation. It competes directly with closed-source audio APIs and other open-weight multimodal models like Meta's SeamlessM4T or Whisper-based pipelines, but with the advantage of native understanding rather than cascaded ASR-plus-LLM architectures.
The model supports 8 languages natively: English, French, German, Spanish, Italian, Portuguese, Dutch, and Hindi. For practitioners evaluating local deployment, the 24B parameter count places it in a range that's demanding but feasible on consumer hardware with proper quantization.
Voxtral Small uses a dense transformer architecture with 24B parameters. Unlike mixture-of-experts models that activate only a subset of parameters per token, dense architectures load the full parameter set into memory during inference. This means VRAM requirements are straightforward to calculate: at 16-bit precision, the model occupies roughly 48 GB. Quantization brings this down significantly—more on that in the local deployment section.
The audio pipeline uses a Whisper-derived encoder to process speech input. This encoder converts audio into embeddings that feed into the Mistral Small 3.1 text decoder, enabling the model to understand spoken language directly without external speech-to-text preprocessing. The context window is 32K tokens, which translates to approximately 30 minutes of audio for transcription or 40 minutes for understanding tasks. This is sufficient for most real-world use cases like meeting transcription, lecture processing, or customer call analysis, but falls short of the 128K+ context windows offered by some text-only competitors.
The model supports function calling from voice input—spoken commands can trigger backend workflows or API calls—and handles multiple audio segments per conversation turn. System prompts are not yet supported, which limits some fine-tuning approaches for custom behavior.
Voxtral Small excels in three primary areas:
Transcription and Translation: The model achieves state-of-the-art word error rates on FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks. It automatically detects the source language and transcribes or translates accordingly. For practitioners building multilingual transcription pipelines, this eliminates the need for separate language detection and ASR models.
Audio Understanding and Q&A: Unlike cascaded systems that transcribe first then process text, Voxtral Small can answer questions about audio content directly. A developer building a meeting transcription tool can ask the model to summarize action items, extract named entities, or identify speakers without intermediate text processing steps.
Voice-Triggered Function Calling: The model supports function calling from spoken input. This is relevant for voice-controlled applications—smart assistants, workflow automation, or hands-free data entry—where users trigger specific actions through natural speech.
Text-only performance remains strong, retaining the capabilities of Mistral Small 3.1. The model handles coding, reasoning, and general text tasks, though its primary value proposition is audio. For pure text workloads, you'd likely choose a specialized text model.
This section addresses the practical question: can you run this model on hardware you actually own?
| Quantization | VRAM Required | Quality Impact |
|---|---|---|
| FP16 (full) | ~48 GB | Reference quality |
| Q8_0 | ~26 GB | Minimal loss |
| Q4_K_M | ~15 GB | Good for most tasks |
| Q4_0 | ~13 GB | Noticeable degradation |
| Q3_K_M | ~11 GB | Significant quality loss |
RTX 4090 (24 GB): This is the sweet spot. With Q4_K_M quantization, the model fits comfortably with room for context. Expect 10-20 tokens per second on audio transcription tasks, depending on audio length and batch settings. For pure text generation, you'll see higher throughput.
RTX 3090 (24 GB): Same VRAM as the 4090, but slower memory bandwidth. Expect 6-12 tokens per second. Viable for batch processing where latency isn't critical.
M4 Max (48 GB unified memory): Runs Q8_0 quantization comfortably. Performance depends on memory bandwidth—expect 8-15 tokens per second with Apple's Metal backend.
Dual RTX 4090 (48 GB): Enables FP16 inference. Tensor parallelism works with vLLM, giving near-linear scaling. This is the setup for production workloads requiring maximum accuracy.
RTX 4060 (8 GB): Not viable. Even Q3_K_M at 11 GB exceeds VRAM. Offloading to system RAM makes inference impractically slow.
For most users: Q4_K_M quantization on a single 24 GB GPU. This balances quality and performance. Use vllm as the inference engine—it handles the Whisper encoder pipeline efficiently and supports tensor parallelism if you add a second GPU. Ollama support is expected but not confirmed at launch; check the Ollama model library for updates.
vs. Whisper-large-v3 + Mistral Small 3.1 pipeline: The cascaded approach gives you more flexibility—swap out components independently—but introduces latency and error propagation between stages. Voxtral Small's end-to-end architecture produces lower word error rates and supports audio Q&A natively. If you need transcription only, the cascaded pipeline is cheaper to run. If you need understanding, Voxtral wins.
vs. Meta SeamlessM4T (2.3B): SeamlessM4T is smaller, runs on less hardware, and targets translation specifically. Voxtral Small offers broader capabilities—transcription, understanding, function calling—at the cost of higher VRAM requirements. For multilingual translation-only workloads, SeamlessM4T is more efficient. For production applications requiring both transcription and downstream processing, Voxtral Small eliminates the integration complexity.
vs. GPT-4o Audio (API): The closed-source comparison. GPT-4o Audio offers larger context windows and broader language support, but you cannot run it locally. Voxtral Small's Apache 2.0 license means zero per-token costs, full data privacy, and no API dependency. The tradeoff is hardware investment and lower throughput.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Mistral AI model we track.

Explore the Family
The full Voxtral family leaderboard with sizes, benchmark scores, and a release timeline.