
Mistral AI's compact 3B-parameter open-weights audio-language model built on Ministral-3B with a Whisper-derived encoder, designed for transcription, audio Q&A, summarization, and function-calling from voice across 8+ languages.
A solid 3B-parameter dense audio model from Mistral AI. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Mistral AI’s Voxtral Mini 3B (2507) is a compact open-weights audio-language model that grafts Whisper-derived audio understanding onto the Ministral-3B text backbone. At 3 billion dense parameters, it’s built for practitioners who need local speech transcription, audio Q&A, summarization, and voice-driven function calling—without sending audio data to the cloud. Released under Apache 2.0, it targets edge devices, consumer GPUs, and any environment where privacy, latency, or cost rules out API-based solutions.
Voxtral Mini competes in a narrow but growing niche: small models that process spoken language natively rather than relying on a separate ASR pipeline. Unlike cascading a Whisper model with a small LLM, Voxtral fuses both capabilities into one forward pass. This makes it a practical choice for real-time transcription, voice-controlled agents, and embedded systems where every millisecond and megabyte counts.
The model’s parameter count (3B) and dense architecture mean it fits on hardware that can’t run larger 7B+ models. It is not a data-center model—it’s meant to run on your workstation, laptop, or even a Raspberry Pi with the right quantization.
Voxtral Mini is built on Ministral-3B, a dense 3B-parameter language model, with a Whisper-derived audio encoder added at the input stage. This is not a mixture-of-experts (MoE) architecture; all 3B parameters are active during inference. That translates to predictable memory usage and consistent throughput—no routing overhead, no load balancing issues.
The audio encoder converts raw speech into embeddings that the language model processes alongside text tokens. The model accepts audio files up to 40 minutes long (thanks to a 32k token context window), which is sufficient for full meeting transcriptions, lecture recordings, or long-form dictation. The context window applies to combined audio and text inputs, so you can interleave questions with audio chunks in a single conversation.
For local inference, the dense architecture means you need to allocate VRAM for the full parameter set. A 3B model in 16-bit uses roughly 6 GB of VRAM (3B × 2 bytes). Quantized versions (4-bit or 8-bit) reduce that further, making the model runnable on 8 GB GPU cards and even some NPUs. Mistral recommends vLLM >= 0.10.0 for production deployments, with temperature=0.2 and top_p=0.95 for chat/understanding tasks, and temperature=0.0 for pure transcription.
The model supports multiple audio files per message and multi-turn audio conversation. System prompts are not yet supported.
Voxtral Mini achieves state-of-the-art word error rates across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks. It automatically detects the source language among eight supported languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian) and transcribes without manual language hints. A dedicated transcription mode optimizes output for pure speech-to-text.
You can feed the model a long audio clip and ask questions about its content—no separate ASR step needed. The model directly reasons over the speech signal. This enables workflows like extracting key points from a recorded meeting, querying a podcast for specific facts, or summarizing a lecture. Benchmarks show it performs comparably to larger closed models like GPT-4o mini and Gemini 2.5 Flash on speech QA and summarization.
Voxtral Mini can infer intent directly from spoken commands—map them to backend functions, API calls, or workflow triggers. For example, a voice assistant running locally could hear “set a timer for 10 minutes” and invoke the set_timer(600) function without text intermediaries. This is useful for smart home automation, field service tools, and hands-free interfaces.
The Ministral-3B backbone remains intact: Voxtral Mini handles standard text tasks (classification, extraction, reasoning) with the same competence as the original text-only model. You don’t sacrifice text capabilities for audio.
Quickest path: Use Ollama with the mistralai/voxtral-mini-3b-2507 model (once officially supported), or run directly via vLLM:
1vllm serve mistralai/Voxtral-Mini-3B-2507 --trust-remote-code
| Quantization | Approx. VRAM | Minimum GPU | Recommended GPU |
|---|---|---|---|
| FP16 | 6.0 GB | 8 GB | RTX 4060 Ti 16GB / M4 Max |
| Q8_0 | 3.5 GB | 4 GB | RTX 3060 12GB |
| Q4_K_M | 2.6 GB | 4 GB | RTX 3050 / M4 Pro |
| Q4_0 | 2.2 GB | 3 GB | Raspberry Pi 5 (via MLX) |
Recommended quantization: Q4_K_M for most users on consumer GPUs. It preserves most benchmark accuracy while keeping VRAM under 3 GB, leaving room for audio audio data and context. On high-end GPUs like RTX 4090 or M4 Max, run FP16 or Q8_0 for maximum fidelity.
These numbers are for text generation with a cold cache. Audio encoding adds some overhead (≈0.5–1 second per minute of audio), but the model stays interactive for most applications.
Choose Voxtral Mini when you want a single, optimized model for voice-first applications. Choose a pipeline if you need to update the ASR or LLM independently.
Choose Voxtral Mini for edge deployments and tasks centered on speech transcription, summarization, and function calling. Choose Qwen2-Audio for broader audio understanding (non-speech sounds, affect detection) when hardware allows.
Voxtral Mini is the model to beat among sub-4B models for local voice AI. It’s not a general-purpose text model; it’s a purpose-built audio-language tool that happens to also be a capable text model. If your workload is purely text, consider Ministral-3B or Phi-3.5 instead.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Mistral AI model we track.

Explore the Family
The full Voxtral family leaderboard with sizes, benchmark scores, and a release timeline.