Voxtral Small is Mistral AI's flagship 24B-parameter open-weights audio-language model, built on Mistral Small 3.1 with a Whisper-derived audio encoder; it delivers state-of-the-art transcription, translation, and audio understanding across 8+ languages.
Architecture: Multimodal audio-language model pairing a Whisper-style audio encoder with the Mistral Small 3.1 (24B) decoder LLM via a multi-modal projector. Implemented as VoxtralForConditionalGeneration in Hugging Face Transformers; recommended deployment via vLLM with --tensor-parallel-size 2 (≈55 GB GPU RAM in bf16/fp16).
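The recommended vLLM deployment above can be sketched as follows. This is a minimal sketch, not an official recipe: the model ID `mistralai/Voxtral-Small-24B-2507` and the `[audio]` install extra are assumptions, and only `--tensor-parallel-size 2` comes directly from this page.

```shell
# Install vLLM with audio support (exact extras may vary by vLLM version)
pip install -U "vllm[audio]"

# Serve Voxtral Small across 2 GPUs (~55 GB total in bf16/fp16);
# the mistral tokenizer/config/load modes match Mistral's published weights format
vllm serve mistralai/Voxtral-Small-24B-2507 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tensor-parallel-size 2
```

The server then exposes an OpenAI-compatible endpoint (`/v1/chat/completions` and `/v1/audio/transcriptions`) that clients can target.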
Modality: Audio + text in / text out. 32k-token context window, handling audio up to ~30 min for transcription and ~40 min for audio understanding.
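The context budget can be sanity-checked with back-of-the-envelope arithmetic. The ~12.5 audio tokens per second rate below is an assumption (the approximate audio embedding frequency reported for Voxtral), and the sketch ignores special and text tokens:

```python
def audio_tokens(seconds: float, tokens_per_second: float = 12.5) -> int:
    """Estimate audio tokens consumed, assuming ~12.5 tokens/sec (assumed rate)."""
    return int(seconds * tokens_per_second)

CONTEXT_WINDOW = 32_000  # 32k-token context window

# A 40-minute recording for audio understanding:
tokens = audio_tokens(40 * 60)
print(tokens, tokens <= CONTEXT_WINDOW)  # 30000 True
```

At this assumed rate, ~40 minutes of audio lands just under the 32k window, consistent with the stated limit.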
Capabilities: Dedicated transcription mode, long-form audio understanding, built-in Q&A and summarization over audio, automatic language detection, and function calling triggered directly from voice; it also retains the text capabilities of its Mistral Small 3.1 backbone. Function calling is enabled in vLLM with --tool-call-parser mistral --enable-auto-tool-choice.
Training & release: Released July 15, 2025 alongside Voxtral Mini (paper: arXiv 2507.13264). Distributed under Apache 2.0; available on Mistral's La Plateforme, in Le Chat voice mode, and via private deployment.
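Voice-triggered function calling can be sketched as a request body for an OpenAI-compatible vLLM endpoint. This is an illustrative payload only: `get_weather` is a hypothetical tool, the model ID is assumed, and the base64 audio placeholder is left unfilled; the `tools`/`tool_choice` fields follow the OpenAI chat-completions format that vLLM's Mistral tool parser accepts.

```python
import json

# Hypothetical tool a voice agent could trigger (OpenAI "tools" schema)
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # assumed example function, not a real API
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body sketch: one user turn carrying base64-encoded audio
payload = {
    "model": "mistralai/Voxtral-Small-24B-2507",  # assumed served model ID
    "messages": [{
        "role": "user",
        "content": [{
            "type": "input_audio",
            "input_audio": {"data": "<base64-encoded wav>", "format": "wav"},
        }],
    }],
    "tools": [weather_tool],
    "tool_choice": "auto",  # pairs with --enable-auto-tool-choice on the server
}

body = json.dumps(payload)  # serializable, ready to POST to /v1/chat/completions
```

The model hears the request in the audio turn and, when appropriate, responds with a `tool_calls` entry instead of plain text, which the client then executes.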
Use cases: Production-scale transcription, multilingual voice agents, meeting/call analysis, podcast and media indexing, voice-driven workflow automation in regulated environments.