
Mistral's compact 24B dense model. Fits on a single RTX 4090 when quantized. Strong agentic capabilities with native function calling. Apache 2.0 licensed.
Copy and paste this command to start running the model locally.
ollama run mistral-small:24b
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | ~8.9 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (recommended) | ~14.3 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | ~16.8 GB | Very Good | Slightly better quality than Q4 with a moderate size increase |
| Q6_K | ~19.3 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | ~25.1 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | ~47.2 GB | Full | Full 16-bit floating point: maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Tier | Throughput | VRAM Used |
|---|---|---|---|
| NVIDIA B200 | SS | 165.1 tok/s | ~14.3 GB |
| NVIDIA H200 SXM 141GB | SS | 99.1 tok/s | ~14.3 GB |
| NVIDIA H100 SXM5 80GB | SS | 69.2 tok/s | ~14.3 GB |
| Google Cloud TPU v5p | SS | 57.1 tok/s | ~14.3 GB |
| NVIDIA A100 SXM4 80GB | SS | 42.1 tok/s | ~14.3 GB |
| NVIDIA L40S | BB | 17.8 tok/s | ~14.3 GB |
Mistral Small 3 (24B) represents the current sweet spot for local LLM practitioners. Released by Mistral AI under the Apache 2.0 license, this model fills the massive performance gap between 7B/8B "small" models and 70B+ "large" models. It is a dense 24-billion parameter model designed specifically for agentic workflows, native function calling, and complex reasoning tasks that typically require much larger compute footprints.
For developers looking to run Mistral Small 3 24B locally, the primary draw is its efficiency. It provides a significant intelligence uplift over Llama 3.1 8B or Mistral 7B while remaining small enough to reside entirely on consumer-grade hardware. It is positioned as a direct competitor to models like Gemma 2 27B and Qwen 2.5 32B, offering a balanced architecture that prioritizes logic and instruction-following over raw parameter count.
Mistral Small 3 24B utilizes a standard dense transformer architecture. Unlike Mixture-of-Experts (MoE) models where only a fraction of parameters are active during inference, this 24B model engages all parameters for every token generated. While this means higher compute requirements per token compared to an MoE of the same active size, it results in superior state-retention and reasoning stability within its weight class.
The model features a massive 128,000 token context window, making it viable for long-document analysis, complex codebase ingestion, and multi-turn agentic loops. Trained on data up to October 2023, the model's knowledge base is relatively modern. Because it is a dense model, VRAM consumption is predictable: the weights themselves take up the bulk of the memory, with the KV cache scaling linearly with the context depth used.
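To make that linear scaling concrete, here is a back-of-the-envelope KV-cache estimate. The layer count, KV-head count, and head dimension below are assumed values for illustration (a grouped-query-attention layout is typical for this model class), not confirmed specifications:

```python
# Rough KV-cache size for a dense transformer with grouped-query attention.
# n_layers / n_kv_heads / head_dim are ASSUMED values for illustration.
def kv_cache_bytes(context_tokens, n_layers=40, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # The factor of 2 accounts for storing both K and V at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_tokens

gb = kv_cache_bytes(32_768) / 1024**3
print(f"~{gb:.1f} GiB of KV cache at a 32k context (FP16 cache)")  # ~5.0 GiB
```

Under these assumptions the cache grows by roughly 0.16 MB per token, so doubling the context depth doubles the cache on top of the fixed weight footprint.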
The "Small" moniker is deceptive; Mistral Small 3 24B is purpose-built for "agentic" tasks. Its performance in function-calling and JSON output reliability makes it one of the best choices for local RAG (Retrieval-Augmented Generation) pipelines.
The Mistral Small 3 24B reasoning benchmark results show it consistently punching above its weight, particularly in multi-step problem solving. It is effective for tasks that require "thinking" before responding, such as logistical planning or complex data extraction where the model must navigate contradictory information within the context.
Mistral Small 3 24B for coding is a standout use case. It handles Python, JavaScript, C++, and Rust with high proficiency. Because of its 128k context, you can feed it multiple files from a local repository to perform refactoring or bug hunting. It follows system prompts with higher fidelity than 8B models, making it less likely to "hallucinate" library syntax that doesn't exist.
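One way to sketch that multi-file workflow: the hypothetical helper below concatenates source files into a single prompt, stopping before a rough token budget is exceeded. The 4-characters-per-token ratio is a rule of thumb, not a real tokenizer count:

```python
from pathlib import Path

# Hypothetical helper: pack several source files into one prompt so the
# model sees related code together. Token counts are crude estimates.
def pack_files(named_texts, budget_tokens=100_000):
    parts, used = [], 0
    for name, text in named_texts:
        est = len(text) // 4  # ~4 chars per token, rule of thumb
        if used + est > budget_tokens:
            break
        parts.append(f"### FILE: {name}\n{text}")
        used += est
    return "\n\n".join(parts), used

def pack_repo(root, exts=(".py", ".js", ".cpp", ".rs")):
    files = ((str(p), p.read_text(errors="ignore"))
             for p in sorted(Path(root).rglob("*"))
             if p.suffix in exts and p.is_file())
    return pack_files(files)
```

Feeding the packed string as one prompt keeps related files adjacent in context, which tends to help with cross-file refactoring and bug-hunting questions.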
The model is natively multilingual, supporting English, French, German, Spanish, Italian, Chinese, and several other languages. Its instruction-following capabilities are tuned for precision; if you provide a strict schema for a function call, the model adheres to it without the "conversational fluff" that plagues less-optimized models.
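As an illustration of schema-driven tool use, the snippet below builds a function-calling request body in the shape Ollama's `/api/chat` endpoint accepts; the `get_weather` tool is a made-up example, not part of the model or this article:

```python
import json

# Build a function-calling request for a local Ollama server.
# The get_weather tool is a HYPOTHETICAL example schema.
def build_tool_request(user_message, model="mistral-small:24b"):
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {  # standard JSON Schema object
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "stream": False,
    }

body = json.dumps(build_tool_request("What's the weather in Paris?"))
```

A well-tuned model answers with a `tool_calls` entry whose arguments conform to the declared schema, which is what makes this weight class viable for agent loops.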
The Mistral Small 3 24B hardware requirements are the most critical factor for local deployment. This model was seemingly designed with the 24GB VRAM buffer in mind—the exact capacity of the NVIDIA RTX 3090 and 4090.
To run a 24B model on consumer GPU hardware, quantization is mandatory. In full 16-bit precision (FP16), the weights alone occupy roughly 48 GB of VRAM, which would necessitate a dual-GPU setup or an enterprise card. In practice, most practitioners use 4-bit or 8-bit quantized versions.
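The arithmetic behind those figures is simple: multiply the parameter count by bits per weight and divide by 8. The sketch below uses ~4.85 bits/weight for Q4_K_M, an approximation; real GGUF files run slightly larger because some tensors stay at higher precision and quantization stores per-block scales:

```python
# Back-of-the-envelope weight-memory estimate: params x bits / 8.
def weight_gb(params_billion=24.0, bits_per_weight=16.0):
    return params_billion * bits_per_weight / 8  # decimal GB

print(f"FP16:   ~{weight_gb():.0f} GB")                      # ~48 GB
print(f"Q4_K_M: ~{weight_gb(bits_per_weight=4.85):.1f} GB")  # ~14.5 GB
```

Remember to budget extra headroom beyond the weights for the KV cache and runtime overhead.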
The best GPU for Mistral Small 3 24B is the NVIDIA RTX 4090. With 24GB of VRAM, you can run a Q4_K_M or Q5_K_M quantization while maintaining a 32k+ token context window entirely in VRAM.
For Mac users, an M2/M3/M4 Max with at least 32GB of Unified Memory provides an excellent experience. While the Mistral Small 3 24B tokens per second will be lower on Apple Silicon than on an RTX 4090, the ability to scale to the full 128k context window using 64GB+ of RAM is a significant advantage.
The fastest way to test this model is via Ollama. Once installed, run:
ollama run mistral-small:24b
This will pull the default 4-bit quantized version, which is optimized for most modern setups.
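Ollama also exposes a local REST API on port 11434, which is handy for scripting. A minimal sketch, assuming the daemon is running and the model has already been pulled:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_body(prompt, model="mistral-small:24b"):
    # stream=False returns a single JSON object instead of a token stream.
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()

def generate(prompt, url=OLLAMA_URL):
    req = request.Request(url, data=build_body(prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server):
# print(generate("Explain grouped-query attention in one sentence."))
```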
When evaluating this as your mid-size local model for 2025, it is helpful to compare it against its closest rivals in the "medium-weight" category.
The Llama 3.1 8B is faster and runs on almost any modern laptop. However, Mistral Small 24B is significantly more "intelligent." It is less prone to repetition, better at following complex logic, and far superior at function calling. If your hardware can handle 16GB+ of VRAM, the 24B model is a definitive upgrade.
Google's Gemma 2 27B is a formidable competitor. Gemma 2 often scores slightly higher on creative writing and general-knowledge benchmarks. However, Mistral Small 3 24B generally wins on technical tasks, coding, and tool use. Additionally, Mistral's Apache 2.0 license is more permissive for developers than the Gemma terms of use.
Qwen 2.5 32B is larger and arguably the king of benchmarks in this size class. However, the 32B size makes it much harder to fit on a single 24GB GPU once you account for the KV cache. Mistral Small 3 24B is "right-sized" for the 4090, allowing for higher throughput and longer context than the 32B Qwen on the same hardware.