
Mistral's compact 24B dense model. Fits on a single RTX 4090 when quantized. Strong agentic capabilities with native function calling. Apache 2.0 licensed.
Copy and paste this command to start running the model locally.
ollama run mistral-small:24b
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | ~8.9 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (recommended) | ~14.3 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | ~16.8 GB | Very Good | Slightly better quality than Q4 with a moderate size increase |
| Q6_K | ~19.3 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | ~25.1 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | ~47.2 GB | Full | Full 16-bit floating point: maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Tier | Throughput | VRAM Used |
|---|---|---|---|
| NVIDIA B200 | SS | 165.1 tok/s | ~14.3 GB |
| NVIDIA H200 SXM 141GB | SS | 99.1 tok/s | ~14.3 GB |
| NVIDIA H100 SXM5 80GB | SS | 69.2 tok/s | ~14.3 GB |
| Google Cloud TPU v5p | SS | 57.1 tok/s | ~14.3 GB |
| NVIDIA A100 SXM4 80GB | SS | 42.1 tok/s | ~14.3 GB |
| NVIDIA L40S | BB | 17.8 tok/s | ~14.3 GB |
Mistral Small 3 (24B) represents the current sweet spot for local LLM practitioners. Released by Mistral AI under the Apache 2.0 license, this model fills the massive performance gap between 7B/8B "small" models and 70B+ "large" models. It is a dense 24-billion parameter model designed specifically for agentic workflows, native function calling, and complex reasoning tasks that typically require much larger compute footprints.
For developers looking to run Mistral Small 3 24B locally, the primary draw is its efficiency. It provides a significant intelligence uplift over Llama 3.1 8B or Mistral 7B while remaining small enough to reside entirely on consumer-grade hardware. It is positioned as a direct competitor to models like Gemma 2 27B and Qwen 2.5 32B, offering a balanced architecture that prioritizes logic and instruction-following over raw parameter count.
Mistral Small 3 24B utilizes a standard dense transformer architecture. Unlike Mixture-of-Experts (MoE) models where only a fraction of parameters are active during inference, this 24B model engages all parameters for every token generated. While this means higher compute requirements per token compared to an MoE of the same active size, it results in superior state-retention and reasoning stability within its weight class.
The model features a massive 128,000 token context window, making it viable for long-document analysis, complex codebase ingestion, and multi-turn agentic loops. Trained on data up to October 2023, the model's knowledge base is relatively modern. Because it is a dense model, VRAM consumption is predictable: the weights themselves take up the bulk of the memory, with the KV cache scaling linearly with the context depth used.
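To make that linear scaling concrete, here is a back-of-the-envelope KV-cache estimate. The layer count, KV-head count, and head dimension below are assumed values for illustration (a grouped-query-attention layout is typical for this model class), not confirmed specifications:

```python
# Rough KV-cache size for a dense transformer with grouped-query attention.
# n_layers / n_kv_heads / head_dim are ASSUMED values for illustration.
def kv_cache_bytes(context_tokens, n_layers=40, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # The factor of 2 accounts for storing both K and V at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_tokens

gb = kv_cache_bytes(32_768) / 1024**3
print(f"~{gb:.1f} GiB of KV cache at a 32k context (FP16 cache)")  # ~5.0 GiB
```

Under these assumptions the cache grows by roughly 0.16 MB per token, so doubling the context depth doubles the cache on top of the fixed weight footprint.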
The "Small" moniker is deceptive; Mistral Small 3 24B is purpose-built for "agentic" tasks. Its performance in function-calling and JSON output reliability makes it one of the best choices for local RAG (Retrieval-Augmented Generation) pipelines.
The Mistral Small 3 24B reasoning benchmark results show it consistently punching above its weight, particularly in multi-step problem solving. It is effective for tasks that require "thinking" before responding, such as logistical planning or complex data extraction where the model must navigate contradictory information within the context.
Mistral Small 3 24B for coding is a standout use case. It handles Python, JavaScript, C++, and Rust with high proficiency. Because of its 128k context, you can feed it multiple files from a local repository to perform refactoring or bug hunting. It follows system prompts with higher fidelity than 8B models, making it less likely to "hallucinate" library syntax that doesn't exist.
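One way to sketch that multi-file workflow: the hypothetical helper below concatenates source files into a single prompt, stopping before a rough token budget is exceeded. The 4-characters-per-token ratio is a rule of thumb, not a real tokenizer count:

```python
from pathlib import Path

# Hypothetical helper: pack several source files into one prompt so the
# model sees related code together. Token counts are crude estimates.
def pack_files(named_texts, budget_tokens=100_000):
    parts, used = [], 0
    for name, text in named_texts:
        est = len(text) // 4  # ~4 chars per token, rule of thumb
        if used + est > budget_tokens:
            break
        parts.append(f"### FILE: {name}\n{text}")
        used += est
    return "\n\n".join(parts), used

def pack_repo(root, exts=(".py", ".js", ".cpp", ".rs")):
    files = ((str(p), p.read_text(errors="ignore"))
             for p in sorted(Path(root).rglob("*"))
             if p.suffix in exts and p.is_file())
    return pack_files(files)
```

Feeding the packed string as one prompt keeps related files adjacent in context, which tends to help with cross-file refactoring and bug-hunting questions.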
The model is natively multilingual, supporting English, French, German, Spanish, Italian, Chinese, and several other languages. Its instruction-following capabilities are tuned for precision; if you provide a strict schema for a function call, the model adheres to it without the "conversational fluff" that plagues less-optimized models.
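As an illustration of schema-driven tool use, the snippet below builds a function-calling request body in the shape Ollama's `/api/chat` endpoint accepts; the `get_weather` tool is a made-up example, not part of the model or this article:

```python
import json

# Build a function-calling request for a local Ollama server.
# The get_weather tool is a HYPOTHETICAL example schema.
def build_tool_request(user_message, model="mistral-small:24b"):
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {  # standard JSON Schema object
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "stream": False,
    }

body = json.dumps(build_tool_request("What's the weather in Paris?"))
```

A well-tuned model answers with a `tool_calls` entry whose arguments conform to the declared schema, which is what makes this weight class viable for agent loops.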
The Mistral Small 3 24B hardware requirements are the most critical factor for local deployment. This model was seemingly designed with the 24GB VRAM buffer in mind—the exact capacity of the NVIDIA RTX 3090 and 4090.
To run a 24B model on consumer GPU hardware, quantization is mandatory. In full 16-bit precision (FP16), the weights alone occupy roughly 48 GB of VRAM, which would necessitate a dual-GPU setup or an enterprise card. In practice, most practitioners use 4-bit or 8-bit quantized versions.
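The arithmetic behind those figures is simple: multiply the parameter count by bits per weight and divide by 8. The sketch below uses ~4.85 bits/weight for Q4_K_M, an approximation; real GGUF files run slightly larger because some tensors stay at higher precision and quantization stores per-block scales:

```python
# Back-of-the-envelope weight-memory estimate: params x bits / 8.
def weight_gb(params_billion=24.0, bits_per_weight=16.0):
    return params_billion * bits_per_weight / 8  # decimal GB

print(f"FP16:   ~{weight_gb():.0f} GB")                      # ~48 GB
print(f"Q4_K_M: ~{weight_gb(bits_per_weight=4.85):.1f} GB")  # ~14.5 GB
```

Remember to budget extra headroom beyond the weights for the KV cache and runtime overhead.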
The best GPU for Mistral Small 3 24B is the NVIDIA RTX 4090. With 24GB of VRAM, you can run a Q4_K_M or Q5_K_M quantization while maintaining a 32k+ token context window entirely in VRAM.
For Mac users, an M2/M3/M4 Max with at least 32GB of Unified Memory provides an excellent experience. While the Mistral Small 3 24B tokens per second will be lower on Apple Silicon than on an RTX 4090, the ability to scale to the full 128k context window using 64GB+ of RAM is a significant advantage.
The fastest way to test this model is via Ollama. Once installed, run:
ollama run mistral-small:24b
This will pull the default 4-bit quantized version, which is optimized for most modern setups.
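Ollama also exposes a local REST API on port 11434, which is handy for scripting. A minimal sketch, assuming the daemon is running and the model has already been pulled:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_body(prompt, model="mistral-small:24b"):
    # stream=False returns a single JSON object instead of a token stream.
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()

def generate(prompt, url=OLLAMA_URL):
    req = request.Request(url, data=build_body(prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server):
# print(generate("Explain grouped-query attention in one sentence."))
```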
When evaluating this as your mid-size local model for 2025, it is helpful to compare it against its closest rivals in the "medium-weight" category.
The Llama 3.1 8B is faster and runs on almost any modern laptop. However, Mistral Small 24B is significantly more "intelligent." It is less prone to repetition, better at following complex logic, and far superior at function calling. If your hardware can handle 16GB+ of VRAM, the 24B model is a definitive upgrade.
Google's Gemma 2 27B is a formidable competitor. Gemma 2 often scores slightly higher on creative writing and general-knowledge benchmarks. However, Mistral Small 3 24B generally wins on technical tasks, coding, and tool use. Additionally, Mistral's Apache 2.0 license is more permissive for developers than the Gemma terms of use.
Qwen 2.5 32B is larger and arguably the king of benchmarks in this size class. However, the 32B size makes it much harder to fit on a single 24GB GPU once you account for the KV cache. Mistral Small 3 24B is "right-sized" for the 4090, allowing for higher throughput and longer context than the 32B Qwen on the same hardware.