
Mistral's groundbreaking sparse MoE model with 46.7B total / 12.9B active parameters. Set the standard for efficient MoE architecture in open models. Apache 2.0.
Copy and paste this command to start running the model locally.
`ollama run mixtral:8x7b`
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 8.7 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 11.4 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 12.7 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 14.2 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 17.4 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 29.7 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Speed | VRAM Used |
| :--- | :--- | :--- |
| Google Cloud TPU v5e | 58.0 tok/s | 11.4 GB |
| Intel Arc A770 16GB | 39.7 tok/s | 11.4 GB |
| NVIDIA L40S | 61.2 tok/s | 11.4 GB |
| NVIDIA A100 SXM4 80GB | 144.4 tok/s | 11.4 GB |
| NVIDIA H100 SXM5 80GB | 237.3 tok/s | 11.4 GB |
| Google Cloud TPU v5p | 195.9 tok/s | 11.4 GB |
Mixtral 8x7B Instruct is a high-performance Sparse Mixture-of-Experts (SMoE) language model that redefined the efficiency standards for open-weight AI. Released by Mistral AI under the Apache 2.0 license, it features 46.7B total parameters but only utilizes 12.9B active parameters during inference. This architectural choice allows the model to outperform much larger dense models, such as Llama 2 70B, while maintaining the inference speed and latency characteristics of a significantly smaller model.
For practitioners looking to run Mixtral 8x7B Instruct locally, this model represents a specific tier of hardware commitment: it occupies the middle ground between consumer-grade 7B/13B models and massive 70B+ dense models. Its primary strengths are robust instruction following, high-quality multilingual support (including French, German, Spanish, and Italian), and a generous 32,768-token context window. Whether you are building a local RAG (Retrieval-Augmented Generation) pipeline or a private coding assistant, Mixtral 8x7B Instruct remains a top-tier choice for local deployment in 2025.
The defining characteristic of Mixtral 8x7B Instruct is its Sparse Mixture-of-Experts architecture. Unlike dense models where every parameter is activated for every token generated, Mixtral utilizes a router mechanism to select two "experts" (sub-networks) out of eight for each token. This results in 46.7B total parameters existing in memory, but only 12.9B parameters being "active" for the actual computation.
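To make the top-2 gating concrete, here is a toy router sketch in Python; the logit values and the softmax-over-the-top-two renormalization are simplified assumptions for illustration, not Mixtral's actual trained router:

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts for one token and
    renormalize their gate weights with a softmax over just those
    two (a simplified sketch of Mixtral-style gating)."""
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    exps = [math.exp(logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

# Hypothetical router logits for one token across the 8 experts:
gates = top2_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
# Experts 1 and 4 win; their gate weights sum to 1.
```

Each token's output is then the gate-weighted sum of the two selected experts' outputs, which is why only a fraction of the weights do work per token.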
The Mixtral 8x7B Instruct MoE efficiency provides a dual-edged sword for local deployment. Because the model has 46.7B total parameters, your hardware must have enough VRAM to store the entire weights set. However, because only 12.9B parameters are active per token, the Mixtral 8x7B Instruct tokens per second (t/s) will be much higher than a dense 46B or 70B model. Essentially, you pay the "VRAM tax" of a large model but reap the "speed benefits" of a medium model.
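The "speed benefit" can be approximated from the parameter counts alone. This sketch assumes a compute-bound setting and the rough rule of thumb of ~2 FLOPs per parameter per generated token; real throughput also depends heavily on memory bandwidth:

```python
total_params = 46.7e9   # all experts must be resident in VRAM
active_params = 12.9e9  # parameters actually exercised per token

# Rough rule of thumb: ~2 FLOPs per parameter per generated token
dense_flops_per_token = 2 * total_params
moe_flops_per_token = 2 * active_params

compute_speedup = dense_flops_per_token / moe_flops_per_token  # ≈ 3.6x
```

So against a hypothetical dense 46.7B model, Mixtral does roughly 3.6× less compute per token, even though both occupy the same VRAM.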
Mixtral 8x7B Instruct is a versatile general-purpose model, but its specific tuning makes it particularly effective for technical and structured tasks.
The model demonstrates high proficiency in code generation, debugging, and explanation. It handles Python, JavaScript, C++, and Rust with a level of logic that rivals larger proprietary models. Because of its 32k context window, you can feed it multiple files or entire documentation sets to provide context for complex refactoring tasks.
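Because the 32k window is the binding constraint when feeding whole files, a quick pre-flight estimate helps. This sketch assumes the common ~4-characters-per-token heuristic; the real tokenizer count will differ, so leave headroom:

```python
def fits_context(files_text, max_tokens=32_768, chars_per_token=4):
    """Crude pre-flight check before stuffing files into the prompt.
    ~4 chars/token is a rough heuristic, not a tokenizer count."""
    estimated = sum(len(t) for t in files_text) // chars_per_token
    return estimated, estimated <= max_tokens

# Two hypothetical source files, 40k and 24k characters long:
tokens, ok = fits_context(["x" * 40_000, "y" * 24_000])
```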
On the Mixtral 8x7B Instruct reasoning benchmark, the model consistently scores near the top of its weight class. It excels at multi-step logic problems and following complex, system-prompt-constrained instructions. This makes it an ideal engine for local agents that need to parse JSON, call tools, or maintain a specific persona without drifting.
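For example, a local agent can force parseable output through Ollama's JSON mode. The extraction prompt below is a hypothetical example, and this sketch only builds the request payload; you would POST it to a running Ollama server at `localhost:11434/api/generate`:

```python
import json

# "format": "json" asks Ollama to constrain the model's reply to
# valid JSON that an agent can parse without regex heuristics.
payload = {
    "model": "mixtral:8x7b",
    "prompt": "Extract city and date from: 'Meet me in Lyon on 2026-03-14.' "
              "Reply as JSON with keys 'city' and 'date'.",
    "format": "json",
    "stream": False,
}
body = json.dumps(payload)
```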
While many open models are English-centric, Mistral AI optimized this model for European languages. It maintains high linguistic nuance in French, German, Spanish, and Italian, making it the preferred choice for localized applications where Llama-based models might struggle with grammar or cultural context.
To run Mixtral 8x7B Instruct locally, the primary bottleneck is VRAM. Because the model must be loaded entirely into memory to avoid the massive performance hit of system RAM offloading, your GPU configuration is critical.
The best GPU for Mixtral 8x7B Instruct depends on your budget and desired precision.
For most setups, target Q4_K_M or Q5_K_M, with room left for a large KV cache (context).

Quantization reduces the bit-depth of the model weights to save space. For Mixtral, we recommend the following:
| Quantization | VRAM Required | Performance Impact | Recommended Use Case |
| :--- | :--- | :--- | :--- |
| Q2_K | ~15.5 GB | Significant | Only if limited to 16GB VRAM |
| Q3_K_M | ~20.2 GB | Moderate | Single 24GB GPU (RTX 4090/3090) |
| Q4_K_M | ~26.4 GB | Minimal | Dual GPU or Mac (64GB+ RAM) |
| Q5_K_M | ~31.2 GB | Negligible | Dual GPU or Mac (64GB+ RAM) |
| Q8_0 | ~49.5 GB | None | Professional Workstations (A6000/Dual A100) |
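These file sizes follow directly from bits-per-weight. Here is a rough estimator; the bits-per-weight figures are assumed averages for each GGUF scheme, and real files vary by a few hundred MB:

```python
def gguf_size_gb(total_params, bits_per_weight):
    """Approximate GGUF file size: total weights stored at ~bpw bits."""
    return total_params * bits_per_weight / 8 / 1e9

# Assumed average bits-per-weight for common k-quant schemes
BPW = {"Q4_K_M": 4.5, "Q5_K_M": 5.34, "Q8_0": 8.5}
sizes = {q: round(gguf_size_gb(46.7e9, b), 1) for q, b in BPW.items()}
# Q8_0 comes out to roughly 49.6 GB, in line with the table above
```

Add a few GB on top of the file size for the KV cache and runtime overhead when budgeting VRAM.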
On a dual RTX 3090 setup using llama.cpp or ExLlamaV2, you can expect between 20 and 40 tokens per second at 4-bit quantization. On Apple M3 Max hardware, performance typically ranges from 15 to 25 tokens per second. If you want the quickest, most user-friendly way to run a 46.7B-parameter model on consumer hardware, use Ollama: running `ollama run mixtral` pulls a 4-bit quantized version that automatically manages memory allocation.
When evaluating Mixtral 8x7B Instruct, it is most often compared to Llama 3 70B and Command R.
Llama 3 70B is a newer, dense model. In terms of raw "intelligence" and world knowledge, Llama 3 70B often takes the lead in benchmarks. However, Mixtral 8x7B Instruct is significantly faster to run locally because of its 12.9B active parameters. If your workflow requires high throughput (e.g., processing hundreds of documents), Mixtral is the more efficient choice. If you need the absolute highest reasoning capability and have the 40GB+ of VRAM to spare for a quantized 70B model, Llama 3 is the alternative.
Command R is specifically optimized for RAG and long-context tasks. While Mixtral has a 32k context window, Command R supports up to 128k. However, Mixtral 8x7B Instruct generally performs better as a general-purpose chat and coding assistant. Mixtral’s Apache 2.0 license is also more permissive than the licenses attached to many versions of Command R, making it the safer choice for commercial local deployments.
You should choose Mixtral 8x7B Instruct if you have between 24GB and 48GB of VRAM and need a model that feels "snappy" during interactive chat. It remains the gold standard for MoE architecture, offering a level of balance between memory footprint, inference speed, and reasoning depth that few models have matched since its release.