
Mistral's groundbreaking sparse MoE model with 46.7B total / 12.9B active parameters. Set the standard for efficient MoE architecture in open models. Apache 2.0.
Copy and paste this command to start running the model locally.
`ollama run mixtral:8x7b`
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 8.7 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 11.4 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 12.7 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 14.2 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 17.4 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 29.7 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Speed | VRAM Used |
| :--- | :--- | :--- |
| Google Cloud TPU v5e | 58.0 tok/s | 11.4 GB |
| Intel Arc A770 16GB | 39.7 tok/s | 11.4 GB |
| NVIDIA L40S | 61.2 tok/s | 11.4 GB |
| NVIDIA A100 SXM4 80GB | 144.4 tok/s | 11.4 GB |
| NVIDIA H100 SXM5 80GB | 237.3 tok/s | 11.4 GB |
| Google Cloud TPU v5p | 195.9 tok/s | 11.4 GB |
Mixtral 8x7B Instruct is a high-performance Sparse Mixture-of-Experts (SMoE) language model that redefined the efficiency standards for open-weight AI. Released by Mistral AI under the Apache 2.0 license, it features 46.7B total parameters but only utilizes 12.9B active parameters during inference. This architectural choice allows the model to outperform much larger dense models, such as Llama 2 70B, while maintaining the inference speed and latency characteristics of a significantly smaller model.
For practitioners looking to run Mixtral 8x7B Instruct locally, this model represents a specific tier of hardware commitment: it occupies the middle ground between consumer-grade 7B/13B models and massive 70B+ dense models. Its primary strengths are robust instruction following, high-quality multilingual support (including French, German, Spanish, and Italian), and a generous 32,768-token context window. Whether you are building a local RAG (Retrieval-Augmented Generation) pipeline or a private coding assistant, Mixtral 8x7B Instruct remains a top-tier choice for local deployment in 2025.
The defining characteristic of Mixtral 8x7B Instruct is its Sparse Mixture-of-Experts architecture. Unlike dense models where every parameter is activated for every token generated, Mixtral utilizes a router mechanism to select two "experts" (sub-networks) out of eight for each token. This results in 46.7B total parameters existing in memory, but only 12.9B parameters being "active" for the actual computation.
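To make the top-2 gating concrete, here is a toy router sketch in Python; the logit values and the softmax-over-the-top-two renormalization are simplified assumptions for illustration, not Mixtral's actual trained router:

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts for one token and
    renormalize their gate weights with a softmax over just those
    two (a simplified sketch of Mixtral-style gating)."""
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    exps = [math.exp(logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

# Hypothetical router logits for one token across the 8 experts:
gates = top2_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
# Experts 1 and 4 win; their gate weights sum to 1.
```

Each token's output is then the gate-weighted sum of the two selected experts' outputs, which is why only a fraction of the weights do work per token.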
The Mixtral 8x7B Instruct MoE efficiency provides a dual-edged sword for local deployment. Because the model has 46.7B total parameters, your hardware must have enough VRAM to store the entire weights set. However, because only 12.9B parameters are active per token, the Mixtral 8x7B Instruct tokens per second (t/s) will be much higher than a dense 46B or 70B model. Essentially, you pay the "VRAM tax" of a large model but reap the "speed benefits" of a medium model.
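The "speed benefit" can be approximated from the parameter counts alone. This sketch assumes a compute-bound setting and the rough rule of thumb of ~2 FLOPs per parameter per generated token; real throughput also depends heavily on memory bandwidth:

```python
total_params = 46.7e9   # all experts must be resident in VRAM
active_params = 12.9e9  # parameters actually exercised per token

# Rough rule of thumb: ~2 FLOPs per parameter per generated token
dense_flops_per_token = 2 * total_params
moe_flops_per_token = 2 * active_params

compute_speedup = dense_flops_per_token / moe_flops_per_token  # ≈ 3.6x
```

So against a hypothetical dense 46.7B model, Mixtral does roughly 3.6× less compute per token, even though both occupy the same VRAM.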
Mixtral 8x7B Instruct is a versatile general-purpose model, but its specific tuning makes it particularly effective for technical and structured tasks.
The model demonstrates high proficiency in code generation, debugging, and explanation. It handles Python, JavaScript, C++, and Rust with a level of logic that rivals larger proprietary models. Because of its 32k context window, you can feed it multiple files or entire documentation sets to provide context for complex refactoring tasks.
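Because the 32k window is the binding constraint when feeding whole files, a quick pre-flight estimate helps. This sketch assumes the common ~4-characters-per-token heuristic; the real tokenizer count will differ, so leave headroom:

```python
def fits_context(files_text, max_tokens=32_768, chars_per_token=4):
    """Crude pre-flight check before stuffing files into the prompt.
    ~4 chars/token is a rough heuristic, not a tokenizer count."""
    estimated = sum(len(t) for t in files_text) // chars_per_token
    return estimated, estimated <= max_tokens

# Two hypothetical source files, 40k and 24k characters long:
tokens, ok = fits_context(["x" * 40_000, "y" * 24_000])
```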
On the Mixtral 8x7B Instruct reasoning benchmark, the model consistently scores near the top of its weight class. It excels at multi-step logic problems and following complex, system-prompt-constrained instructions. This makes it an ideal engine for local agents that need to parse JSON, call tools, or maintain a specific persona without drifting.
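For example, a local agent can force parseable output through Ollama's JSON mode. The extraction prompt below is a hypothetical example, and this sketch only builds the request payload; you would POST it to a running Ollama server at `localhost:11434/api/generate`:

```python
import json

# "format": "json" asks Ollama to constrain the model's reply to
# valid JSON that an agent can parse without regex heuristics.
payload = {
    "model": "mixtral:8x7b",
    "prompt": "Extract city and date from: 'Meet me in Lyon on 2026-03-14.' "
              "Reply as JSON with keys 'city' and 'date'.",
    "format": "json",
    "stream": False,
}
body = json.dumps(payload)
```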
While many open models are English-centric, Mistral AI optimized this model for European languages. It maintains high linguistic nuance in French, German, Spanish, and Italian, making it the preferred choice for localized applications where Llama-based models might struggle with grammar or cultural context.
To run Mixtral 8x7B Instruct locally, the primary bottleneck is VRAM. Because the model must be loaded entirely into memory to avoid the massive performance hit of system RAM offloading, your GPU configuration is critical.
The best GPU for Mixtral 8x7B Instruct depends on your budget and desired precision.
For most setups, target Q4_K_M or Q5_K_M, with room left for a large KV cache (context).

Quantization reduces the bit-depth of the model weights to save space. For Mixtral, we recommend the following:
| Quantization | VRAM Required | Performance Impact | Recommended Use Case |
| :--- | :--- | :--- | :--- |
| Q2_K | ~15.5 GB | Significant | Only if limited to 16GB VRAM |
| Q3_K_M | ~20.2 GB | Moderate | Single 24GB GPU (RTX 4090/3090) |
| Q4_K_M | ~26.4 GB | Minimal | Dual GPU or Mac (64GB+ RAM) |
| Q5_K_M | ~31.2 GB | Negligible | Dual GPU or Mac (64GB+ RAM) |
| Q8_0 | ~49.5 GB | None | Professional Workstations (A6000/Dual A100) |
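These file sizes follow directly from bits-per-weight. Here is a rough estimator; the bits-per-weight figures are assumed averages for each GGUF scheme, and real files vary by a few hundred MB:

```python
def gguf_size_gb(total_params, bits_per_weight):
    """Approximate GGUF file size: total weights stored at ~bpw bits."""
    return total_params * bits_per_weight / 8 / 1e9

# Assumed average bits-per-weight for common k-quant schemes
BPW = {"Q4_K_M": 4.5, "Q5_K_M": 5.34, "Q8_0": 8.5}
sizes = {q: round(gguf_size_gb(46.7e9, b), 1) for q, b in BPW.items()}
# Q8_0 comes out to roughly 49.6 GB, in line with the table above
```

Add a few GB on top of the file size for the KV cache and runtime overhead when budgeting VRAM.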
On a dual RTX 3090 setup using llama.cpp or ExLlamaV2, you can expect between 20 and 40 tokens per second at 4-bit quantization. On Apple M3 Max hardware, performance typically ranges from 15 to 25 tokens per second. If you want the quickest, most user-friendly way to run a 46.7B-parameter model on consumer hardware, use Ollama: running `ollama run mixtral` pulls a 4-bit quantized version that automatically manages memory allocation.
When evaluating Mixtral 8x7B Instruct, it is most often compared to Llama 3 70B and Command R.
Llama 3 70B is a newer, dense model. In terms of raw "intelligence" and world knowledge, Llama 3 70B often takes the lead in benchmarks. However, Mixtral 8x7B Instruct is significantly faster to run locally because of its 12.9B active parameters. If your workflow requires high throughput (e.g., processing hundreds of documents), Mixtral is the more efficient choice. If you need the absolute highest reasoning capability and have the 40GB+ of VRAM to spare for a quantized 70B model, Llama 3 is the alternative.
Command R is specifically optimized for RAG and long-context tasks. While Mixtral has a 32k context window, Command R supports up to 128k. However, Mixtral 8x7B Instruct generally performs better as a general-purpose chat and coding assistant. Mixtral’s Apache 2.0 license is also more permissive than the licenses attached to many versions of Command R, making it the safer choice for commercial local deployments.
You should choose Mixtral 8x7B Instruct if you have between 24GB and 48GB of VRAM and need a model that feels "snappy" during interactive chat. It remains the gold standard for MoE architecture, offering a level of balance between memory footprint, inference speed, and reasoning depth that few models have matched since its release.