
Mistral AI's first model. 7B dense model that outperformed Llama 2 13B on all benchmarks at release. Uses sliding window attention. Apache 2.0 licensed.
Copy and paste this command to start running the model locally:

```shell
ollama run mistral:7b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 4.9 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 6.4 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 7.1 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 7.9 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 9.7 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 16.3 GB | Full | Full 16-bit floating point — maximum quality, largest size |
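The sizes above follow directly from bits-per-weight arithmetic. A rough sketch, assuming Mistral 7B's ~7.24 billion parameters and an average of ~4.8 bits per weight for Q4_K_M (a GGUF format that mixes 4- and 6-bit blocks; both figures are approximations):

```python
def model_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights alone."""
    return n_params * bits_per_weight / 8

# Assumptions: ~7.24e9 parameters; Q4_K_M averages ~4.8 bits/weight.
q4_gb = model_bytes(7.24e9, 4.8) / 1e9   # ~4.3 GB of weights
fp16_gb = model_bytes(7.24e9, 16) / 1e9  # ~14.5 GB of weights
```

The gap between these weight sizes and the VRAM figures in the table is taken up by the KV cache, activations, and runtime overhead.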
See which devices can run this model and at what quality level.
| Device | Throughput (Q4_K_M) | VRAM Used |
|---|---|---|
| NVIDIA GeForce RTX 4060 | 34.2 tok/s | 6.4 GB |
| Intel Arc B580 | 57.4 tok/s | 6.4 GB |
| NVIDIA GeForce RTX 4070 | 63.4 tok/s | 6.4 GB |
| Intel Arc A770 16GB | 70.5 tok/s | 6.4 GB |
| NVIDIA GeForce RTX 5070 | 84.6 tok/s | 6.4 GB |
| Google Cloud TPU v5e | 103.1 tok/s | 6.4 GB |
Mistral 7B Instruct is the foundational model from Mistral AI that redefined the performance expectations for small-scale language models. Released under the Apache 2.0 license, it is a 7-billion parameter dense model designed to punch significantly above its weight class, outperforming Llama 2 13B across every major benchmark at its release. For practitioners looking to run Mistral 7B Instruct locally, it represents the "gold standard" for the 7B parameter class, offering a high-efficiency balance of reasoning, coding proficiency, and instruction-following.
While newer models have entered the space, Mistral 7B Instruct remains a staple for local deployment due to its mature ecosystem support and predictable hardware footprint. It is the ideal candidate for developers who need a reliable, low-latency model for edge devices, personal workstations, or private servers where VRAM is a finite resource.
Mistral 7B Instruct's performance is driven by a standard transformer architecture enhanced by two specific technical optimizations: Grouped-Query Attention (GQA) and Sliding Window Attention (SWA).
GQA significantly speeds up inference by reducing the memory overhead of the KV cache, which is critical for sustaining high tokens-per-second throughput during long-form generation. SWA allows the model to handle a context length of 32,768 tokens by attending only to a fixed window of previous tokens in each layer. This reduces attention's computational complexity from $O(n^2)$ to $O(n)$ (linear in sequence length for a fixed window), making it possible to process long documents without the quadratic increase in compute that full attention requires.
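The GQA saving can be made concrete with a back-of-envelope KV-cache calculation. The architecture numbers below are from the published Mistral 7B config (32 layers, 8 KV heads under GQA, head dimension 128); an fp16 cache is assumed:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Total bytes for the K and V caches across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

gqa = kv_cache_bytes(32_768)                 # 8 KV heads, as Mistral ships
mha = kv_cache_bytes(32_768, n_kv_heads=32)  # hypothetical full multi-head cache

print(f"GQA KV cache @32k: {gqa / 2**30:.1f} GiB")  # 4.0 GiB
print(f"MHA KV cache @32k: {mha / 2**30:.1f} GiB")  # 16.0 GiB
```

Sharing each KV head across four query heads cuts the cache to a quarter of what full multi-head attention would need at the same context length.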
The model is "dense," meaning all 7 billion parameters are active during every forward pass. Unlike Mixture of Experts (MoE) architectures, which may have higher total parameter counts but fewer active parameters per token, Mistral 7B provides consistent, predictable throughput on consumer-grade silicon.
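That predictability follows from single-stream decoding being memory-bandwidth bound: every weight is read once per generated token, so throughput is roughly bandwidth divided by model size. A sketch, where the ~4.4 GB Q4_K_M file size and the RTX 4070's ~504 GB/s bandwidth are assumptions used for illustration:

```python
def decode_tokens_per_sec(model_bytes, mem_bandwidth_gbs):
    """Rough upper bound for single-stream decoding: each token requires
    one full read of the weights, so tok/s <= bandwidth / model size."""
    return mem_bandwidth_gbs * 1e9 / model_bytes

# Assumed figures: ~4.4 GB Q4_K_M weights, ~504 GB/s (RTX 4070-class GPU).
ceiling = decode_tokens_per_sec(4.4e9, 504)  # ~115 tok/s theoretical ceiling
```

Measured throughput lands below this ceiling (the benchmark table above shows ~63 tok/s on an RTX 4070) because compute, cache reads, and kernel overhead also cost time, but the bound explains why dense-model speed scales so cleanly with memory bandwidth.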
Mistral 7B Instruct is fine-tuned specifically for conversational AI and task execution. Its instruction-following capabilities make it a versatile tool for local automation and private data processing.
Despite its small size, the model is surprisingly capable of generating, debugging, and explaining code. It excels at Python, JavaScript, and C++ tasks. Practitioners often use it as a local backend for VS Code extensions (like Continue or Tabby) to avoid sending proprietary code to cloud APIs. It is particularly effective for generating boilerplate, writing unit tests, and refactoring small functions.
The model handles multiple languages with high proficiency, making it suitable for local translation tasks or multilingual sentiment analysis. Its 32k context window allows for "needle-in-a-haystack" retrieval tasks, such as summarizing long technical manuals or analyzing multi-page legal documents, provided the hardware can support the KV cache requirements for those lengths.
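Those KV cache requirements are bounded by SWA itself: with the rolling-buffer cache described in the Mistral 7B paper, the cache holds at most one window of tokens per layer. A sketch assuming Mistral's published 4,096-token window and the same fp16 cache parameters as above:

```python
def kv_cache_gib(seq_len, window=4096, rolling=True,
                 n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """fp16 KV-cache size in GiB; a rolling buffer caps cached tokens at the window."""
    tokens = min(seq_len, window) if rolling else seq_len
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * tokens / 2**30

full = kv_cache_gib(32_768, rolling=False)  # 4.0 GiB without the cap
capped = kv_cache_gib(32_768)               # 0.5 GiB with the rolling buffer
```

Information from tokens outside the window still propagates forward through the stacked layers, so the model retains long-range context while the cache stays a fixed size.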
Because it follows instructions precisely, it is frequently used to transform unstructured text into JSON or Markdown. This makes it a core component in local RAG (Retrieval-Augmented Generation) pipelines, where it serves as the reasoning engine that synthesizes retrieved context into a final answer.
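A minimal sketch of the prompt-assembly step in such a pipeline, using Mistral's `[INST] ... [/INST]` instruct template; the key names and chunk-numbering scheme are illustrative, not a fixed API:

```python
def build_rag_prompt(question, chunks):
    """Assemble retrieved chunks and a question into an instruct-style
    prompt that asks for a JSON reply (schema here is illustrative)."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "[INST] Use only the context below to answer, and reply as JSON "
        'with keys "answer" and "sources" (a list of chunk numbers).\n\n'
        f"Context:\n{context}\n\nQuestion: {question} [/INST]"
    )

prompt = build_rag_prompt("What license is Mistral 7B under?",
                          ["Mistral 7B is released under Apache 2.0."])
```

Numbering the chunks lets the model cite which passage supported its answer, which makes the JSON output easy to verify downstream.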
To successfully run Mistral 7B Instruct locally, the primary bottleneck is VRAM. The model's footprint depends entirely on the quantization level used. Quantization reduces the precision of the model weights (e.g., from 16-bit to 4-bit), significantly lowering the hardware requirements with minimal loss in perplexity.
The quickest way to get started is via Ollama. After installing Ollama, you can run the model with a single command:
```shell
ollama run mistral
```
By default this pulls a 4-bit quantized build and memory-maps it, so the model loads on most modern machines without manual configuration.
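Beyond the CLI, Ollama also exposes a local REST API (default port 11434), which is how editor extensions and RAG pipelines typically talk to the model. A minimal standard-library sketch against the documented `/api/generate` endpoint; setting `stream` to false returns one JSON object instead of a token stream:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt, model="mistral"):
    """Payload for Ollama's /api/generate; stream=False yields a single JSON reply."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="mistral"):
    """POST the prompt to a locally running Ollama server and return the reply text."""
    data = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running: generate("Explain sliding window attention in one sentence.")
```

The call requires an Ollama server already running locally (`ollama serve`, or the desktop app); without it, `generate` raises a connection error.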
Llama 3 8B is a newer release (2024) and generally performs better on logic and creative writing benchmarks. However, Mistral 7B often feels more "concise" and is less prone to the heavy-handed safety guardrails found in Meta's models. Mistral 7B's 32k context window also exceeds Llama 3's base 8k context in scenarios involving long document processing.
Google's Gemma 7B is another strong competitor. While Gemma performs well on mathematical reasoning, Mistral 7B Instruct remains the more popular choice for general-purpose local deployment due to its superior efficiency in GGUF/EXL2 formats and its more permissive Apache 2.0 license compared to Gemma’s custom terms.
As 7B-parameter local models remain relevant in 2025, Mistral 7B Instruct is the baseline against which all other small models are measured. It is the "workhorse" model—reliable, fast, and capable of running on almost any modern consumer machine.