Mistral AI

Mistral Medium 3.5

A dense 128B flagship model from Mistral AI, merging instruction-following, reasoning, and coding capabilities. Supports a 256k context window, vision inputs, and configurable reasoning effort.

128B params · Dense · 262K ctx · Text + Vision


Our Take

Best for: Strongest at real GitHub issue solving (SWE-Verified) in its size class

A 128B-parameter dense language model from Mistral AI. Its standout result is real GitHub issue solving (SWE-Bench Verified, 77.6%), so reach for it when that is the dimension that matters.

Run this on: Apple M3 Ultra (32-core CPU, 80-core GPU). Cheapest card in our directory with comfortable headroom (512 GB) for this model at Q4 (~343.2 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Vision

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters: 128B

Architecture: Dense

Context Length: 262K tokens

Modality: Text + Vision

Provider: Mistral AI

Download Size: 267.2 GB

Community

Monthly Downloads: 309.5K

Likes: 388

Last Updated: 2 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Modified MIT License

Performance & Scoring

Benchmarks

GPQA: 74.8

SWE-Verified: 77.6

HLE: 12.8

AA Intelligence Index: 29.9

39.6

33.3

68.8

61.0

MBA Open Score

43.9 (grade C)

Component	Weight	Score
Benchmark	40%	49.7
Popularity	25%	39.2
Efficiency	20%	5.6
Versatility	15%	87.5
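The overall score appears to be a plain weighted sum of the four components. The sketch below checks that reading; the aggregation formula itself is an assumption, since the page lists only the weights and component scores.

```python
# Component weights and scores as listed on this page.
COMPONENTS = {
    "Benchmark": (0.40, 49.7),
    "Popularity": (0.25, 39.2),
    "Efficiency": (0.20, 5.6),
    "Versatility": (0.15, 87.5),
}


def weighted_score(components):
    """Weight-averaged score, ASSUMING a simple weighted sum (not confirmed by the page)."""
    total_weight = sum(weight for weight, _ in components.values())
    return sum(weight * score for weight, score in components.values()) / total_weight


score = weighted_score(COMPONENTS)
print(f"MBA Open Score (reconstructed): {score:.3f}")
```

Under that assumption the result lands at about 43.9, matching the headline figure, and it shows how strongly the low Efficiency component pulls the total down.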

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality	Notes
Q2_K	316.3 GB	Low	Aggressive quantization: smallest size, noticeable quality loss
Q4_K_M (Recommended)	343.2 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	356.0 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	371.3 GB	Excellent	Near-lossless quality with manageable size
Q8_0	403.3 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	524.9 GB	Full	Full 16-bit floating point: maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Speed	VRAM Required
ASUS ExpertCenter Pro ET900N G3	ASUS	A	16.7 tok/s	343.2 GB
Dell Pro Max with GB300	Dell	A	16.7 tok/s	343.2 GB
HP ZGX Fury AI Station	HP	A	16.7 tok/s	343.2 GB
MSI XpertStation WS300	MSI	A	16.7 tok/s	343.2 GB
SuperMicro Super AI Station	SuperMicro	A	16.7 tok/s	343.2 GB
Gigabyte W775-V10-L01	Gigabyte	A	16.7 tok/s	343.2 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	B	1.9 tok/s	343.2 GB
Apple Mac Studio (M3 Ultra, 2025)	Apple	B	1.9 tok/s	343.2 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	F	1.2 tok/s	343.2 GB
Acer Veriton GN100 AI Mini	Acer	F	0.6 tok/s	343.2 GB
AMD Instinct MI300X	AMD	F	12.4 tok/s	343.2 GB
AMD Instinct MI325X	AMD	F	14.1 tok/s	343.2 GB
AMD Instinct MI355X	AMD	F	18.8 tok/s	343.2 GB
AMD Radeon RX 7600 8GB	AMD	F	0.7 tok/s	343.2 GB
AMD Radeon RX 7700 XT	AMD	F	1.0 tok/s	343.2 GB
AMD Radeon RX 7800 XT	AMD	F	1.5 tok/s	343.2 GB
AMD Radeon RX 7900 XT	AMD	F	1.9 tok/s	343.2 GB
AMD Radeon RX 7900 XTX	AMD	F	2.3 tok/s	343.2 GB
AMD Radeon RX 9070	AMD	F	1.5 tok/s	343.2 GB
AMD Radeon RX 9070 XT	AMD	F	1.5 tok/s	343.2 GB
Apple M4	Apple	F	0.3 tok/s	343.2 GB
Apple M4 Max (40-core GPU)	Apple	F	1.3 tok/s	343.2 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	F	0.6 tok/s	343.2 GB
Apple M5	Apple	F	0.4 tok/s	343.2 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	F	1.4 tok/s	343.2 GB

Showing the first 25 of 102 devices.

Run Locally vs API

Energy cost on Apple M3 Ultra (32-core CPU, 80-core GPU) (~1.9 tok/s, Q4_K_M) vs flagship API pricing.

Source	Details	Cost per 1M tokens
Local (energy only)	Mistral Medium 3.5 on Apple M3 Ultra (32-core CPU, 80-core GPU) · ~1.9 tok/s · 160W	$2.78
GPT-5.5	OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 Thinking	Anthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 Flash	Google · in $1.50 · out $9.00	$3.75
Grok 4.3	xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.
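The blended figures follow directly from the stated 70/30 split. A minimal sketch of that arithmetic, using the per-million input and output prices from the table (the function name is illustrative):

```python
def blended_price(input_usd, output_usd, input_share=0.70):
    """Blend per-1M-token input and output prices at the given input share."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


# (input, output) prices in USD per 1M tokens, copied from the table above.
API_PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7 Thinking": (5.00, 25.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Grok 4.3": (1.25, 2.50),
}

blended = {name: blended_price(i, o) for name, (i, o) in API_PRICES.items()}
for name, price in blended.items():
    print(f"{name}: ${price:.3f} per 1M tokens")
```

The Grok 4.3 value works out to 1.625, which the table shows rounded up to $1.63.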

Hardware amortisation not included. Run the full ROI calculator for payback math.
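The local energy figure can be approximated from throughput and power draw. The sketch below assumes an electricity rate of $0.12 per kWh, which is not stated on this page; with the listed ~1.9 tok/s and 160W it gives roughly $2.81, close to the $2.78 shown (the gap is plausibly due to the throughput being rounded for display).

```python
def energy_cost_per_million_tokens(tokens_per_second, watts, usd_per_kwh):
    """Electricity cost in USD to generate 1M tokens at a steady rate and power draw."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3_600_000  # watt-seconds to kilowatt-hours
    return kwh * usd_per_kwh


# 1.9 tok/s and 160 W come from the table; 0.12 USD/kWh is an ASSUMED rate.
local_cost = energy_cost_per_million_tokens(1.9, 160, 0.12)
print(f"Local energy cost: ${local_cost:.2f} per 1M tokens")
```

At 1.9 tok/s, one million tokens takes about 146 hours of continuous generation, which matters as much as the electricity bill when comparing against an API.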


Rent in the Cloud

Cheapest current cloud rentals with at least 343 GB VRAM, refreshed hourly.

No current rental listing covers this model’s VRAM requirement on the providers we track.

About This Model

Mistral Medium 3.5 is Mistral AI’s first flagship merged model — a dense 128B-parameter transformer that unifies instruction-following, reasoning, coding, and vision into a single set of weights. Released in April 2026 under a Modified MIT license, it replaces Mistral Medium 3.1 and Magistral in Mistral’s Le Chat product, and Devstral 2 in their Vibe coding agent. The model targets developers who need one capable local model for chat, agentic workflows, code generation, and multimodal analysis, without juggling separate specialized checkpoints.

At 128B parameters, Mistral Medium 3.5 competes with other large dense models (e.g., Llama 3.1 70B and 405B, Qwen 2.5 72B) and with larger MoE models like Mixtral 8x22B (141B total parameters). What sets it apart is its configurable reasoning effort: you can ask for a quick reply or let the model think harder on complex tasks, all from the same weights. It scores 77.6% on SWE-Bench Verified and handles a 262,144-token context window, making it a serious option for local coding assistants, long-document analysis, and self-hosted agent pipelines.

Architecture & Technical Details

Mistral Medium 3.5 uses a dense transformer architecture with 128B parameters. Unlike mixture-of-experts (MoE) designs where only a subset of parameters activate per token, this dense model uses all parameters for every forward pass. That makes resource use predictable: you need enough GPU memory to hold the full model at a given quantization, and because every weight is read for each generated token, decode speed scales mainly with memory bandwidth.
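A common rule of thumb for dense models is that single-stream decoding is bounded by how fast the weights can be streamed from memory: tokens per second is at most memory bandwidth divided by the size of the weights. The sketch below applies that bound; the 1000 GB/s bandwidth and the efficiency factor are hypothetical inputs, not measurements of any device on this page.

```python
def decode_tokens_per_second(weights_gb, bandwidth_gb_s, efficiency=1.0):
    """Upper-bound estimate of decode speed for a dense model.

    Assumes every weight is read once per generated token, so throughput is
    limited by memory bandwidth. `efficiency` (0..1) is a fudge factor for
    real-world losses and is an assumption, not a measured value.
    """
    if weights_gb <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("weights_gb and bandwidth_gb_s must be positive")
    return efficiency * bandwidth_gb_s / weights_gb


# Hypothetical example: 72 GB of quantized weights on a 1000 GB/s device.
estimate = decode_tokens_per_second(weights_gb=72, bandwidth_gb_s=1000)
print(f"Upper bound: {estimate:.1f} tok/s")
```

The bound ignores prompt processing, KV-cache reads, and multi-GPU communication, so measured speeds will be lower.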

The context window is 262,144 tokens (256 × 1,024, i.e. 256K). This enables processing entire codebases, full-length books, or multi-turn agentic sessions without truncation. The model also includes a dedicated vision encoder trained from scratch to handle variable image sizes and aspect ratios, accepting images alongside text input.

The architecture supports:

Configurable reasoning effort via a reasoning_effort parameter ("none" or "high"). When set to "high", the model applies additional test-time compute, boosting performance on math, logic, and agentic tasks.
Native function calling and JSON output, essential for agentic workflows.
Multilingual support across 20+ languages including English, French, German, Spanish, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic, Hindi, Bengali, and others.

Recommended sampling parameters from Mistral: temperature 0.7 with reasoning_effort="high", or 0.0–0.7 with reasoning_effort="none"; top_p 0.95 for reasoning, leave as 1.0 otherwise.
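The recommendation above can be encoded as a small helper so the two modes are not mixed up. This is a sketch: the dictionary keys mirror the parameter names in the text, but how they are passed to a given server or SDK is an assumption, and the default temperature of 0.0 for the non-reasoning mode is an arbitrary pick from the recommended 0.0–0.7 range.

```python
def sampling_params(reasoning_effort="none", temperature=None):
    """Return sampling settings following the recommendations quoted above."""
    if reasoning_effort == "high":
        if temperature is None:
            temperature = 0.7
        if temperature != 0.7:
            raise ValueError("temperature 0.7 is recommended with reasoning_effort='high'")
        return {"reasoning_effort": "high", "temperature": 0.7, "top_p": 0.95}
    if reasoning_effort == "none":
        if temperature is None:
            temperature = 0.0  # arbitrary default within the recommended range
        if not 0.0 <= temperature <= 0.7:
            raise ValueError("temperature should be within 0.0-0.7 with reasoning_effort='none'")
        return {"reasoning_effort": "none", "temperature": temperature, "top_p": 1.0}
    raise ValueError("reasoning_effort must be 'none' or 'high'")


fast = sampling_params("none", temperature=0.3)
deep = sampling_params("high")
print(fast)
print(deep)
```

Merging the returned dictionary into a request body keeps the reasoning mode and its sampling settings in sync.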

Capabilities & Use Cases

Mistral Medium 3.5 excels in four primary areas:

Coding & Agentic Tasks – With a SWE-Bench Verified score of 77.6% (close to Claude Sonnet 4.6’s 79.6%), it can resolve real-world GitHub issues, generate pull requests, and refactor codebases. Its strong function-calling and JSON output support make it suitable for building autonomous coding agents that interact with tools, APIs, and repositories.
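On the host side, function calling comes down to parsing the model's JSON output and routing it to real code. The sketch below shows that dispatch step only. The JSON shape (a "name" plus an "arguments" object) is an assumed convention, not Mistral's documented wire format, and count_lines is a hypothetical tool.

```python
import json


def count_lines(text: str) -> int:
    """Hypothetical tool: count the lines in a piece of text."""
    return len(text.splitlines())


# Registry of tools the agent is allowed to call.
TOOLS = {"count_lines": count_lines}


def dispatch_tool_call(raw_json: str):
    """Parse a JSON tool call and invoke the matching registered function."""
    call = json.loads(raw_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# A tool call as the model might emit it (shape is an assumption).
model_output = '{"name": "count_lines", "arguments": {"text": "a\\nb\\nc"}}'
result = dispatch_tool_call(model_output)
print(result)
```

Rejecting unknown tool names before calling anything is the important design choice here: the model's output is untrusted input.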

Reasoning & Math – Configurable reasoning effort lets you push the model on multi-step problems. For complex agentic runs or math puzzles, setting reasoning_effort="high" yields better results at the cost of slower generation.

Vision & Multimodal Understanding – The model can analyze images, extract information from diagrams, screenshots, or documents, and answer questions about visual content. This is useful for tasks like UI-to-code conversion, document parsing, or captioning.

General Instruction-Following & Chat – As a unified flagship, it handles open-ended chat, system prompts, and structured instructions reliably. The large context window supports extended conversations and document-grounded Q&A.

Multilingual support covers major European, Asian, and Middle Eastern languages, making it viable for international deployments.

Running Mistral Medium 3.5 Locally

A 128B dense model is demanding — expect substantial hardware requirements. Here’s what you need to know for local inference.

VRAM Requirements

Quantization	Approximate VRAM Needed	Typical GPU Configuration
FP16 (full precision)	~256 GB	4–8 H100 (80GB each) or 2–4 H200 (141GB each)
Q8_0 (8-bit)	~135 GB	2–3 H100/H200, or 4 RTX 6000 Ada (48GB each)
Q4_K_M (4-bit)	~72 GB	1 H100 (80GB), 2 RTX 6000 Ada (48GB each), or 4 RTX 4090 (24GB each)
Q4_K_S (4-bit small)	~68 GB	1 A100 (80GB) or 4 RTX 4090 (24GB each)
Q3_K_L (3-bit)	~55 GB	3 RTX 4090, or 1 RTX 4090 with most layers offloaded to system RAM
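The table's figures are consistent with a simple weights-only estimate: parameter count times bits per weight, divided by eight. The sketch below reproduces them; the bits-per-weight values are approximations chosen to match the table, not official llama.cpp numbers, and the estimate excludes KV cache and runtime overhead.

```python
PARAMS = 128e9  # parameter count stated for Mistral Medium 3.5

# ASSUMED effective bits per weight for each format (approximate).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.5,
    "Q4_K_S": 4.25,
    "Q3_K_L": 3.44,
}


def weights_gb(params, bits_per_weight):
    """Size of the weights alone in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


estimates = {fmt: weights_gb(PARAMS, bpw) for fmt, bpw in BITS_PER_WEIGHT.items()}
for fmt, gb in estimates.items():
    print(f"{fmt}: ~{gb:.0f} GB")
```

Long contexts add a KV cache on top of these numbers, so the memory actually needed grows with the context length in use.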

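Consumer hardware: At ~72 GB, Q4_K_M does not fit in the 48GB offered by a pair of RTX 4090s or a single RTX 6000 Ada; those setups need part of the model offloaded to system RAM, or a lower-bit quantization. A single RTX 4090 (24GB) needs 3-bit quantization with most layers offloaded to CPU RAM, which significantly slows generation. Apple Silicon machines with comfortably more than 72 GB of unified memory (for example a 128GB configuration) can run Q4_K_M entirely in memory, but expect lower throughput than NVIDIA data-center GPUs. These figures cover the weights only; KV cache and runtime overhead come on top, and the Quantization Options table above lists considerably larger totals.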

Performance Expectations

RTX 4090 (2x, Q4_K_M with partial CPU offload): ~4–8 tokens per second, depending on prompt length and context.
H100 (80GB, Q4_K_M): ~15–25 tokens per second.
Apple Silicon Ultra-class (192GB unified memory, Q4_K_M): ~2–5 tokens per second.

For real-time interaction, a single H100-class card, or several consumer GPUs with enough combined VRAM to hold the quantized weights, is recommended. Inference engines: Ollama (via llama.cpp), vLLM, and SGLang all support Mistral Medium 3.5. Ollama is the quickest way to get started: after downloading the Q4_K_M GGUF, the command is ollama run mistral-medium-3.5.
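Once the model is being served by Ollama, it can also be called from code through Ollama's local REST endpoint. This is a minimal sketch using only the standard library; it assumes a default local Ollama install on port 11434 and reuses the model tag from the command above, which may differ from the tag actually published.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL_TAG = "mistral-medium-3.5"  # tag taken from the text above; may differ in practice


def build_generate_request(prompt, model=MODEL_TAG):
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model=MODEL_TAG, timeout=600):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


request = build_generate_request("Summarize this repository's README.")
print(request.full_url, request.get_method())
```

Calling generate("...") returns the completion as a string. The generous timeout reflects the single-digit tokens-per-second speeds quoted above.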

What You’ll Trade Off

Quantization degrades raw capability, especially on reasoning and long-context tasks. If you need peak performance, FP16 on multiple H100s is ideal. For most local practitioners, Q4_K_M offers the best balance of quality and feasibility.

How It Compares

vs. Llama 3.1 (dense, Meta; 8B, 70B, and 405B): Llama 3.1 is a strong general-purpose text model with excellent multilingual support, but it lacks vision capabilities. Mistral Medium 3.5 matches or exceeds Llama 3.1 on coding benchmarks and provides vision and configurable reasoning. If you need vision or adjustable reasoning effort, Mistral is the better choice. If you prioritize pure text-generation quality and have a strong preference for Meta’s ecosystem, Llama 3.1 remains competitive.

vs. Qwen 2.5 72B (dense, Alibaba): Qwen 2.5 72B is smaller (72B vs 128B) and much easier to run: at 4-bit its weights come to roughly 40 GB, which fits two 24GB consumer GPUs or a single 48GB card. It scores ~72% on SWE-Bench Verified, lower but still strong. It supports function calling; vision requires the separate Qwen2.5-VL variant. Mistral Medium 3.5 outperforms it on coding and reasoning at the cost of requiring significantly more VRAM. Choose Mistral if you have the hardware and need the extra performance; choose Qwen if you need to stay on a smaller consumer setup.

vs. Mixtral 8x22B (MoE, Mistral): Mixtral 8x22B activates ~39B of its 141B parameters per token, so it generates faster on modest hardware (~2–3 tokens/s on a single 4090 at Q4, with the weights that do not fit offloaded to system RAM). However, Mistral Medium 3.5’s dense 128B architecture delivers higher benchmark scores and native vision, making it the more capable model when you have the GPU budget. Mixtral is a pragmatic fallback for limited setups.

