A dense 128B flagship model from Mistral AI, merging instruction-following, reasoning, and coding capabilities. Supports a 256k context window, vision inputs, and configurable reasoning effort.
A workable 128B-parameter dense language model from Mistral AI. Pulls ahead on real GitHub issue solving (SWE-Verified) (78/100), so reach for it when that's the dimension that matters. Newly released, so production-readiness is still being shaken out.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | |
|---|---|---|---|
| Q2_K | 316.3 GB | Low | |
| Q4_K_MRecommended | 343.2 GB | Good | |
| Q5_K_M | 356.0 GB | Very Good | |
| Q6_K | 371.3 GB | Excellent | |
| Q8_0 | 403.3 GB | Near Perfect | |
| FP16 | 524.9 GB | Full |
See which devices can run this model and at what quality level.
| AA | 16.7 tok/s | 343.2 GB | ||
| AA | 16.7 tok/s | 343.2 GB | ||
| AA | 16.7 tok/s | 343.2 GB | ||
| AA | 16.7 tok/s | 343.2 GB | ||
SuperMicro Super AI StationSuperMicro | AA | 16.7 tok/s | 343.2 GB | |
Gigabyte W775-V10-L01Gigabyte | AA | 16.7 tok/s | 343.2 GB | |
| BB | 1.9 tok/s | 343.2 GB | ||
| BB | 1.9 tok/s | 343.2 GB | ||
| FF | 1.2 tok/s | 343.2 GB | ||
| FF | 0.6 tok/s | 343.2 GB | ||
| FF | 12.4 tok/s | 343.2 GB | ||
| FF | 14.1 tok/s | 343.2 GB | ||
| FF | 18.8 tok/s | 343.2 GB | ||
| FF | 0.7 tok/s | 343.2 GB | ||
| FF | 1.0 tok/s | 343.2 GB | ||
| FF | 1.5 tok/s | 343.2 GB | ||
| FF | 1.9 tok/s | 343.2 GB | ||
| FF | 2.3 tok/s | 343.2 GB | ||
| FF | 1.5 tok/s | 343.2 GB | ||
| FF | 1.5 tok/s | 343.2 GB | ||
Apple M4Apple | FF | 0.3 tok/s | 343.2 GB | |
| FF | 1.3 tok/s | 343.2 GB | ||
| FF | 0.6 tok/s | 343.2 GB | ||
Apple M5Apple | FF | 0.4 tok/s | 343.2 GB | |
| FF | 1.4 tok/s | 343.2 GB |
Energy cost on Apple M3 Ultra (32-core CPU, 80-core GPU) (~1.9 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
Local (energy only)Mistral Medium 3.5 on Apple M3 Ultra (32-core CPU, 80-core GPU) · ~1.9 tok/s · 160W | $2.78 |
GPT-5.5OpenAI · in $5.00 · out $30.00 | $12.50 |
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00 | $11.00 |
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00 | $3.75 |
Grok 4.3xAI · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
Cheapest current cloud rentals with at least 343 GB VRAM, refreshed hourly.
No current rental listing covers this model’s VRAM requirement on the providers we track.
Mistral Medium 3.5 is Mistral AI’s first flagship merged model — a dense 128B-parameter transformer that unifies instruction-following, reasoning, coding, and vision into a single set of weights. Released in April 2026 under a Modified MIT license, it replaces Mistral Medium 3.1 and Magistral in Mistral’s Le Chat product, and Devstral 2 in their Vibe coding agent. The model targets developers who need one capable local model for chat, agentic workflows, code generation, and multimodal analysis, without juggling separate specialized checkpoints.
At 128B parameters, Mistral Medium 3.5 competes directly with other dense models in the ~100B-140B range (e.g., Llama 3.1 123B, Qwen 2.5 72B) and with larger MoE models like Mixtral 8x22B. What sets it apart is its configurable reasoning effort — you can ask for a quick reply or let the model think harder on complex tasks, all from the same model. It scores 77.6% on SWE-Bench Verified and handles a 262,144-token context window, making it a serious option for local coding assistants, long-document analysis, and self-hosted agent pipelines.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Mistral AI model we track.

Explore the Family
The full Mistral family leaderboard with sizes, benchmark scores, and a release timeline.
Mistral Medium 3.5 uses a dense transformer architecture with 128B parameters. Unlike mixture-of-experts (MoE) designs where only a subset of parameters activate per token, this dense model uses all parameters for every forward pass. That means inference speed and VRAM consumption are predictable: you need enough GPU memory to load the full model at a given quantization, and tokens-per-second scales with compute bandwidth.
The context window is 262,144 tokens (effectively 256k). This enables processing entire codebases, full-length books, or multi-turn agentic sessions without truncation. The model also includes a dedicated vision encoder trained from scratch to handle variable image sizes and aspect ratios, accepting images alongside text input.
The architecture supports:
reasoning_effort parameter ("none" or "high"). When set to "high", the model applies additional test-time compute, boosting performance on math, logic, and agentic tasks.Recommended sampling parameters from Mistral: temperature 0.7 with reasoning_effort="high", or 0.0–0.7 with reasoning_effort="none"; top_p 0.95 for reasoning, leave as 1.0 otherwise.
Mistral Medium 3.5 excels in four primary areas:
Coding & Agentic Tasks – With a SWE-Bench Verified score of 77.6% (close to Claude Sonnet 4.6’s 79.6%), it can resolve real-world GitHub issues, generate pull requests, and refactor codebases. Its strong function-calling and JSON output support make it suitable for building autonomous coding agents that interact with tools, APIs, and repositories.
Reasoning & Math – Configurable reasoning effort lets you push the model on multi-step problems. For complex agentic runs or math puzzles, setting reasoning_effort="high"" yields better results at the cost of slower generation.
Vision & Multimodal Understanding – The model can analyze images, extract information from diagrams, screenshots, or documents, and answer questions about visual content. This is useful for tasks like UI-to-code conversion, document parsing, or captioning.
General Instruction-Following & Chat – As a unified flagship, it handles open-ended chat, system prompts, and structured instructions reliably. The large context window supports extended conversations and document-grounded Q&A.
Multilingual support covers major European, Asian, and Middle Eastern languages, making it viable for international deployments.
A 128B dense model is demanding — expect substantial hardware requirements. Here’s what you need to know for local inference.
| Quantization | Approximate VRAM Needed | Typical GPU Configuration |
|---|---|---|
| FP16 (full precision) | ~256 GB | 4–8 H100/H200 (80GB each) or multi-GPU server |
| Q8_0 (8-bit) | ~135 GB | 2–3 H100/H200, or 4 RTX 6000 Ada |
| Q4_K_M (4-bit) | ~72 GB | 2 RTX 4090 (24GB each) or 1 H100 (80GB) |
| Q4_K_S (4-bit small) | ~68 GB | 2 RTX 4090 or 1 A100 (80GB) |
| Q3_K_L (3-bit) | ~55 GB | 1 RTX 4090 (requires offloading some layers to system RAM) |
Consumer hardware: The most realistic path is using 4-bit quantization (Q4_K_M) on a pair of RTX 4090 GPUs (48GB total) or a single RTX 6000 Ada (48GB). For a single RTX 4090 (24GB), you’ll need 3-bit quantization and layer offloading to CPU RAM, which significantly slows tokens per second. Apple Silicon users with M4 Max or M4 Ultra (64GB+ unified memory) can run Q4_K_M entirely in memory, but expect lower throughput than NVIDIA.
For real-time interaction, dual consumer GPUs or a single H100-class card are recommended. Inference engines: Ollama (via llama.cpp), vLLM, and SGLang all support Mistral Medium 3.5. Ollama is the quickest way to get started — run ollama run mistral-medium-3.5 after downloading the Q4_K_M GGUF.
Quantization degrades raw capability, especially on reasoning and long-context tasks. If you need peak performance, FP16 on multiple H100s is ideal. For most local practitioners, Q4_K_M offers the best balance of quality and feasibility.
vs. Llama 3.1 123B (dense, Meta): Llama 3.1 is a strong general-purpose model with excellent multilingual support, but lacks vision capabilities and native function calling out of the box. Mistral Medium 3.5 matches or exceeds Llama 3.1 on coding benchmarks and provides vision and configurable reasoning. If you need vision or agentic tool use, Mistral is the better choice. If you prioritize pure text-generation quality and have a strong preference for Meta’s ecosystem, Llama 3.1 remains competitive.
vs. Qwen 2.5 72B (dense, Alibaba): Qwen 2.5 72B is smaller (72B vs 128B) and much easier to run on consumer hardware (single 24GB GPU at 4-bit). It scores ~72% on SWE-Bench Verified — lower but still strong. It also supports vision and function calling. Mistral Medium 3.5 outperforms it on coding and reasoning at the cost of requiring significantly more VRAM. Choose Mistral if you have the hardware and need the extra performance; choose Qwen if you need to stay on a single consumer GPU.
vs. Mixtral 8x22B (MoE, Mistral): Mixtral 8x22B uses ~45B active parameters per token, so it runs faster on modest hardware (~2–3 tokens/s on a single 4090 at Q4). However, Mistral Medium 3.5’s dense 128B architecture delivers higher benchmark scores and native vision, making it the more capable model when you have the GPU budget. Mixtral is a pragmatic fallback for limited setups.
