
Mistral AI's 123B dense flagship with state-of-the-art reasoning, 80+ coding languages, and multilingual capabilities. 128K context.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 172.0 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 197.8 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 210.1 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 224.9 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 255.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 372.5 GB | Full | Full 16-bit floating point — maximum quality, largest size |
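As a rough sanity check on figures like these, weights-only memory can be estimated from the parameter count and the bits-per-weight of the chosen quantization. The bpw values below are typical llama.cpp figures and should be treated as assumptions; actual requirements, as in the table above, come out higher once the KV cache (especially at long context), activations, and runtime overhead are added:

```python
# Rough weights-only VRAM estimate for a dense model.
# bpw values are typical llama.cpp quantization sizes (assumption);
# real usage adds KV cache, activations, and runtime overhead on top.

PARAMS = 123e9  # Mistral Large 2: 123B dense parameters

BPW = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weights_gb(params: float, bpw: float) -> float:
    """Memory for the weights alone, in GB (params * bits / 8 bits-per-byte)."""
    return params * bpw / 8 / 1e9

for fmt, bpw in BPW.items():
    print(f"{fmt:>7}: ~{weights_gb(PARAMS, bpw):6.1f} GB (weights only)")
```

At FP16 this gives 246 GB for the weights alone; the gap to the figures in the table is the long-context and runtime overhead.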
See which devices can run this model and at what quality level.
| Device | Speed (Q4_K_M) | VRAM Used |
|---|---|---|
| NVIDIA B200 | 32.6 tok/s | 197.8 GB |
| Apple M5 | 0.6 tok/s | 197.8 GB |
| Apple M4 | 0.5 tok/s | 197.8 GB |
Mistral Large 2 is Mistral AI’s flagship 123B dense parameter model, designed to compete directly with frontier models like GPT-4o and Llama 3.1 405B. Released as a significant upgrade over the original Mistral Large, this model represents a specific strategic choice in architecture: a 123B dense configuration. This size is intentionally calibrated to provide state-of-the-art reasoning and coding performance while remaining accessible to practitioners who have moved beyond single-GPU setups but cannot support the massive infrastructure required for 400B+ parameter models.
For local deployment, Mistral Large 2 fills the gap between the ubiquitous 70B models and the massive, often unwieldy, ultra-large-scale models. It excels in complex reasoning tasks, high-tier mathematical problem solving, and agentic workflows that require reliable function-calling. Because it is a dense model rather than a Mixture of Experts (MoE), every parameter is active during inference, providing a level of "intellectual density" that often translates to more stable performance in nuanced, multi-step instruction following compared to smaller or sparse models.
The architecture of Mistral Large 2 is a standard Transformer-based dense decoder-only model with 123 billion parameters. Unlike the Mixtral series (which uses MoE), Mistral Large 2 processes every token through all 123B parameters. While this makes the model more computationally expensive per token than an MoE model of a similar total size, it results in a higher ceiling for reasoning and knowledge retrieval accuracy.
A key technical highlight is the 128,000-token context window. This allows practitioners to process entire codebases, long legal documents, or complex technical manuals in a single prompt. For local users, the 128K context is particularly valuable for Retrieval-Augmented Generation (RAG) pipelines where high-precision synthesis of multiple documents is required.
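In such a RAG pipeline, even a 128K window needs budgeting. A minimal sketch of greedy document packing, using the common ~4-characters-per-token heuristic as a stand-in for a real tokenizer (an assumption; a production pipeline should count tokens with the model's actual tokenizer):

```python
# Greedy packing of retrieved documents into a context budget.
# The chars/4 token estimate is a rough heuristic (assumption);
# count tokens with the model's tokenizer in a real pipeline.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def pack_context(docs: list[str], budget: int = 128_000,
                 reserve: int = 4_096) -> list[str]:
    """Keep documents (in retrieval-rank order) until the budget,
    minus a reserve for the question and the answer, is exhausted."""
    remaining = budget - reserve
    packed = []
    for doc in docs:
        cost = estimate_tokens(doc)
        if cost > remaining:
            break
        packed.append(doc)
        remaining -= cost
    return packed

docs = ["spec " * 2000, "manual " * 3000, "notes " * 50_000]
selected = pack_context(docs, budget=16_000)
print(len(selected))  # the oversized third document no longer fits
```

The same function with the default 128K budget is where a model like this pays off: whole codebases or document sets fit before the packing loop ever breaks.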
The model was trained with a focus on multilingual support and code generation. It supports over 80 coding languages and dozens of natural languages, including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, and Chinese. The training cutoff is in 2024, ensuring it has knowledge of recent library updates and modern coding practices.
Mistral Large 2 is not a general-purpose "chat" toy; it is a workhorse for technical and analytical workloads. Its reasoning benchmark scores put it in the top tier of open-weight models, specifically in its ability to handle logic-heavy tasks without "hallucinating" logical steps.
Mistral Large 2 is one of the premier open-weight choices for coding tasks. It supports 80+ languages and is particularly proficient in Python, C++, Java, and Rust. Because of its large parameter count and dense architecture, it can handle complex refactoring tasks that require an understanding of cross-file dependencies—especially when utilizing the full 128K context window.
The model is natively trained for tool use and function-calling. For developers building local AI agents, Mistral Large 2 provides the reliability needed to output valid JSON and call external APIs or local scripts without frequent syntax errors. This makes it a viable core for local automation systems where reliability is non-negotiable.
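A sketch of the receiving side of such an agent loop, assuming the model emits a tool call as JSON with hypothetical `name`/`arguments` fields (the exact schema depends on your serving stack; adapt accordingly):

```python
import json

# Minimal dispatch loop for model-emitted tool calls.
# The {"name": ..., "arguments": ...} shape is an assumption;
# match it to whatever schema your serving stack produces.

def get_weather(city: str) -> str:
    # Stub standing in for a real API or local script.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(raw: str) -> str:
    """Parse the model's output and call the named tool, rejecting
    unknown tools and malformed JSON instead of executing blindly."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"error: invalid JSON ({exc.msg})"
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        return f"error: unknown tool {call.get('name')!r}"
    return fn(**call.get("arguments", {}))

model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(dispatch(model_output))  # Sunny in Paris
```

The explicit registry and error paths matter in a local automation system: a model that emits valid JSON most of the time still needs the malformed-output branch to fail safely rather than crash the loop.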
While many models claim multilingualism, Mistral Large 2 maintains high reasoning capabilities across its supported languages. It can perform technical translation—such as translating a complex engineering spec from English to French—while preserving the underlying logic and technical accuracy of the text.
To run Mistral Large 2 locally, you must account for its 123B dense architecture. Unlike 70B models which can squeeze onto two consumer GPUs, or 8B models that run on a laptop, a 123B model requires a multi-GPU workstation or a high-end Mac Studio.
VRAM is the primary bottleneck for this model. The quantization table above lists the estimated requirements at each level.
As for the best GPU setup for Mistral Large 2, we recommend either a multi-GPU workstation or a Mac with sufficient unified memory.
The best quantization for Mistral Large 2 for most practitioners is Q4_K_M or IQ4_XS. These 4-bit versions retain roughly 98-99% of the FP16 model's intelligence while reducing the memory footprint by over 60%.
Tokens-per-second throughput varies widely by hardware: top-end datacenter GPUs reach roughly 20-30 tok/s at Q4_K_M, while base Apple Silicon machines can fall below 1 tok/s.
The quickest way to get started is Ollama: running `ollama run mistral-large` pulls the library's default quantization. For more precise VRAM management, using llama.cpp directly lets you offload a specific number of layers to each GPU and make the most of your available hardware.
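Beyond the CLI, Ollama also exposes a local REST API (by default on port 11434). A minimal sketch of a request against its `/api/generate` endpoint, with payload construction kept separate so it can be inspected before anything is sent; the model tag and `num_ctx` value are illustrative:

```python
import json
import urllib.request

# Build and send a request to Ollama's local /api/generate endpoint.
# The model tag and num_ctx below are illustrative; check `ollama list`
# for the tags actually pulled on your machine.

def build_payload(prompt: str, model: str = "mistral-large",
                  num_ctx: int = 8192) -> dict:
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON response, not a stream
        "options": {"num_ctx": num_ctx},  # context window to allocate
    }

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama daemon with the model pulled):
# print(generate("Summarize the Rust borrow checker in two sentences."))
```

Keeping `num_ctx` modest by default is deliberate: allocating the full 128K window inflates the KV cache dramatically, so raise it only for workloads that need it.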
When evaluating Mistral Large 2 performance, it is most often compared to Llama 3.1 70B and Llama 3.1 405B.
Mistral Large 2 is significantly more capable than the Llama 3.1 70B model, particularly in coding and complex multilingual reasoning. However, the 70B model is much easier to run, fitting comfortably on two RTX 3090s. Choose Mistral Large 2 if your hardware supports it and your task requires "frontier-level" logic that 70B models occasionally fail.
Llama 3.1 405B is arguably the more powerful model in terms of raw knowledge and scale, but it is nearly impossible to run locally for most individuals (requiring ~800GB of VRAM for FP16 or ~250GB for 4-bit). Mistral Large 2 offers a "compressed" version of that same intelligence class. In many benchmarks, particularly coding, Mistral Large 2 punches way above its weight, often matching or exceeding the 405B model's utility in a package that is a fraction of the size.
While Mixtral 8x22B is an MoE model with a similar total parameter count (141B), it only uses 39B active parameters per token. This makes Mixtral faster for inference but generally less "smart" for deep reasoning compared to the 123B dense architecture of Mistral Large 2. If speed is your priority, use Mixtral; if accuracy and reasoning are your priorities, Mistral Large 2 is the superior choice among local 123B-class models.