
A hyper-efficient 230B-parameter MoE model that activates only 10B parameters per token, designed for continuous operation at just $1 per hour.
Copy and paste this command to start running the model locally:

ollama run minimax-m2.5
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 20.6 GB | Low |
| Q4_K_M (Recommended) | 22.7 GB | Good |
| Q5_K_M | 23.7 GB | Very Good |
| Q6_K | 24.9 GB | Excellent |
| Q8_0 | 27.4 GB | Near Perfect |
| FP16 | 36.9 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Quality Tier | Speed | VRAM Used |
|---|---|---|---|---|
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 72.3 tok/s | 22.7 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 118.8 tok/s | 22.7 GB |
| Google Cloud TPU v5p | Google | SS | 98.1 tok/s | 22.7 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 170.2 tok/s | 22.7 GB |
| NVIDIA L40S | NVIDIA | SS | 30.6 tok/s | 22.7 GB |
| Google TPU v7 (Ironwood) | Google | SS | 261.7 tok/s | 22.7 GB |
| NVIDIA B200 | NVIDIA | SS | 283.7 tok/s | 22.7 GB |
| Gigabyte W775-V10-L01 | Gigabyte | SS | 251.8 tok/s | 22.7 GB |
| SuperMicro Super AI Station | SuperMicro | SS | 251.8 tok/s | 22.7 GB |
MiniMax-M2.5 is a 230-billion-parameter Mixture-of-Experts (MoE) model that represents a significant leap in inference efficiency for frontier-class local AI. Developed by Shanghai-based MiniMax, it is engineered for high-stakes productivity tasks, including complex software engineering, autonomous agentic workflows, and long-context document analysis.
While the total parameter count reaches 230B, the model activates only 10B parameters per token. This architectural choice positions MiniMax-M2.5 as a direct competitor to other high-efficiency MoE models such as DeepSeek-V3 and Mixtral-8x22B, pairing strong reasoning capabilities with the inference speed typically associated with much smaller models. For developers looking to run MiniMax-M2.5 locally, the model ships under a modified MIT license, making it accessible for a wide range of integration and deployment scenarios.
The defining characteristic of the MiniMax-M2.5 architecture is its sparse MoE structure. By routing each token through only 10B active parameters out of the 230B total, the model achieves a massive reduction in the compute required to generate each token.
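To make the routing idea concrete, here is a minimal sketch of a sparse top-k MoE layer in PyTorch. The dimensions, expert count, and top-k value are illustrative placeholders, not MiniMax's published architecture; the point is simply that each token touches only a small subset of the experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy top-k MoE layer: every token runs through only `top_k` experts."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)        # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # run only the selected experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

moe = SparseMoE()
y = moe(torch.randn(8, 1024))  # each of the 8 tokens activates 2 of 64 experts
```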
The 205k context window is a standout feature for practitioners dealing with massive codebases or legal documents. Unlike many open-weight models that degrade rapidly beyond 32k tokens, M2.5 is optimized to maintain coherence and retrieval accuracy across its entire window, a strength reflected in agentic benchmarks such as BrowseComp. This makes it an ideal engine for RAG (Retrieval-Augmented Generation) pipelines where high-density information retrieval is required.
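A quick way to sanity-check that long-context retrieval behaves as advertised is to bury a fact mid-prompt and ask for it back. The sketch below does this through the Ollama Python client (`pip install ollama`); the `minimax-m2.5` tag matches the quickstart later on this page, and the document chunks are placeholders you would swap for real text.

```python
import ollama

# Placeholder corpus with one retrievable fact buried mid-context.
chunks = [f"Section {i}: filler text." for i in range(200)]
chunks[100] = "Section 100: the staging service listens on port 7443."

response = ollama.chat(
    model="minimax-m2.5",
    messages=[{
        "role": "user",
        "content": "\n".join(chunks) + "\n\nWhich port does the staging service listen on?",
    }],
)
print(response["message"]["content"])  # expect an answer citing port 7443
```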
MiniMax-M2.5 is not a general-purpose "chat" model in the traditional sense; it is a productivity-first engine. It has been extensively trained using reinforcement learning (RL) within hundreds of thousands of real-world environments to excel at agentic tool use and complex reasoning.
M2.5 supports over 10 major programming languages, including Python, Rust, Go, C++, TypeScript, and Java.
The model is designed to act as a "controller" for AI agents. With native support for function-calling and tool use, it can navigate terminal environments, execute web searches, and automate office tasks in Excel or Word. Its reasoning capabilities allow it to handle multi-step planning where an agent must self-correct based on the output of a previous tool execution.
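As a concrete, hypothetical example of that controller pattern, the sketch below wires a single shell-execution tool into Ollama's OpenAI-style tool-calling interface. The `run_shell` tool schema is invented for illustration; a real agent loop would validate commands and feed tool output back into the conversation for the self-correction step described above.

```python
import subprocess
import ollama

# Hypothetical tool definition exposing shell access to the model.
tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Execute a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

response = ollama.chat(
    model="minimax-m2.5",
    messages=[{"role": "user", "content": "List the files in the current directory."}],
    tools=tools,
)

# Execute whichever tool calls the model requested.
for call in response["message"].get("tool_calls") or []:
    if call["function"]["name"] == "run_shell":
        cmd = call["function"]["arguments"]["command"]
        print(subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout)
```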
With its 205,000-token window, M2.5 can ingest entire technical manuals or multiple research papers simultaneously. It is optimized to extract specific data points from the "middle" of the context, a common failure point for smaller MoE models.
The primary challenge for running a 230B model locally is not the compute (thanks to the 10B active parameters) but the VRAM required to house the weights. Even with MoE efficiency, you must load the majority of the 230B parameters into memory unless using advanced offloading techniques.
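A rough rule of thumb for resident weight memory is total parameters times bits per weight, divided by eight. The bits-per-weight figures below are approximate averages for common GGUF quantization mixes, and the numbers ignore KV cache, activations, and any CPU/disk offloading, so treat them as illustrative upper bounds rather than measurements.

```python
PARAMS = 230e9  # total (not just active) parameters stay resident without offloading

# Approximate effective bits per weight for common GGUF quant mixes (assumed values).
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0}

for fmt, bits in BITS_PER_WEIGHT.items():
    gib = PARAMS * bits / 8 / 2**30
    print(f"{fmt:8s} ~{gib:6.0f} GiB of weights")
```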
To fit MiniMax-M2.5 on consumer or prosumer hardware, quantization is mandatory.
The fastest way to test the model's capabilities is via Ollama. Note that the default library may point to a "cloud" version for API-based testing; for local execution, ensure you are pulling a quantized GGUF manifest:
ollama run minimax-m2.5
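Once the model is up, Ollama also exposes an OpenAI-compatible endpoint on localhost:11434, so existing client code can be pointed at the local instance. A minimal sketch (the `api_key` value is ignored by Ollama but required by the client library):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
)
print(reply.choices[0].message.content)
```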
MiniMax-M2.5 occupies the "Heavyweight MoE" category, competing directly with DeepSeek-V3 and Llama-3.1-405B (though the latter is dense).
For practitioners looking for a local AI model with 230B parameters that doesn't sacrifice inference speed, MiniMax-M2.5 is currently one of the most viable "frontier-class" options available for local deployment on high-end workstations.