
A hyper-efficient 230B-parameter MoE model that activates only 10B parameters per token, designed for continuous operation at just $1 per hour.
Copy and paste this command to start running the model locally:

ollama run minimax-m2.5
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 20.6 GB | Low |
| Q4_K_M (Recommended) | 22.7 GB | Good |
| Q5_K_M | 23.7 GB | Very Good |
| Q6_K | 24.9 GB | Excellent |
| Q8_0 | 27.4 GB | Near Perfect |
| FP16 | 36.9 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Quality Tier | Speed | VRAM Used |
|---|---|---|---|---|
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 72.3 tok/s | 22.7 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 118.8 tok/s | 22.7 GB |
| Google Cloud TPU v5p | Google | SS | 98.1 tok/s | 22.7 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 170.2 tok/s | 22.7 GB |
| NVIDIA L40S | NVIDIA | SS | 30.6 tok/s | 22.7 GB |
| Google TPU v7 (Ironwood) | Google | SS | 261.7 tok/s | 22.7 GB |
| NVIDIA B200 | NVIDIA | SS | 283.7 tok/s | 22.7 GB |
| Gigabyte W775-V10-L01 | Gigabyte | SS | 251.8 tok/s | 22.7 GB |
| SuperMicro Super AI Station | SuperMicro | SS | 251.8 tok/s | 22.7 GB |
MiniMax-M2.5 is a 230-billion-parameter Mixture-of-Experts (MoE) model that represents a significant leap in inference efficiency for frontier-class local AI. Developed by Shanghai-based MiniMax, it is engineered for high-stakes productivity tasks, including complex software engineering, autonomous agentic workflows, and long-context document analysis.
While the total parameter count reaches 230B, the model activates only 10B parameters per token. This architectural choice positions MiniMax-M2.5 as a direct competitor to other high-efficiency MoE models such as DeepSeek-V3 and Mixtral-8x22B, pairing strong reasoning capabilities with the inference speed typically associated with much smaller models. For developers looking to run MiniMax-M2.5 locally, the model ships under a modified MIT license, making it accessible for a wide range of integration and deployment scenarios.
The defining characteristic of the MiniMax-M2.5 architecture is its sparse MoE structure. By routing each token through only 10B active parameters out of the 230B total, the model achieves a massive reduction in the compute required to generate each token.
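To make the routing idea concrete, here is a minimal sketch of a sparse top-k MoE layer in PyTorch. The dimensions, expert count, and top-k value are illustrative placeholders, not MiniMax's published architecture; the point is simply that each token touches only a small subset of the experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy top-k MoE layer: every token runs through only `top_k` experts."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)        # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # run only the selected experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

moe = SparseMoE()
y = moe(torch.randn(8, 1024))  # each of the 8 tokens activates 2 of 64 experts
```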
The 205k context window is a standout feature for practitioners dealing with massive codebases or legal documents. Unlike many open-weight models that degrade rapidly beyond 32k tokens, M2.5 is optimized to maintain coherence and retrieval accuracy across its entire window, a strength reflected in agentic benchmarks such as BrowseComp. This makes it an ideal engine for RAG (Retrieval-Augmented Generation) pipelines where high-density information retrieval is required.
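A quick way to sanity-check that long-context retrieval behaves as advertised is to bury a fact mid-prompt and ask for it back. The sketch below does this through the Ollama Python client (`pip install ollama`); the `minimax-m2.5` tag matches the quickstart later on this page, and the document chunks are placeholders you would swap for real text.

```python
import ollama

# Placeholder corpus with one retrievable fact buried mid-context.
chunks = [f"Section {i}: filler text." for i in range(200)]
chunks[100] = "Section 100: the staging service listens on port 7443."

response = ollama.chat(
    model="minimax-m2.5",
    messages=[{
        "role": "user",
        "content": "\n".join(chunks) + "\n\nWhich port does the staging service listen on?",
    }],
)
print(response["message"]["content"])  # expect an answer citing port 7443
```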
MiniMax-M2.5 is not a general-purpose "chat" model in the traditional sense; it is a productivity-first engine. It has been extensively trained using reinforcement learning (RL) within hundreds of thousands of real-world environments to excel at agentic tool use and complex reasoning.
M2.5 supports over 10 major programming languages, including Python, Rust, Go, C++, TypeScript, and Java.
The model is designed to act as a "controller" for AI agents. With native support for function-calling and tool use, it can navigate terminal environments, execute web searches, and automate office tasks in Excel or Word. Its reasoning capabilities allow it to handle multi-step planning where an agent must self-correct based on the output of a previous tool execution.
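As a concrete, hypothetical example of that controller pattern, the sketch below wires a single shell-execution tool into Ollama's OpenAI-style tool-calling interface. The `run_shell` tool schema is invented for illustration; a real agent loop would validate commands and feed tool output back into the conversation for the self-correction step described above.

```python
import subprocess
import ollama

# Hypothetical tool definition exposing shell access to the model.
tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Execute a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

response = ollama.chat(
    model="minimax-m2.5",
    messages=[{"role": "user", "content": "List the files in the current directory."}],
    tools=tools,
)

# Execute whichever tool calls the model requested.
for call in response["message"].get("tool_calls") or []:
    if call["function"]["name"] == "run_shell":
        cmd = call["function"]["arguments"]["command"]
        print(subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout)
```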
With its 205,000-token window, M2.5 can ingest entire technical manuals or multiple research papers simultaneously. It is optimized to extract specific data points from the "middle" of the context, a common failure point for smaller MoE models.
The primary challenge for running a 230B model locally is not the compute (thanks to the 10B active parameters) but the VRAM required to house the weights. Even with MoE efficiency, you must load the majority of the 230B parameters into memory unless using advanced offloading techniques.
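A rough rule of thumb for resident weight memory is total parameters times bits per weight, divided by eight. The bits-per-weight figures below are approximate averages for common GGUF quantization mixes, and the numbers ignore KV cache, activations, and any CPU/disk offloading, so treat them as illustrative upper bounds rather than measurements.

```python
PARAMS = 230e9  # total (not just active) parameters stay resident without offloading

# Approximate effective bits per weight for common GGUF quant mixes (assumed values).
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0}

for fmt, bits in BITS_PER_WEIGHT.items():
    gib = PARAMS * bits / 8 / 2**30
    print(f"{fmt:8s} ~{gib:6.0f} GiB of weights")
```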
To fit MiniMax-M2.5 on consumer or prosumer hardware, quantization is mandatory.
The fastest way to test the model's capabilities is via Ollama. Note that the default library may point to a "cloud" version for API-based testing; for local execution, ensure you are pulling a quantized GGUF manifest:
ollama run minimax-m2.5
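Once the model is up, Ollama also exposes an OpenAI-compatible endpoint on localhost:11434, so existing client code can be pointed at the local instance. A minimal sketch (the `api_key` value is ignored by Ollama but required by the client library):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
)
print(reply.choices[0].message.content)
```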
MiniMax-M2.5 occupies the "Heavyweight MoE" category, competing directly with DeepSeek-V3 and Llama-3.1-405B (though the latter is dense).
For practitioners looking for a local AI model with 230B parameters that doesn't sacrifice inference speed, MiniMax-M2.5 is currently one of the most viable "frontier-class" options available for local deployment on high-end workstations.