
Meta's high-capacity MoE model with 17B active parameters and 128 experts drawn from 400B total. Beats GPT-4o and Gemini 2.0 Flash. 1M-token context. Natively multimodal.
Copy and paste this command to start running the model locally:

`ollama run llama4:maverick`

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 142.8 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 146.4 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 148.1 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 150.1 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 154.3 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 170.5 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Rating | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 | S | 44.0 tok/s | 146.4 GB |
| NVIDIA H200 SXM 141GB | B | 26.4 tok/s | 146.4 GB |
| Apple M5 | F | 0.8 tok/s | 146.4 GB |
| Apple M4 | F | 0.7 tok/s | 146.4 GB |
Llama 4 Maverick is Meta’s flagship Mixture of Experts (MoE) model designed to bridge the gap between massive scale and inference efficiency. With 400B total parameters but only 17B active parameters per token, Maverick represents a significant shift in how high-capacity models are deployed. It is a natively multimodal model, handling text and vision inputs with a massive 1-million-token context window, making it a direct competitor to GPT-4o and Gemini 2.0 Flash for local deployment.
For practitioners, the value of Llama 4 Maverick lies in its reasoning density. By utilizing a 128-expert architecture, the model maintains the knowledge base and reasoning capabilities of a 400B-class model while operating with the compute requirements of a much smaller dense model. This architecture allows it to excel in complex instruction-following, advanced mathematics, and deep-context retrieval tasks that were previously impossible on local hardware.
The Llama 4 Maverick MoE efficiency is derived from its sparse activation strategy. While the model occupies a 400B parameter footprint in memory, only 17B parameters are engaged during a single forward pass. This means that while VRAM requirements remain high to store the expert weights, the "compute cost" or FLOPs required to generate a token is drastically lower than a dense 400B model.
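The compute saving from sparse activation can be sketched with a quick back-of-envelope calculation. The parameter counts come from the model card above; the "2N FLOPs per generated token" rule of thumb is a standard approximation, not a figure from this page:

```python
# Rough FLOPs comparison: sparse MoE vs. a hypothetical dense model of the
# same total size. Uses the ~2*N FLOPs-per-token approximation.

TOTAL_PARAMS = 400e9   # weights that must sit in (V)RAM
ACTIVE_PARAMS = 17e9   # weights actually engaged per forward pass

flops_dense = 2 * TOTAL_PARAMS   # per-token FLOPs if all 400B were dense
flops_moe = 2 * ACTIVE_PARAMS    # per-token FLOPs with sparse routing

print(f"Dense 400B:        ~{flops_dense / 1e12:.2f} TFLOPs per token")
print(f"MoE (17B active):  ~{flops_moe / 1e12:.2f} TFLOPs per token")
print(f"Compute reduction: {flops_dense / flops_moe:.1f}x")
```

The memory footprint is unchanged (all 400B parameters must be resident), but per-token compute drops by roughly the ratio of total to active parameters.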
The 1M token context length is a critical feature for local RAG (Retrieval-Augmented Generation) and long-document analysis. Unlike previous iterations, Maverick utilizes a natively multimodal architecture, meaning vision capabilities are not "bolted on" via a separate encoder but are integrated into the core transformer blocks. This leads to higher spatial reasoning accuracy when analyzing images or complex diagrams.
Llama 4 Maverick is tuned for high-agency tasks. Its reasoning benchmark results place it at the top of the open-weights class, particularly in multi-step problem solving and logic-heavy workflows.
For coding, Llama 4 Maverick is a significant upgrade over the Llama 3 series. It handles complex repository-level refactoring and can ingest entire documentation sets into its 1M-token context window. It supports function calling out of the box, allowing it to act as an agent that interacts with local compilers, debuggers, and terminal environments.
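A function-calling request against a local Ollama instance looks roughly like the sketch below. The `run_tests` tool and its schema are hypothetical stand-ins for whatever local tooling you expose; the payload shape follows Ollama's `/api/chat` tool-calling format:

```python
import json

# Sketch of a tool-calling request body for a local Ollama server.
# The "run_tests" tool is a hypothetical example, not a built-in.
payload = {
    "model": "llama4:maverick",
    "messages": [
        {"role": "user", "content": "Run the test suite and summarize failures."}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "run_tests",  # hypothetical local tool
                "description": "Execute the project's test suite.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string", "description": "Test directory"}
                    },
                    "required": ["path"],
                },
            },
        }
    ],
    "stream": False,
}

# Send with: requests.post("http://localhost:11434/api/chat", json=payload)
print(json.dumps(payload, indent=2)[:80])
```

The model replies with a `tool_calls` entry naming the function and arguments; your harness executes the tool and feeds the result back as a `tool` message.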
With native vision, Maverick can process architectural blueprints, financial charts, and handwritten notes. In a local environment, this is ideal for sensitive document processing where data privacy prevents the use of cloud APIs. It can extract structured JSON from complex forms or describe temporal changes across a sequence of images.
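For the structured-extraction case, Ollama's generate endpoint accepts a `format` field that constrains the reply to valid JSON, plus base64-encoded images for multimodal input. The field names in the prompt below are illustrative, not a fixed schema:

```python
# Sketch: requesting structured JSON from a scanned form via a local
# Ollama server. Key names (invoice_no, date, total) are illustrative.
payload = {
    "model": "llama4:maverick",
    "prompt": (
        "Extract the invoice number, date, and total from the attached form. "
        "Respond with a JSON object using keys: invoice_no, date, total."
    ),
    "images": ["<base64-encoded scan>"],  # placeholder, not real image data
    "format": "json",  # constrains the reply to valid JSON
    "stream": False,
}

# Send with: requests.post("http://localhost:11434/api/generate", json=payload)
```

Since everything runs against `localhost`, the scanned documents never leave the machine.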
The model shows high ceiling performance in STEM subjects. Its ability to solve competitive-level math problems and follow complex, multi-lingual instructions makes it suitable for translating technical documentation across 30+ languages while maintaining technical accuracy.
To run Llama 4 Maverick locally, the primary bottleneck is VRAM capacity, not GPU compute. Because only 17B parameters are active per token, generation speed (tokens per second) can be surprisingly high once the model is loaded into memory.
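A memory-bandwidth roofline makes the "VRAM, not compute" point concrete: each decoded token must stream the ~17B active weights from memory. The bandwidth figures below are assumed ballpark specs, and the real throughput in the table above lands well below these ceilings once expert routing, KV-cache reads, and kernel overheads are included:

```python
# Back-of-envelope decode-speed ceiling from memory bandwidth alone.
active_params = 17e9
bytes_per_weight = 0.57  # ~4.5 bits/weight at Q4_K_M (approximation)
bytes_per_token = active_params * bytes_per_weight  # ~9.7 GB streamed/token

# Bandwidths are assumed ballpark figures, not measured values.
for device, bw_gbs in [
    ("Apple M4 (~120 GB/s assumed)", 120),
    ("Mac Studio M3 Ultra (~800 GB/s assumed)", 800),
    ("NVIDIA H200 (~4800 GB/s assumed)", 4800),
]:
    ceiling = bw_gbs * 1e9 / bytes_per_token
    print(f"{device}: <= ~{ceiling:.0f} tok/s (roofline upper bound)")
```

The gap between these upper bounds and measured numbers is where routing overhead, context length, and software stack quality show up.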
Because this is a 400B parameter model, you cannot run it on a single consumer GPU at high precision. You must look at multi-GPU setups or high-unified-memory workstations.
For professional local setups, the M4 Max or M3 Ultra Mac Studio with 192GB of Unified Memory is the most efficient way to run this model. On the PC side, a 4x RTX 6000 Ada (48GB each) or a 7x RTX 4090 (24GB each) cluster is required to fit a 4-bit quantization.
If you are wondering how to run a 400B-parameter model on consumer GPU hardware, the answer is usually partial offloading or a GGUF checkpoint split across multiple cards. Ollama is the fastest way to get started, as it handles layer splitting across multiple GPUs automatically.
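The splitting arithmetic is simple: llama.cpp-style loaders place whole transformer layers per GPU, so the question is how many layers fit in each card. The layer count below is hypothetical (check the actual GGUF metadata), and the calculation ignores KV-cache and activation memory, which eat into each card's headroom:

```python
import math

# Sketch of layer placement for a multi-GPU GGUF split.
model_size_gb = 146.4   # Q4_K_M footprint from the table above
num_layers = 48         # hypothetical; read the real count from GGUF metadata
gpus = [24] * 7         # 7x RTX 4090, 24 GB each

gb_per_layer = model_size_gb / num_layers
layers_per_gpu = [math.floor(vram / gb_per_layer) for vram in gpus]

print(f"~{gb_per_layer:.2f} GB per layer")
print(f"Layers each 24 GB card can hold: {layers_per_gpu[0]}")
print(f"Total layers placed: {sum(layers_per_gpu)} / {num_layers}")
```

With these assumed numbers the split only just fits, which is why practical setups leave a few layers on CPU RAM or reserve a card's worth of margin for the KV cache.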
On a Mac Studio (M2/M3/M4 Ultra), expect 5–12 tokens per second. On a multi-4090 setup using vLLM or llama.cpp, you can achieve 15+ tokens per second due to the MoE architecture's efficiency.
When evaluating Llama 4 Maverick vs DeepSeek-V3 or Grok-1, the trade-offs center on the MoE implementation and the context window.
Llama 4 Maverick is currently the top-tier choice among 400B-parameter-class local models in 2025. It provides GPT-4-level intelligence with the privacy and persistence of local execution, provided you have the VRAM to house its 128 experts.