
Meta's high-capacity MoE model with 17B active parameters and 128 experts drawn from 400B total. Beats GPT-4o and Gemini 2.0 Flash. 1M-token context. Natively multimodal.
Copy and paste this command to start running the model locally:

`ollama run llama4:maverick`

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 142.8 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 146.4 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 148.1 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 150.1 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 154.3 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 170.5 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Rating | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 | S | 44.0 tok/s | 146.4 GB |
| NVIDIA H200 SXM 141GB | B | 26.4 tok/s | 146.4 GB |
| Apple M5 | F | 0.8 tok/s | 146.4 GB |
| Apple M4 | F | 0.7 tok/s | 146.4 GB |
Llama 4 Maverick is Meta’s flagship Mixture of Experts (MoE) model designed to bridge the gap between massive scale and inference efficiency. With 400B total parameters but only 17B active parameters per token, Maverick represents a significant shift in how high-capacity models are deployed. It is a natively multimodal model, handling text and vision inputs with a massive 1-million-token context window, making it a direct competitor to GPT-4o and Gemini 2.0 Flash for local deployment.
For practitioners, the value of Llama 4 Maverick lies in its reasoning density. By utilizing a 128-expert architecture, the model maintains the knowledge base and reasoning capabilities of a 400B-class model while operating with the compute requirements of a much smaller dense model. This architecture allows it to excel in complex instruction-following, advanced mathematics, and deep-context retrieval tasks that were previously impossible on local hardware.
The Llama 4 Maverick MoE efficiency is derived from its sparse activation strategy. While the model occupies a 400B parameter footprint in memory, only 17B parameters are engaged during a single forward pass. This means that while VRAM requirements remain high to store the expert weights, the "compute cost" or FLOPs required to generate a token is drastically lower than a dense 400B model.
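The compute saving from sparse activation can be sketched with a quick back-of-envelope calculation. The parameter counts come from the model card above; the "2N FLOPs per generated token" rule of thumb is a standard approximation, not a figure from this page:

```python
# Rough FLOPs comparison: sparse MoE vs. a hypothetical dense model of the
# same total size. Uses the ~2*N FLOPs-per-token approximation.

TOTAL_PARAMS = 400e9   # weights that must sit in (V)RAM
ACTIVE_PARAMS = 17e9   # weights actually engaged per forward pass

flops_dense = 2 * TOTAL_PARAMS   # per-token FLOPs if all 400B were dense
flops_moe = 2 * ACTIVE_PARAMS    # per-token FLOPs with sparse routing

print(f"Dense 400B:        ~{flops_dense / 1e12:.2f} TFLOPs per token")
print(f"MoE (17B active):  ~{flops_moe / 1e12:.2f} TFLOPs per token")
print(f"Compute reduction: {flops_dense / flops_moe:.1f}x")
```

The memory footprint is unchanged (all 400B parameters must be resident), but per-token compute drops by roughly the ratio of total to active parameters.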
The 1M token context length is a critical feature for local RAG (Retrieval-Augmented Generation) and long-document analysis. Unlike previous iterations, Maverick utilizes a natively multimodal architecture, meaning vision capabilities are not "bolted on" via a separate encoder but are integrated into the core transformer blocks. This leads to higher spatial reasoning accuracy when analyzing images or complex diagrams.
Llama 4 Maverick is tuned for high-agency tasks. Its reasoning benchmark results place it at the top of the open-weights class, particularly in multi-step problem solving and logic-heavy workflows.
For coding, Llama 4 Maverick is a significant upgrade over the Llama 3 series. It handles complex repository-level refactoring and can ingest entire documentation sets into its 1M-token context window. It supports function calling out of the box, allowing it to act as an agent that interacts with local compilers, debuggers, and terminal environments.
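A function-calling request against a local Ollama instance looks roughly like the sketch below. The `run_tests` tool and its schema are hypothetical stand-ins for whatever local tooling you expose; the payload shape follows Ollama's `/api/chat` tool-calling format:

```python
import json

# Sketch of a tool-calling request body for a local Ollama server.
# The "run_tests" tool is a hypothetical example, not a built-in.
payload = {
    "model": "llama4:maverick",
    "messages": [
        {"role": "user", "content": "Run the test suite and summarize failures."}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "run_tests",  # hypothetical local tool
                "description": "Execute the project's test suite.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string", "description": "Test directory"}
                    },
                    "required": ["path"],
                },
            },
        }
    ],
    "stream": False,
}

# Send with: requests.post("http://localhost:11434/api/chat", json=payload)
print(json.dumps(payload, indent=2)[:80])
```

The model replies with a `tool_calls` entry naming the function and arguments; your harness executes the tool and feeds the result back as a `tool` message.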
With native vision, Maverick can process architectural blueprints, financial charts, and handwritten notes. In a local environment, this is ideal for sensitive document processing where data privacy prevents the use of cloud APIs. It can extract structured JSON from complex forms or describe temporal changes across a sequence of images.
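For the structured-extraction case, Ollama's generate endpoint accepts a `format` field that constrains the reply to valid JSON, plus base64-encoded images for multimodal input. The field names in the prompt below are illustrative, not a fixed schema:

```python
# Sketch: requesting structured JSON from a scanned form via a local
# Ollama server. Key names (invoice_no, date, total) are illustrative.
payload = {
    "model": "llama4:maverick",
    "prompt": (
        "Extract the invoice number, date, and total from the attached form. "
        "Respond with a JSON object using keys: invoice_no, date, total."
    ),
    "images": ["<base64-encoded scan>"],  # placeholder, not real image data
    "format": "json",  # constrains the reply to valid JSON
    "stream": False,
}

# Send with: requests.post("http://localhost:11434/api/generate", json=payload)
```

Since everything runs against `localhost`, the scanned documents never leave the machine.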
The model shows high ceiling performance in STEM subjects. Its ability to solve competitive-level math problems and follow complex, multi-lingual instructions makes it suitable for translating technical documentation across 30+ languages while maintaining technical accuracy.
To run Llama 4 Maverick locally, the primary bottleneck is VRAM capacity, not GPU compute. Because only 17B parameters are active per token, generation speed (tokens per second) can be surprisingly high once the model is loaded into memory.
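A memory-bandwidth roofline makes the "VRAM, not compute" point concrete: each decoded token must stream the ~17B active weights from memory. The bandwidth figures below are assumed ballpark specs, and the real throughput in the table above lands well below these ceilings once expert routing, KV-cache reads, and kernel overheads are included:

```python
# Back-of-envelope decode-speed ceiling from memory bandwidth alone.
active_params = 17e9
bytes_per_weight = 0.57  # ~4.5 bits/weight at Q4_K_M (approximation)
bytes_per_token = active_params * bytes_per_weight  # ~9.7 GB streamed/token

# Bandwidths are assumed ballpark figures, not measured values.
for device, bw_gbs in [
    ("Apple M4 (~120 GB/s assumed)", 120),
    ("Mac Studio M3 Ultra (~800 GB/s assumed)", 800),
    ("NVIDIA H200 (~4800 GB/s assumed)", 4800),
]:
    ceiling = bw_gbs * 1e9 / bytes_per_token
    print(f"{device}: <= ~{ceiling:.0f} tok/s (roofline upper bound)")
```

The gap between these upper bounds and measured numbers is where routing overhead, context length, and software stack quality show up.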
Because this is a 400B parameter model, you cannot run it on a single consumer GPU at high precision. You must look at multi-GPU setups or high-unified-memory workstations.
For professional local setups, the M4 Max or M3 Ultra Mac Studio with 192GB of Unified Memory is the most efficient way to run this model. On the PC side, a 4x RTX 6000 Ada (48GB each) or a 7x RTX 4090 (24GB each) cluster is required to fit a 4-bit quantization.
If you are wondering how to run a 400B-parameter model on consumer GPU hardware, the answer is usually partial offloading or a GGUF checkpoint split across multiple cards. Ollama is the fastest way to get started, as it handles layer splitting across multiple GPUs automatically.
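The splitting arithmetic is simple: llama.cpp-style loaders place whole transformer layers per GPU, so the question is how many layers fit in each card. The layer count below is hypothetical (check the actual GGUF metadata), and the calculation ignores KV-cache and activation memory, which eat into each card's headroom:

```python
import math

# Sketch of layer placement for a multi-GPU GGUF split.
model_size_gb = 146.4   # Q4_K_M footprint from the table above
num_layers = 48         # hypothetical; read the real count from GGUF metadata
gpus = [24] * 7         # 7x RTX 4090, 24 GB each

gb_per_layer = model_size_gb / num_layers
layers_per_gpu = [math.floor(vram / gb_per_layer) for vram in gpus]

print(f"~{gb_per_layer:.2f} GB per layer")
print(f"Layers each 24 GB card can hold: {layers_per_gpu[0]}")
print(f"Total layers placed: {sum(layers_per_gpu)} / {num_layers}")
```

With these assumed numbers the split only just fits, which is why practical setups leave a few layers on CPU RAM or reserve a card's worth of margin for the KV cache.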
On a Mac Studio (M2/M3/M4 Ultra), expect 5–12 tokens per second. On a multi-4090 setup using vLLM or llama.cpp, you can achieve 15+ tokens per second due to the MoE architecture's efficiency.
When evaluating Llama 4 Maverick vs DeepSeek-V3 or Grok-1, the trade-offs center on the MoE implementation and the context window.
Llama 4 Maverick is currently the top-tier choice among 400B-parameter-class local models in 2025. It provides GPT-4-level intelligence with the privacy and persistence of local execution, provided you have the VRAM to house its 128 experts.