
Meta's efficient MoE with 17B active / 16 experts. First Llama with native multimodality and 10M token context window. Fits on a single H100.
Copy and paste this command to start running the model locally:

ollama run llama4:scout
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 1366.8 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 1370.4 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 1372.1 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 1374.1 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 1378.3 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 1394.5 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Speed | VRAM Required |
|---|---|---|
| Apple M4 | 0.1 tok/s | 1370.4 GB |
| Apple M5 | 0.1 tok/s | 1370.4 GB |
Meta’s Llama 4 Scout is a 109B parameter Mixture of Experts (MoE) model designed to bridge the gap between high-performance dense models and the efficiency required for local deployments. As the first model in the Llama lineage to feature native multimodality and a massive 10-million-token context window, Scout is positioned as a powerhouse for long-form document analysis, complex codebase reasoning, and vision-integrated workflows.
While the total parameter count sits at 109B, the MoE architecture ensures that only 17B parameters are active during any single inference pass. This allows the model to deliver the reasoning capabilities of a 100B+ model with the throughput speeds typically associated with much smaller architectures. Trained on data with a cutoff of August 2024, Llama 4 Scout is optimized for practitioners who need a high-reasoning local AI model that can fit within professional workstation hardware, such as a single NVIDIA H100 or multi-GPU consumer setups.
The defining characteristic of Llama 4 Scout is its Mixture of Experts (MoE) design. Unlike dense models where every parameter is activated for every token, Scout utilizes 16 distinct experts. For each token processed, the model routes the workload to the most relevant experts, resulting in only 17B active parameters.
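The routing step described above can be sketched in a few lines. This is a toy illustration of top-1 gating across 16 experts; the expert count matches the model card, but the tiny dimensions, the ReLU FFN, and the gating function are illustrative, not Meta's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 16   # expert count from the Scout model card
D_MODEL = 64       # toy hidden size (illustrative)
D_FF = 128         # toy expert FFN width (illustrative)

# One small FFN per expert; only the routed expert's weights touch a token.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
     rng.standard_normal((D_FF, D_MODEL)) * 0.02)
    for _ in range(NUM_EXPERTS)
]
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """Route each token to its top-1 expert and run only that expert's FFN."""
    logits = tokens @ router_w            # (n_tokens, NUM_EXPERTS)
    choice = logits.argmax(axis=-1)       # top-1 routing decision per token
    out = np.empty_like(tokens)
    for e in range(NUM_EXPERTS):
        mask = choice == e
        if mask.any():
            w1, w2 = experts[e]
            h = np.maximum(tokens[mask] @ w1, 0.0)  # toy ReLU FFN
            out[mask] = h @ w2
    return out

tokens = rng.standard_normal((8, D_MODEL))
y = moe_layer(tokens)
print(y.shape)  # (8, 64)
```

Because each token runs through exactly one expert's weights, per-token compute scales with the active parameters (17B) rather than the total pool (109B), even though every expert must stay resident in memory.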
This architecture provides two distinct advantages for local practitioners:

- **Throughput:** only 17B parameters participate in each forward pass, so generation speed is closer to that of a 17B dense model than a 109B one.
- **Capability:** the full 109B parameter pool remains available to the router, preserving the reasoning quality of a much larger model.
Llama 4 Scout introduces a 10,000,000-token context window, a generational leap over Llama 3.1’s 128k. This massive capacity enables workflows that skip Retrieval-Augmented Generation (RAG) entirely: entire technical libraries, thousands of high-resolution images, or massive datasets can be loaded directly into the KV cache for immediate reasoning.
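The KV-cache pressure implied by a window this large can be estimated with simple arithmetic. The layer and head counts below are illustrative assumptions, not Scout's published configuration:

```python
def kv_cache_bytes(tokens: int, n_layers: int = 48, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """FP16 KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * tokens

for ctx in (128_000, 1_000_000, 10_000_000):
    print(f"{ctx:>10,} tokens -> {kv_cache_bytes(ctx) / 1024**3:,.1f} GB")
# ->    128,000 tokens -> 23.4 GB
# ->  1,000,000 tokens -> 183.1 GB
# -> 10,000,000 tokens -> 1,831.1 GB
```

Even under these modest assumed dimensions, a fully populated 10M-token cache measures in terabytes, which is why long-context runs can force a smaller weight quantization to free up memory headroom.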
Unlike previous iterations that relied on adapter-based vision modules, Scout features native multimodality. The vision and text encoders are deeply integrated, allowing the model to reason across visual and textual data simultaneously. This is critical for tasks like architectural diagram analysis, OCR on complex forms, and spatial reasoning within video frames.
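A practical way to exercise the vision path locally is through Ollama's generate endpoint, which accepts base64-encoded images alongside the prompt for multimodal models. A minimal payload builder, assuming the model tag and prompt are placeholders you will substitute:

```python
import base64
import json

def vision_request(model: str, prompt: str, image_path: str) -> str:
    """Build a JSON body for Ollama's /api/generate with one attached image."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [img_b64],  # Ollama expects a list of base64 strings
        "stream": False,      # return one JSON object instead of a token stream
    })
```

POST the resulting body to `http://localhost:11434/api/generate` (Ollama's default local endpoint) to run OCR or diagram-analysis prompts against a local image.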
Llama 4 Scout is not a general-purpose "chat" model; it is a reasoning engine. Its training emphasizes instruction-following and logic, making it particularly effective for technical pipelines.
With its 10M context window, Llama 4 Scout excels at repository-level coding tasks. You can feed the model an entire codebase to identify architectural bottlenecks, perform security audits, or refactor legacy code across hundreds of files. Its performance on reasoning benchmarks indicates a significant improvement in logic-heavy tasks like debugging complex asynchronous logic in Rust or C++.
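One way to exploit that window is to pack a repository into a single prompt rather than chunking it for retrieval. A minimal sketch, where the extension list and file-header format are arbitrary choices:

```python
import pathlib

def pack_repo(root: str, exts: tuple = (".py", ".rs", ".cpp", ".h")) -> str:
    """Concatenate matching source files into one prompt, one header per file."""
    root_path = pathlib.Path(root)
    parts = []
    for path in sorted(root_path.rglob("*")):
        if path.is_file() and path.suffix in exts:
            rel = path.relative_to(root_path)
            # errors='replace' keeps the walk from dying on odd encodings
            parts.append(f"### FILE: {rel}\n{path.read_text(errors='replace')}")
    return "\n\n".join(parts)
```

Appending an instruction such as `"\n\nIdentify architectural bottlenecks."` to the packed string gives the model the whole codebase in one shot, relying on the long context instead of a retrieval layer.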
The native vision capabilities allow Scout to act as a sophisticated document processor, handling tasks such as OCR on dense or complex forms, extraction from scanned documents, and reasoning over embedded charts and diagrams.
With a training cutoff of August 2024, the model's knowledge of mathematical notation and linguistic usage is reasonably current. It supports over 30 languages with high proficiency, making it a viable choice for local translation and localization tasks that require high context retention.
To run Llama 4 Scout locally, you must distinguish between compute requirements and memory requirements. While the 17B active parameters make it fast, the full 109B parameters must reside in VRAM for the model to function without significant offloading penalties.
VRAM is the primary bottleneck for this model. Because all 109B parameters must be resident, the memory footprint is substantial: roughly 218 GB of raw weights at FP16, ~109 GB at 8-bit, and ~68 GB at the recommended 4-bit quantization.
For most practitioners, the best GPU setup for Llama 4 Scout depends on your budget: a single NVIDIA H100 (80 GB) comfortably holds the ~68 GB 4-bit quantization, while multi-GPU consumer setups (for example, several RTX 3090s or 4090s) can split the weights across cards at the cost of inter-GPU latency.
For daily use, the best quantization for Llama 4 Scout is Q4_K_M GGUF or 4.0bpw EXL2. At 4-bit, the model retains nearly all its reasoning capabilities while fitting into a ~68GB footprint. If you are prioritizing the 10M context window, you may need to drop to Q3_K_S to leave room for the massive KV cache, which grows linearly with context usage.
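The footprint figures above follow directly from the parameter count. Treating Q4_K_M as roughly 4.8 bits per weight and Q8_0 as roughly 8.5 (approximations; real GGUF files add metadata and keep some tensors at higher precision):

```python
def footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB: params * bits / 8 bytes / 2^30."""
    return n_params * bits_per_weight / 8 / 1024**3

# Effective bits-per-weight values are approximations for GGUF formats.
for label, bpw in (("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)):
    print(f"{label:7s} ~{footprint_gib(109e9, bpw):6.1f} GiB")
```

The ~61 GiB this yields for Q4_K_M, plus runtime overhead and KV cache, is consistent with the ~68 GB working footprint quoted above.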
The quickest way to deploy is via Ollama. Once your hardware is configured, you can pull the model directly:
ollama run llama4-scout:109b-q4_K_M
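Beyond the CLI, Ollama serves a local REST API on port 11434 by default. A minimal non-streaming call sketched with the standard library, assuming the server is running and the tag above has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks Ollama for a single JSON response object
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(build_payload("llama4-scout:109b-q4_K_M", "Summarize the MoE design."))
```

Call `generate(...)` once the server is up; the snippet above only constructs and prints the request body so it can be inspected without a running instance.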
When evaluating Llama 4 Scout hardware requirements and performance, it is helpful to compare it against its closest competitors in the 100B+ parameter range.
Mixtral 8x22B is the other major player in the open-weights MoE space. With roughly 141B total and 39B active parameters, it demands more VRAM and more compute per token than Scout's 17B active path, and its 64k context window is orders of magnitude smaller.
While Llama 3.1 70B is a dense model and easier to fit on 2x RTX 3090s (48GB VRAM), Scout offers a significant jump in intelligence.
Llama 4 Scout is the definitive choice in 2025 for a local 109B-parameter model among users who need a "big model" experience without the latency of a massive dense architecture. If you have the VRAM to support it, the combination of 10M context and native vision makes it one of the most versatile tools available for local deployment.