
Meta's efficient MoE with 17B active / 16 experts. First Llama with native multimodality and 10M token context window. Fits on a single H100.
Copy and paste this command to start running the model locally:

ollama run llama4:scout
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 1366.8 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 1370.4 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 1372.1 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 1374.1 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 1378.3 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 1394.5 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Speed | VRAM Required |
|---|---|---|
| Apple M4 | 0.1 tok/s | 1370.4 GB |
| Apple M5 | 0.1 tok/s | 1370.4 GB |
Meta’s Llama 4 Scout is a 109B parameter Mixture of Experts (MoE) model designed to bridge the gap between high-performance dense models and the efficiency required for local deployments. As the first model in the Llama lineage to feature native multimodality and a massive 10-million-token context window, Scout is positioned as a powerhouse for long-form document analysis, complex codebase reasoning, and vision-integrated workflows.
While the total parameter count sits at 109B, the MoE architecture ensures that only 17B parameters are active during any single inference pass. This allows the model to deliver the reasoning capabilities of a 100B+ model with the throughput speeds typically associated with much smaller architectures. Trained on data with a cutoff of August 2024, Llama 4 Scout is optimized for practitioners who need a high-reasoning local AI model that can fit within professional workstation hardware, such as a single NVIDIA H100 or multi-GPU consumer setups.
The defining characteristic of Llama 4 Scout is its Mixture of Experts (MoE) design. Unlike dense models where every parameter is activated for every token, Scout utilizes 16 distinct experts. For each token processed, the model routes the workload to the most relevant experts, resulting in only 17B active parameters.
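The routing step described above can be sketched in a few lines. This is a toy illustration of top-1 gating across 16 experts; the expert count matches the model card, but the tiny dimensions, the ReLU FFN, and the gating function are illustrative, not Meta's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 16   # expert count from the Scout model card
D_MODEL = 64       # toy hidden size (illustrative)
D_FF = 128         # toy expert FFN width (illustrative)

# One small FFN per expert; only the routed expert's weights touch a token.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
     rng.standard_normal((D_FF, D_MODEL)) * 0.02)
    for _ in range(NUM_EXPERTS)
]
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """Route each token to its top-1 expert and run only that expert's FFN."""
    logits = tokens @ router_w            # (n_tokens, NUM_EXPERTS)
    choice = logits.argmax(axis=-1)       # top-1 routing decision per token
    out = np.empty_like(tokens)
    for e in range(NUM_EXPERTS):
        mask = choice == e
        if mask.any():
            w1, w2 = experts[e]
            h = np.maximum(tokens[mask] @ w1, 0.0)  # toy ReLU FFN
            out[mask] = h @ w2
    return out

tokens = rng.standard_normal((8, D_MODEL))
y = moe_layer(tokens)
print(y.shape)  # (8, 64)
```

Because each token runs through exactly one expert's weights, per-token compute scales with the active parameters (17B) rather than the total pool (109B), even though every expert must stay resident in memory.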
This architecture provides two distinct advantages for local practitioners:

- **Throughput:** only 17B parameters participate in each forward pass, so generation speed is closer to that of a 17B dense model than a 109B one.
- **Capability:** the full 109B parameter pool remains available to the router, preserving the reasoning quality of a much larger model.
Llama 4 Scout introduces a 10,000,000-token context window, a generational leap over Llama 3.1’s 128k. This massive capacity enables workflows that skip Retrieval-Augmented Generation (RAG) entirely: entire technical libraries, thousands of high-resolution images, or massive datasets can be loaded directly into the KV cache for immediate reasoning.
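The KV-cache pressure implied by a window this large can be estimated with simple arithmetic. The layer and head counts below are illustrative assumptions, not Scout's published configuration:

```python
def kv_cache_bytes(tokens: int, n_layers: int = 48, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """FP16 KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * tokens

for ctx in (128_000, 1_000_000, 10_000_000):
    print(f"{ctx:>10,} tokens -> {kv_cache_bytes(ctx) / 1024**3:,.1f} GB")
# ->    128,000 tokens -> 23.4 GB
# ->  1,000,000 tokens -> 183.1 GB
# -> 10,000,000 tokens -> 1,831.1 GB
```

Even under these modest assumed dimensions, a fully populated 10M-token cache measures in terabytes, which is why long-context runs can force a smaller weight quantization to free up memory headroom.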
Unlike previous iterations that relied on adapter-based vision modules, Scout features native multimodality. The vision and text encoders are deeply integrated, allowing the model to reason across visual and textual data simultaneously. This is critical for tasks like architectural diagram analysis, OCR on complex forms, and spatial reasoning within video frames.
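A practical way to exercise the vision path locally is through Ollama's generate endpoint, which accepts base64-encoded images alongside the prompt for multimodal models. A minimal payload builder, assuming the model tag and prompt are placeholders you will substitute:

```python
import base64
import json

def vision_request(model: str, prompt: str, image_path: str) -> str:
    """Build a JSON body for Ollama's /api/generate with one attached image."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [img_b64],  # Ollama expects a list of base64 strings
        "stream": False,      # return one JSON object instead of a token stream
    })
```

POST the resulting body to `http://localhost:11434/api/generate` (Ollama's default local endpoint) to run OCR or diagram-analysis prompts against a local image.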
Llama 4 Scout is not a general-purpose "chat" model; it is a reasoning engine. Its training emphasizes instruction-following and logic, making it particularly effective for technical pipelines.
With its 10M context window, Llama 4 Scout excels at repository-level coding tasks. You can feed the model an entire codebase to identify architectural bottlenecks, perform security audits, or refactor legacy code across hundreds of files. Its performance on reasoning benchmarks indicates a significant improvement in logic-heavy tasks like debugging complex asynchronous logic in Rust or C++.
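One way to exploit that window is to pack a repository into a single prompt rather than chunking it for retrieval. A minimal sketch, where the extension list and file-header format are arbitrary choices:

```python
import pathlib

def pack_repo(root: str, exts: tuple = (".py", ".rs", ".cpp", ".h")) -> str:
    """Concatenate matching source files into one prompt, one header per file."""
    root_path = pathlib.Path(root)
    parts = []
    for path in sorted(root_path.rglob("*")):
        if path.is_file() and path.suffix in exts:
            rel = path.relative_to(root_path)
            # errors='replace' keeps the walk from dying on odd encodings
            parts.append(f"### FILE: {rel}\n{path.read_text(errors='replace')}")
    return "\n\n".join(parts)
```

Appending an instruction such as `"\n\nIdentify architectural bottlenecks."` to the packed string gives the model the whole codebase in one shot, relying on the long context instead of a retrieval layer.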
The native vision capabilities allow Scout to act as a sophisticated document processor, handling tasks such as OCR on dense or complex forms, extraction from scanned documents, and reasoning over embedded charts and diagrams.
With a training cutoff of August 2024, the model's knowledge of mathematical notation and linguistic usage is reasonably current. It supports over 30 languages with high proficiency, making it a viable choice for local translation and localization tasks that require high context retention.
To run Llama 4 Scout locally, you must distinguish between compute requirements and memory requirements. While the 17B active parameters make it fast, the full 109B parameters must reside in VRAM for the model to function without significant offloading penalties.
VRAM is the primary bottleneck for this model. Because all 109B parameters must be resident, the memory footprint is substantial: roughly 218 GB of raw weights at FP16, ~109 GB at 8-bit, and ~68 GB at the recommended 4-bit quantization.
For most practitioners, the best GPU setup for Llama 4 Scout depends on your budget: a single NVIDIA H100 (80 GB) comfortably holds the ~68 GB 4-bit quantization, while multi-GPU consumer setups (for example, several RTX 3090s or 4090s) can split the weights across cards at the cost of inter-GPU latency.
For daily use, the best quantization for Llama 4 Scout is Q4_K_M GGUF or 4.0bpw EXL2. At 4-bit, the model retains nearly all its reasoning capabilities while fitting into a ~68GB footprint. If you are prioritizing the 10M context window, you may need to drop to Q3_K_S to leave room for the massive KV cache, which grows linearly with context usage.
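The footprint figures above follow directly from the parameter count. Treating Q4_K_M as roughly 4.8 bits per weight and Q8_0 as roughly 8.5 (approximations; real GGUF files add metadata and keep some tensors at higher precision):

```python
def footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB: params * bits / 8 bytes / 2^30."""
    return n_params * bits_per_weight / 8 / 1024**3

# Effective bits-per-weight values are approximations for GGUF formats.
for label, bpw in (("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)):
    print(f"{label:7s} ~{footprint_gib(109e9, bpw):6.1f} GiB")
```

The ~61 GiB this yields for Q4_K_M, plus runtime overhead and KV cache, is consistent with the ~68 GB working footprint quoted above.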
The quickest way to deploy is via Ollama. Once your hardware is configured, you can pull the model directly:
ollama run llama4-scout:109b-q4_K_M
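Beyond the CLI, Ollama serves a local REST API on port 11434 by default. A minimal non-streaming call sketched with the standard library, assuming the server is running and the tag above has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks Ollama for a single JSON response object
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(build_payload("llama4-scout:109b-q4_K_M", "Summarize the MoE design."))
```

Call `generate(...)` once the server is up; the snippet above only constructs and prints the request body so it can be inspected without a running instance.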
When evaluating Llama 4 Scout hardware requirements and performance, it is helpful to compare it against its closest competitors in the 100B+ parameter range.
Mixtral 8x22B is the other major player in the open-weights MoE space. With roughly 141B total and 39B active parameters, it demands more VRAM and more compute per token than Scout's 17B active path, and its 64k context window is orders of magnitude smaller.
While Llama 3.1 70B is a dense model and easier to fit on 2x RTX 3090s (48GB VRAM), Scout offers a significant jump in intelligence.
Llama 4 Scout is the definitive choice in 2025 for a local 109B-parameter model among users who need a "big model" experience without the latency of a massive dense architecture. If you have the VRAM to support it, the combination of 10M context and native vision makes it one of the most versatile tools available for local deployment.