Meta's updated 70B dense model claiming Llama 3.1 405B-level performance. Strong all-around for chat, code, and reasoning. 128K context.
Copy and paste this command to start running the model locally:

ollama run llama3.3:70b
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 98.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 112.8 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 119.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 128.2 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 145.7 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 212.2 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level. All figures below are for the Q4_K_M quantization (112.8 GB):

| Device | Generation Speed | VRAM Used |
|---|---|---|
| NVIDIA B200 | 57.1 tok/s | 112.8 GB |
| NVIDIA H200 SXM 141GB | 34.3 tok/s | 112.8 GB |
Llama 3.3 70B Instruct is Meta’s most efficient high-performance model to date, designed to deliver the capabilities of the massive Llama 3.1 405B model within a much more accessible 70B parameter footprint. Released in late 2024, this model represents the current state-of-the-art for the 70B class, specifically tuned for dialogue, reasoning, and complex instruction-following. For local practitioners, it sits in the "prosumer" sweet spot: it is too large for a single consumer GPU at high precision, but it is the primary target for dual-GPU workstations and high-end Mac Studio configurations.
Unlike its predecessors, Llama 3.3 70B Instruct is not just an incremental update. Meta has refined the training recipe to push performance to levels that previously required significantly more compute. It competes directly with proprietary models like GPT-4o and Claude 3.5 Sonnet on several benchmarks, making it a premier choice for developers who need GPT-4 class intelligence without the privacy risks or latency of a cloud API. If you are looking for a local AI model in the 70B class in 2025, this is the industry standard.
The Llama 3.3 70B Instruct architecture follows a standard dense Transformer decoder-only design. While the industry has seen a shift toward Mixture-of-Experts (MoE) architectures to reduce inference costs, Meta has stuck with a dense 70B parameter model here. This architectural choice ensures consistent performance across all tokens and avoids the "expert routing" complexities found in models like Mixtral.
The 128k context window is a critical feature for local deployments. It allows for massive RAG (Retrieval-Augmented Generation) pipelines where entire technical documentations or code repositories can be injected into the prompt. Because it uses GQA, the memory overhead for the KV (Key-Value) cache is significantly reduced compared to standard Multi-Head Attention, though at 128k context, the KV cache will still consume a substantial amount of VRAM.
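A back-of-the-envelope calculation makes the GQA savings concrete. The sketch below uses the published Llama 3 70B architecture values (80 layers, 64 query heads, 8 KV heads, head dimension 128) and assumes an FP16 KV cache; many runtimes quantize the cache, which shrinks these numbers further.

```python
# Estimate the KV-cache footprint of Llama 3.3 70B at full 128K context.
# Architecture constants are the published Llama 3 70B values; the FP16
# cache dtype (2 bytes/element) is an assumption.

def kv_cache_bytes(n_tokens, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed for the K and V caches at a given context length."""
    # Factor of 2: one cache for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

ctx = 128 * 1024  # 131,072 tokens

gqa = kv_cache_bytes(ctx)                 # GQA: 8 KV heads
mha = kv_cache_bytes(ctx, n_kv_heads=64)  # hypothetical MHA: one KV head per query head

print(f"GQA KV cache @128K: {gqa / 2**30:.1f} GiB")  # 40.0 GiB
print(f"MHA KV cache @128K: {mha / 2**30:.1f} GiB")  # 320.0 GiB
```

With GQA the full-context cache is roughly 40 GiB instead of the 320 GiB a standard multi-head design would need, which is exactly why long contexts are feasible at all on workstation hardware.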
Llama 3.3 70B Instruct is a general-purpose powerhouse. Its instruction-following is significantly more robust than the Llama 3.1 70B, with fewer instances of "refusal" on complex but safe prompts and better adherence to system prompts.
This model is a top-tier choice for local software development. It excels at boilerplate generation, debugging, and explaining complex architectural patterns. On reasoning-heavy benchmarks, it handles logical branching and multi-step coding tasks with a level of nuance that smaller 8B or 14B models lack. It supports all major programming languages, including Python, Rust, C++, and TypeScript, and is particularly effective when used with local IDE integrations like Continue or Aider.
With native support for function-calling, Llama 3.3 70B is built for "agentic" use cases. It can reliably output structured JSON and determine when to call external tools (like a web search or a database query). This makes it the ideal "brain" for a local AI agent that needs to interact with your local file system or private APIs.
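To illustrate the agentic workflow, here is a minimal sketch of the two pieces you need: a tool declared in the OpenAI-style function schema that Ollama's chat endpoint accepts, and a helper that extracts the model's tool calls from an assistant message. The `web_search` tool is a made-up placeholder, and the exact response shape can vary by runtime, so treat this as an assumption-laden sketch rather than a definitive client.

```python
import json

# Hypothetical tool definition in the OpenAI-style function schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",  # placeholder tool, not a real API
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def extract_tool_calls(message):
    """Pull (name, arguments) pairs out of an assistant chat message."""
    calls = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # some runtimes return arguments as a JSON string
            args = json.loads(args)
        calls.append((fn["name"], args))
    return calls

# Example assistant message of the shape the model might return:
reply = {
    "role": "assistant",
    "tool_calls": [{"function": {"name": "web_search",
                                 "arguments": {"query": "llama 3.3 release date"}}}],
}
print(extract_tool_calls(reply))  # [('web_search', {'query': 'llama 3.3 release date'})]
```

In a real agent loop you would execute each extracted call, append the result as a `tool` role message, and send the conversation back to the model for its final answer.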
The model is fine-tuned for high proficiency in English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. For practitioners handling multilingual datasets, the model maintains its reasoning capabilities across these languages. Its massive context window also makes it a superior tool for long-form summarization, capable of processing a 100-page PDF in a single pass.
To run Llama 3.3 70B Instruct locally, you must account for the sheer size of the weights. A 70B parameter model in 16-bit precision (FP16) requires approximately 140GB of VRAM just to load the weights, which is beyond the reach of any consumer setup. However, through quantization, this model becomes highly performant on enthusiast hardware.
The amount of VRAM you need depends entirely on your chosen quantization level (see the table above).
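The arithmetic behind these figures is simple. The sketch below estimates weights-only memory for a 70B model; the bits-per-weight values for the quantized formats are typical effective averages (an assumption, not exact file sizes), and real deployments need additional headroom for the KV cache and runtime overhead.

```python
# Rough weights-only memory estimate for a 70B-parameter model.
# Quantized bits-per-weight are approximate effective averages for
# llama.cpp-style formats, not exact on-disk sizes.

PARAMS = 70e9

for name, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{name}: ~{gb:.0f} GB (weights only)")
```

At 16 bits per weight this lands on the ~140 GB figure cited above, while a 4-bit quantization brings the weights down into dual-GPU territory.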
The fastest way to deploy is via Ollama. Once installed, you can run:
ollama run llama3.3
This will default to a 4-bit quantized version, which is the most balanced option for typical Llama 3.3 70B Instruct hardware setups.
When deciding whether to deploy Llama 3.3 70B Instruct, it is helpful to compare it against its closest rivals in the open-weight space.
Qwen 2.5 72B (from Alibaba) is perhaps the strongest competitor, with a nearly identical VRAM footprint and particularly strong coding and math performance. Mistral Large 2 is a significantly larger model (123B parameters), but Llama 3.3 70B punches surprisingly close to its weight class.
There is almost no reason to use the older 3.1 version unless you have a specific fine-tune that hasn't been ported yet. Llama 3.3 is essentially a "drop-in" replacement that offers higher intelligence and better benchmark scores for the exact same VRAM cost.
In summary, Llama 3.3 70B Instruct is the definitive choice for local practitioners who have moved beyond the limitations of 8B models and have the hardware to support a 48GB VRAM footprint. It offers a "no-compromise" local AI experience, providing the reasoning depth and context handling required for professional-grade AI applications.