
Meta's third-gen open model. Trained on 15T tokens. Significant leap over Llama 2 in reasoning and coding.
Copy and paste this command to start running the model locally:
ollama run llama3:70b
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 31.0 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 45.7 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 52.7 GB | Very Good | Slightly better quality than Q4 with a moderate size increase |
| Q6_K | 61.1 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 78.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 145.1 GB | Full | Full 16-bit floating point: maximum quality, largest size |
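The VRAM figures above can be sanity-checked with a back-of-envelope calculation: multiply the parameter count by the average bits per weight. This is a rough sketch; the bits-per-weight values are approximate (K-quants mix precisions across tensors), so real GGUF files run a few GB larger than this weights-only estimate, which is why the table's numbers sit slightly higher.

```python
# Rough weights-only VRAM estimate for a 70B dense model at common
# GGUF quantization levels. Bits-per-weight values are approximate
# averages; real files include metadata and mixed-precision tensors.
PARAMS = 70e9

BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weights_gb(fmt: str) -> float:
    """Weights-only footprint in GB; excludes KV cache and activations."""
    return PARAMS * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>7}: ~{weights_gb(fmt):6.1f} GB")
```

Note that these estimates exclude the KV cache and activation buffers, so you need headroom beyond the raw weights figure.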
See which devices can run this model and at what quality level.

| Device | Tier | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H100 SXM5 80GB | SS | 59.0 tok/s | 45.7 GB |
| Google Cloud TPU v5p | SS | 48.7 tok/s | 45.7 GB |
| NVIDIA A100 SXM4 80GB | SS | 35.9 tok/s | 45.7 GB |
| NVIDIA H200 SXM 141GB | SS | 84.6 tok/s | 45.7 GB |
| NVIDIA B200 | SS | 141.0 tok/s | 45.7 GB |
Meta’s Llama 3 70B Instruct represents the current high-water mark for open-weights models in the 70B parameter class. Trained on a massive 15 trillion token dataset—seven times larger than that of Llama 2—this model is designed to compete directly with proprietary models like GPT-4 in reasoning, coding, and instruction following. For practitioners, Llama 3 70B Instruct is the primary choice for local deployments where high-tier reasoning is required but data privacy or latency requirements preclude the use of cloud APIs.
While the 8B version of Llama 3 is suitable for edge devices and mid-range laptops, the 70B variant is a "heavyweight" model that requires significant VRAM to operate. It occupies a critical niche in the local AI ecosystem: it is small enough to fit on high-end consumer hardware (like multi-GPU setups or Mac Studios) while being sophisticated enough to handle complex agentic workflows and multi-step logical deductions that smaller models fail to execute reliably.
Llama 3 70B Instruct utilizes a standard dense transformer architecture. Unlike Mixture of Experts (MoE) models that only activate a fraction of their parameters during inference, Llama 3 70B is a dense model where all 70 billion parameters are active for every token generated. This results in high-quality output but places a higher demand on compute and memory bandwidth.
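The dense-versus-MoE trade-off can be made concrete with a rule-of-thumb estimate: generating one token costs roughly 2 FLOPs per *active* parameter. Using the ~13B active parameters attributed to Mixtral 8x7B later in this article, a quick sketch of the per-token compute gap looks like this (the 2-FLOPs-per-parameter figure is a standard approximation, not an exact measurement):

```python
# Per-token decode compute scales with *active* parameters:
# a common rule of thumb is ~2 FLOPs per active parameter per token.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_llama3_70b = flops_per_token(70e9)  # all 70B weights active
moe_mixtral_8x7b = flops_per_token(13e9)  # ~13B active per token

print(f"Llama 3 70B:  {dense_llama3_70b / 1e9:.0f} GFLOPs/token")
print(f"Mixtral 8x7B: {moe_mixtral_8x7b / 1e9:.0f} GFLOPs/token")
print(f"dense/MoE ratio: {dense_llama3_70b / moe_mixtral_8x7b:.1f}x")
```

Note that an MoE model still needs all of its weights resident in memory; the saving is in compute and bandwidth per token, not in total footprint.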
A key technical refinement in this generation is the implementation of Grouped Query Attention (GQA). This improves inference efficiency by reducing the memory overhead of the KV (Key-Value) cache, which is particularly beneficial when handling the model's 8,192-token context length. While 8k tokens is shorter than some contemporary competitors, Meta’s focus was on maximizing the "information density" and reasoning quality within that window.
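The KV-cache saving from GQA can be estimated from the model's published shape: 80 layers, 64 query heads sharing 8 KV heads, and a head dimension of 128. A short sketch comparing the GQA cache against a hypothetical full multi-head (MHA) cache at the 8,192-token context:

```python
# KV cache size for Llama 3 70B at its full 8,192-token context.
# GQA stores n_kv_heads (8) per layer instead of n_heads (64),
# so the cache shrinks by a factor of 8.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # Factor of 2 for separate Key and Value tensors; fp16 elements.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, ctx_len=8192)
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, ctx_len=8192)

print(f"GQA (8 KV heads):  {gqa / 2**30:.1f} GiB")   # 2.5 GiB
print(f"MHA (64 KV heads): {mha / 2**30:.1f} GiB")   # 20.0 GiB
```

That 2.5 GiB cache is why a Q4_K_M quantization fits comfortably in 48GB of VRAM alongside the ~45.7 GB of weights; an MHA cache of 20 GiB would not.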
The model also features a significantly upgraded tokenizer with a 128k vocabulary. This allows for more efficient text encoding, resulting in better performance across various languages and a measurable increase in processing speed compared to the Llama 2 architecture. The training cutoff of December 2023 ensures the model has a relatively modern understanding of software libraries and global events.
Llama 3 70B Instruct's performance is characterized by its high "steerability": it follows complex, multi-part instructions with a level of nuance that 7B or 13B models cannot match.
This model is a highly capable pair programmer. It excels at Python, JavaScript, C++, and Rust, handling everything from boilerplate generation to complex debugging. Because it was trained on a significantly larger code corpus than its predecessor, it understands modern framework patterns and can refactor code while maintaining logic across several functions. For local developers, it serves as a private alternative to GitHub Copilot that doesn't require an internet connection.
The Llama 3 70B Instruct reasoning benchmark scores place it at the top of its weight class.
While Meta focused on English for the primary benchmarks, the model demonstrates strong proficiency in over 30 languages, including German, French, Italian, Portuguese, Spanish, and Chinese. It is a viable choice for local translation tasks and multilingual sentiment analysis.
To run Llama 3 70B Instruct locally, you must account for the massive memory footprint of 70 billion parameters. At full FP16 (16-bit) precision, the model requires approximately 140GB of VRAM, which is beyond the reach of consumer hardware. Through quantization, however, the model becomes accessible to enthusiasts and professionals.
VRAM is the primary bottleneck for this model; the quantization table above breaks down the Llama 3 70B Instruct hardware requirements for common quantization levels (GGUF/EXL2).
For Windows or Linux users, the most cost-effective way to run this model at 4-bit precision is a dual RTX 3090 or dual RTX 4090 setup. By pooling the VRAM of two 24GB cards (totaling 48GB), you can fit a Q4_K_M quantization comfortably with enough room for a decent KV cache.
If you want to run a 70B model on a consumer GPU setup with less than 48GB of VRAM, you will have to "offload" some layers to system RAM. This drastically reduces throughput, often dropping performance from 10–15 tokens per second down to 1–2.
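A rough way to reason about offloading, in the style of llama.cpp's `-ngl` (number of GPU layers) flag, is to divide the quantized model size evenly across its 80 transformer layers and see how many fit in free VRAM. This sketch ignores embeddings, the output head, and the KV cache, so treat it as an upper bound rather than an exact setting:

```python
# Estimate how many of Llama 3 70B's 80 transformer layers fit on
# the GPU, assuming the Q4_K_M weights (~45.7 GB) are spread evenly
# across layers. Simplification: ignores embeddings, output head,
# and KV cache, which all consume additional VRAM.
import math

MODEL_GB = 45.7
N_LAYERS = 80
LAYER_GB = MODEL_GB / N_LAYERS  # ~0.57 GB per layer

def gpu_layers(vram_free_gb: float) -> int:
    """Layers that fit in free VRAM; the remainder spills to system RAM."""
    return min(N_LAYERS, math.floor(vram_free_gb / LAYER_GB))

for vram in (12, 24, 48):
    print(f"{vram} GB free VRAM -> ~{gpu_layers(vram)} of {N_LAYERS} layers on GPU")
```

Every layer left on the CPU side forces weights to stream over the comparatively slow PCIe bus each token, which is why partial offloading degrades speed so sharply.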
For Mac users, an M2/M3/M4 Max or Ultra with at least 64GB of Unified Memory is the ideal environment. The M4 Max, in particular, offers the memory bandwidth required to make the model feel snappy and responsive.
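Why memory bandwidth matters so much: token generation is bandwidth-bound, since every token requires streaming the full set of active weights. A crude throughput ceiling is therefore bandwidth divided by model size. The bandwidth figures below are approximate top-configuration numbers, and real-world speeds land well under this ceiling:

```python
# Crude decode-speed ceiling: tokens/s <= memory bandwidth / bytes
# streamed per token (~ the quantized model size for a dense model).
# Bandwidth figures are approximate top-configuration specs.
MODEL_GB = 45.7  # Q4_K_M weights

BANDWIDTH_GBPS = {
    "Apple M4 Max (~546 GB/s)": 546,
    "Apple M2 Ultra (~800 GB/s)": 800,
    "NVIDIA RTX 4090 (~1008 GB/s)": 1008,
}

def peak_tok_s(bandwidth_gbps: float) -> float:
    return bandwidth_gbps / MODEL_GB

for name, bw in BANDWIDTH_GBPS.items():
    print(f"{name}: ~{peak_tok_s(bw):.0f} tok/s ceiling")
```

This also explains why a single 24GB RTX 4090 cannot hit its bandwidth ceiling here: the Q4_K_M weights do not fit in its VRAM in the first place.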
The quickest way to deploy is via Ollama. Once installed, you can run the model with a single command:
ollama run llama3:70b
Ollama defaults to a 4-bit quantization, making it the standard entry point for testing local performance.
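Beyond the interactive CLI, Ollama also serves a local REST API on port 11434, which is how you wire the model into scripts or applications. A minimal stdlib-only sketch against the `/api/generate` endpoint (assumes the server is running and the `llama3:70b` model has been pulled):

```python
# Query a locally running Ollama server over its REST API
# (default port 11434). Standard library only.
import json
import urllib.request

def build_request(prompt: str) -> dict:
    # stream=False returns a single JSON object instead of a token stream
    return {"model": "llama3:70b", "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server):
#   print(generate("Explain Grouped Query Attention in two sentences."))
```

Because everything stays on localhost, this gives you an OpenAI-style request/response loop with no data ever leaving the machine.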
Among open-weights options in the 70B parameter class, Llama 3 70B Instruct is usually compared against Mixtral 8x7B and Qwen2 72B.
Mixtral 8x7B is a Mixture of Experts (MoE) model. While Mixtral is faster to run (it only uses ~13B active parameters per token), Llama 3 70B Instruct generally provides superior reasoning and "common sense" logic. Mixtral requires less compute power but can be finicky with instruction following compared to the highly refined Llama 3 Instruct tunes.
Qwen2 72B (by Alibaba) is one of the few models that consistently challenges Llama 3 70B on coding and math benchmarks. Qwen2 often performs better in multilingual tasks (specifically Asian languages) and has a much larger context window (128k). However, Llama 3 70B Instruct remains the industry standard for English-centric workflows due to its massive ecosystem support and the quality of its fine-tunes.
Choosing Llama 3 70B Instruct is a bet on the most supported open-weights ecosystem in existence. Whether you are building a local RAG (Retrieval-Augmented Generation) system or a private coding assistant, this model provides the necessary intelligence to handle production-grade tasks on local hardware.