
Meta's second-gen open LLM that opened the floodgates for commercial open-weight AI. 70B dense, 4K context. Trained on 2T tokens. On par with ChatGPT/PaLM at launch.
Copy and paste this command to start running the model locally:

```shell
ollama run llama2:70b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 28.7 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 43.4 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 50.4 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 58.8 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 76.3 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 142.8 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Tier | Throughput | VRAM Used |
|---|---|---|---|
| NVIDIA H100 SXM5 80GB | SS | 62.1 tok/s | 43.4 GB |
| Google Cloud TPU v5p | SS | 51.3 tok/s | 43.4 GB |
| NVIDIA A100 SXM4 80GB | SS | 37.8 tok/s | 43.4 GB |
| NVIDIA H200 SXM 141GB | SS | 89.0 tok/s | 43.4 GB |
| NVIDIA B200 | SS | 148.4 tok/s | 43.4 GB |
Llama 2 70B Chat is the flagship dense large language model (LLM) from Meta’s second-generation open-weight series. Released as a direct competitor to proprietary models like GPT-3.5, it represented a massive leap in performance for the open-source community upon its release. Built on a transformer-based architecture with 70 billion parameters, this model was trained on 2 trillion tokens and fine-tuned specifically for dialogue and instruction-following.
While newer models have since entered the market, Llama 2 70B Chat remains a foundational benchmark for local AI practitioners. It occupies the "prosumer" tier of local LLMs—too large for a single standard consumer GPU at full precision, but highly performant on multi-GPU setups or high-end Mac Silicon. It is designed for users who require high-reasoning capabilities, stable instruction-following, and a permissive license for commercial applications.
The architecture of Llama 2 70B Chat is a standard decoder-only transformer, but it introduced several optimizations that improved efficiency over the original Llama series. Most notably, the 70B variant utilizes Grouped-Query Attention (GQA). Unlike the smaller 7B and 13B versions of Llama 2, which used Multi-Head Attention (MHA), GQA allows the 70B model to maintain high inference speeds by sharing key and value projections across multiple attention heads. This significantly reduces the memory bandwidth required during the KV cache lookup, which is often the primary bottleneck for Llama 2 70B Chat performance.
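The memory saving from GQA is easy to quantify. Here is a back-of-the-envelope sketch using the published Llama 2 70B dimensions (80 layers, head dimension 128, 64 query heads, and 8 shared KV heads under GQA); the MHA figure is the hypothetical cost had the 70B kept one KV head per query head:

```python
# KV cache size for Llama 2 70B at its full 4,096-token context.
# The cache stores one K and one V vector per layer, per KV head,
# per token; FP16 means 2 bytes per element.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Factor of 2 covers the separate K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

mha = kv_cache_bytes(80, 64, 128, 4096)  # hypothetical MHA variant
gqa = kv_cache_bytes(80, 8, 128, 4096)   # actual 70B with 8 KV heads

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB")
```

Cutting the full-context cache from 10 GiB to 1.25 GiB is why the 70B can keep its KV cache resident alongside quantized weights on a dual-GPU setup.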
The model features a context length of 4,096 tokens. While this is shorter than the 128k context windows seen in 2024 and 2025 models, it is sufficient for standard chat interactions, code generation, and short-form RAG (Retrieval-Augmented Generation) tasks. The model was trained with a cutoff of September 2022, meaning it lacks knowledge of events or software updates after that period unless supplemented with external data through RAG.
Key technical specifications include:

- Parameters: 70 billion (dense)
- Architecture: decoder-only transformer with Grouped-Query Attention (GQA)
- Context length: 4,096 tokens
- Training data: 2 trillion tokens
- Knowledge cutoff: September 2022
- License: Llama 2 Community License (permits most commercial use)
Llama 2 70B Chat is optimized for instruction-following and multi-turn dialogue. Its scale allows for a level of nuance and "common sense" reasoning that smaller 7B or 13B models typically lack.
The model excels at adhering to complex system prompts. If you need a model to maintain a specific persona, follow strict formatting rules (like JSON output), or operate within a narrow set of constraints, the 70B parameter count provides the necessary "brain power" to avoid instruction drift during longer conversations.
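Strict formatting works best when the prompt follows the chat template the model was fine-tuned on. Runtimes such as Ollama apply this template automatically, but it is worth seeing explicitly; a minimal sketch of the published Llama 2 chat format:

```python
# Llama 2's chat fine-tune expects the system prompt inside a
# <<SYS>> block within the first [INST] turn. The model's reply
# is generated after the closing [/INST].

def llama2_prompt(system: str, user: str) -> str:
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system}\n"
        "<</SYS>>\n\n"
        f"{user} [/INST]"
    )

print(llama2_prompt(
    "You are a strict JSON API. Reply with valid JSON only.",
    "List three Llama 2 variants as a JSON array.",
))
```

Deviating from this template (for example, omitting the `<<SYS>>` markers) is a common cause of persona drift and broken formatting in multi-turn use.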
While not a specialized "CodeLlama" variant, the 70B chat model is highly capable of generating Python, C++, and JavaScript. It is effective for debugging existing code or explaining complex architectural patterns. However, for greenfield development of large-scale applications, practitioners often use it as a "reviewer" model to check logic rather than a pure generator.
Because it was trained on a massive corpus of text, the model is excellent at stylistic transformation—rewriting technical documentation for a general audience or summarizing long transcripts. Its 4K context window allows it to digest roughly 3,000 words of input while leaving enough room for a detailed response.
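The ~3,000-word figure follows from a simple budget. A rough sketch, using the common rule of thumb of about 1.3 tokens per English word (an approximation — real counts depend on the tokenizer and the text):

```python
# Context budgeting for a 4,096-token window: how many input words
# fit after reserving space for the model's reply.

CONTEXT = 4096
TOKENS_PER_WORD = 1.3  # rule of thumb for English prose

def max_input_words(reserved_for_reply: int) -> int:
    return int((CONTEXT - reserved_for_reply) / TOKENS_PER_WORD)

print(max_input_words(500))  # leave ~500 tokens for a detailed response
```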
To run Llama 2 70B Chat locally, the primary hurdle is the sheer size of the model weights. In its unquantized 16-bit (FP16) state, the model requires approximately 140GB of VRAM. This is beyond the reach of any single consumer or professional workstation GPU like the RTX 6000 Ada. Consequently, quantization is mandatory for almost all local deployments.
Your hardware choice depends entirely on the level of quantization you are willing to accept. Quantization reduces the precision of the weights (e.g., from 16-bit to 4-bit), which drastically lowers VRAM usage with a minimal hit to perplexity (intelligence).
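The VRAM figures in the quantization table above can be sanity-checked with a rough estimate: weight storage is approximately parameters × bits-per-weight ÷ 8. The effective bits-per-weight values below are approximate llama.cpp figures (k-quants store some tensors at higher precision), and real deployments need extra headroom for the KV cache and activations:

```python
# Approximate weight-storage cost per quantization format for a
# 70B-parameter model. Bits-per-weight values are rough effective
# averages, not exact format definitions.

PARAMS = 70e9
BITS = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "FP16": 16.0}

for name, bpw in BITS.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")
```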
For most users, the Q4_K_M (4-bit) quantization is the sweet spot. It reduces the model to roughly 42GB, fitting comfortably into a 48GB VRAM pool (such as two RTX 3090s) while keeping output quality close to the FP16 original. If you have more memory, Q5_K_M offers a slight quality improvement, but the returns diminish quickly relative to the extra memory footprint.
Llama 2 70B Chat generation speed (tokens per second) varies primarily with memory bandwidth rather than raw compute, since decoding is bottlenecked by streaming the weights and KV cache; the device table above gives indicative figures.
The best GPU for Llama 2 70B Chat in a consumer budget is a pair of used RTX 3090s. To get the model running quickly, Ollama is the recommended tool. After installing Ollama, you can run the model with a single command:
```shell
ollama run llama2:70b
```
This pulls a pre-quantized build of the weights and handles memory management for your specific hardware, including partial CPU offload if VRAM runs short.
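Beyond the CLI, the Ollama daemon also serves a local REST API (by default at http://localhost:11434). As a sketch, the snippet below builds a request body for its `/api/generate` endpoint; POST it with any HTTP client while the daemon is running:

```python
import json

# Build a request body for Ollama's /api/generate endpoint.
# With "stream": False, the daemon returns a single JSON object
# instead of a stream of partial responses.

def generate_request(prompt: str, model: str = "llama2:70b") -> str:
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
    }
    return json.dumps(body)

print(generate_request("Explain grouped-query attention in one sentence."))
```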
When evaluating this model, it is important to compare it against its successor and its primary architectural rival.
Llama 3 70B is the direct successor and is objectively superior in almost every metric. Llama 3 was trained on 15T tokens (vs 2T for Llama 2) and has a much larger context window (8k to 128k depending on the version). However, Llama 2 70B Chat is still used in legacy pipelines where specific fine-tunes were built on its architecture, or by users who prefer its specific "personality," which some find less prone to the aggressive safety filtering seen in early Llama 3 releases.
Mixtral 8x7B is a Mixture-of-Experts (MoE) model. While its total parameter count is in a similar class (roughly 47B), it activates only about 13B parameters per token during inference, which makes it dramatically faster to run; Mistral reported it matching or beating Llama 2 70B on most benchmarks at a fraction of the compute.
As a 70B-parameter local model in 2025, Llama 2 70B remains a viable option for those running older hardware configurations or those specifically studying the evolution of Meta's fine-tuning methodology. While Llama 3 has taken the lead on raw performance, the hardware requirements of Llama 2 70B Chat established the dual-GPU standard that remains the target for most serious local AI enthusiasts today.