
Meta's mid-size Llama 2 model. Good balance of performance and hardware requirements. 4K context.
Copy and paste this command to start running the model locally:

`ollama run llama2:13b`
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 5.7 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 8.5 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 9.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 11.3 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 14.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 26.9 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
| :--- | :--- | :--- | :--- | :--- |
| Google Cloud TPU v5e | Google | SS | 77.9 tok/s | 8.5 GB |
| Intel Arc A770 16GB | Intel | SS | 53.2 tok/s | 8.5 GB |
| Intel Arc B580 | Intel | SS | 43.4 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 4070 | NVIDIA | SS | 47.9 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 5070 | NVIDIA | SS | 63.9 tok/s | 8.5 GB |
| NVIDIA L40S | NVIDIA | SS | 82.2 tok/s | 8.5 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | AA | 193.9 tok/s | 8.5 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | AA | 318.5 tok/s | 8.5 GB |
| Google Cloud TPU v5p | Google | AA | 262.9 tok/s | 8.5 GB |
Llama 2 13B Chat is the mid-tier entry in Meta’s second-generation family of large language models. Positioned between the lightweight 7B model and the resource-intensive 70B variant, the 13B model is frequently cited as the "sweet spot" for local deployment. It provides a significant step up in reasoning and nuance over 7B models while remaining small enough to run on high-end consumer hardware without requiring enterprise-grade A100 or H100 GPUs.
As a dense transformer model, Llama 2 13B Chat is optimized for dialogue and instruction-following. It was trained on 2 trillion tokens and refined using Reinforcement Learning from Human Feedback (RLHF) to ensure safer and more helpful interactions compared to the base foundation models. For practitioners, this model represents a stable, well-documented baseline for local RAG (Retrieval-Augmented Generation) applications and private chat interfaces where data sovereignty is a priority.
The architecture of Llama 2 13B Chat follows a standard dense transformer decoder-only structure. Unlike Mixture-of-Experts (MoE) models that only activate a fraction of their parameters during inference, this model utilizes all 13 billion parameters for every token generated. This results in highly predictable memory usage and compute requirements.
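The dense-vs-MoE distinction can be made concrete with back-of-the-envelope arithmetic. The Mixtral 8x7B figures below are approximate public numbers used purely for contrast; the point is that a dense model touches 100% of its weights on every token.

```python
def active_params_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of model weights that participate in each generated token."""
    return active_params_b / total_params_b

# Dense Llama 2 13B: every parameter is used in every forward pass.
dense = active_params_fraction(13, 13)

# A sparse MoE such as Mixtral 8x7B (approximate figures):
# ~47B total parameters, ~13B active per token (2 of 8 experts routed).
moe = active_params_fraction(47, 13)

print(f"Dense 13B: {dense:.0%} of weights active per token")
print(f"MoE 8x7B:  {moe:.0%} of weights active per token")
```

This is why the document's claim about predictability holds: a dense model's per-token compute and memory traffic are constant, while an MoE's effective footprint depends on routing.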
The 4,096-token context length is a defining characteristic of the Llama 2 era. While newer models like Llama 3.1 offer significantly larger windows (up to 128k), the 4k limit on Llama 2 13B Chat is sufficient for standard chat interactions, short-form summarization, and basic RAG pipelines. However, practitioners should be aware that performance may degrade as the context window fills, particularly in complex multi-turn conversations.
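To see what the 4k window implies for a basic RAG pipeline, here is a minimal sketch that budgets the context between a system prompt, retrieved chunks, and a reserve for the model's answer. The 4-characters-per-token estimate and the budget constants are rough assumptions for illustration, not the model's real tokenizer.

```python
CONTEXT_LIMIT = 4096        # Llama 2 context window, in tokens
RESERVE_FOR_OUTPUT = 512    # tokens kept free for the generated answer
SYSTEM_PROMPT_TOKENS = 200  # rough budget for instructions / persona

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def fit_chunks(chunks: list[str]) -> list[str]:
    """Greedily pack retrieved chunks (pre-sorted by relevance) into the
    remaining context budget, stopping at the first chunk that won't fit."""
    budget = CONTEXT_LIMIT - RESERVE_FOR_OUTPUT - SYSTEM_PROMPT_TOKENS
    selected = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    return selected
```

With only ~3,400 tokens left after overhead, a 4k-context model fits just a handful of document chunks per query, which is why newer long-context models ease RAG design considerably.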
Llama 2 13B Chat is specifically tuned for instruction-following and conversational flows. It excels in environments where the model needs to adhere to a specific persona or follow structured system prompts. Because it was trained with a heavy emphasis on safety and alignment, it is less likely to generate "hallucinated" toxic content compared to unaligned base models, though this can sometimes lead to overly cautious refusals.
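Persona and system-prompt adherence depend on feeding the model Meta's official Llama 2 Chat template, built around `[INST]` and `<<SYS>>` markers. A minimal single-turn formatter is sketched below; note that tools like Ollama apply this template for you automatically, so you only need it when calling the raw model.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Format a single-turn prompt using Llama 2 Chat's [INST] template."""
    return (
        f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
        f"{user} [/INST]"
    )

prompt = build_llama2_prompt(
    system="You are a concise assistant. Answer in one sentence.",
    user="What is quantization?",
)
```

Deviating from this template (for example, using ChatML-style `<|im_start|>` markers) noticeably degrades instruction-following, since the model only saw this exact format during fine-tuning.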
While it can handle basic coding tasks in Python or JavaScript, it is not a specialized coding model. If your primary goal is software development, a model like CodeLlama or DeepSeek-Coder would be more appropriate.
The primary reason to run Llama 2 13B Chat locally is the ability to achieve high-quality inference on consumer-grade hardware. To do this effectively, you must understand the relationship between parameter count, quantization, and VRAM.
In its uncompressed 16-bit (FP16) state, the model requires approximately 26GB of VRAM just to load the weights. This exceeds the capacity of the flagship NVIDIA RTX 4090 (24GB). Therefore, almost all local practitioners use quantization (compressing the weights) to fit the model into available memory.
| Quantization Level | VRAM Requirement (Approx.) | Recommended Hardware |
| :--- | :--- | :--- |
| Q8_0 (8-bit) | ~14-15 GB | RTX 3090, 4090, 4060 Ti 16GB |
| Q4_K_M (4-bit) | ~8-9 GB | RTX 3060 12GB, 4070, 4070 Super |
| Q2_K (2-bit) | ~5-6 GB | Mid-range laptops, older 8GB GPUs |
For the vast majority of users, the best quantization for Llama 2 13B Chat is Q4_K_M. This 4-bit implementation offers a massive reduction in memory usage with a negligible hit to perplexity (accuracy).
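The VRAM figures above can be approximated from first principles: weight memory is parameter count times bits per weight, plus overhead for the KV cache and runtime buffers. The effective bits-per-weight values below are approximate rates for llama.cpp's K-quants (which mix precisions within each block), and the 5% overhead multiplier is a rough assumption.

```python
PARAMS = 13e9  # Llama 2 13B parameter count

# Approximate effective bits per weight for common GGUF quantizations.
BITS_PER_WEIGHT = {
    "Q2_K": 3.35,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def estimate_vram_gb(quant: str, overhead: float = 1.05) -> float:
    """Weight memory in GB, with a rough multiplier for runtime overhead."""
    bytes_total = PARAMS * BITS_PER_WEIGHT[quant] / 8
    return bytes_total * overhead / 1e9
```

Running this for Q4_K_M gives roughly 8.3 GB, in line with the ~8-9 GB figure in the table; the same arithmetic explains why FP16 (~27 GB) overflows a 24 GB RTX 4090.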
To achieve acceptable Llama 2 13B Chat performance, your choice of hardware is critical. A GPU with enough VRAM to hold the entire quantized model is the ideal setup. CPU-only inference is possible via llama.cpp, though Llama 2 13B Chat tokens per second will drop significantly (typically 2-5 t/s).

Using a tool like Ollama is the fastest way to get started. By running `ollama run llama2:13b`, the software automatically handles the quantization and memory allocation. On an RTX 3080 or better, you can expect Llama 2 13B Chat tokens per second to range between 40 and 70, which is faster than the average human reading speed.
When evaluating Llama 2 13B Chat hardware requirements and performance, it is helpful to compare it against its closest competitors in the open-weights space.
Llama 3 8B is the successor to the Llama 2 line. Despite having fewer parameters, Llama 3 8B generally outperforms Llama 2 13B in benchmarks due to being trained on a much larger and higher-quality dataset (15 trillion tokens vs 2 trillion).
Mistral 7B is a perennial favorite for local hosting. Despite having roughly half the parameters, it matches or exceeds Llama 2 13B on most standard benchmarks while requiring far less VRAM, making it the stronger choice when memory is tight.
Google's Gemma 2 9B is a more modern alternative. It utilizes a different architecture (sliding window attention in some versions) that allows it to punch significantly above its weight class. In 2025, Gemma 2 9B is generally considered more capable for general-purpose reasoning, but Llama 2 13B remains the "old reliable" for those who need a model with a massive community ecosystem and guaranteed compatibility with every local LLM loader in existence.