
Meta's original open LLM that kickstarted the open-source AI revolution. A 65B-parameter dense model. Initially released for research only, its weights were leaked and spurred massive community development.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 25.6 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 39.3 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 45.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 53.6 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 69.8 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 131.6 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Est. Speed (Q4_K_M) | VRAM Used |
|---|---|---|
| NVIDIA A100 SXM4 80GB | 41.8 tok/s | 39.3 GB |
| NVIDIA H100 SXM5 80GB | 68.7 tok/s | 39.3 GB |
| Google Cloud TPU v5p | 56.7 tok/s | 39.3 GB |
| NVIDIA H200 SXM 141GB | 98.4 tok/s | 39.3 GB |
| NVIDIA B200 | 164.0 tok/s | 39.3 GB |
| NVIDIA L40S | 17.7 tok/s | 39.3 GB |
LLaMA 65B is the flagship variant of Meta’s first-generation Large Language Model Meta AI (LLaMA) release. While it has since been succeeded by Llama 2 and Llama 3, the 65B model remains a significant milestone in the history of local LLMs. It was the first model of this scale to demonstrate that high-parameter performance could be achieved on consumer-accessible hardware through aggressive quantization and optimization.
As a dense, 65-billion parameter model, LLaMA 65B was designed to provide GPT-3 level performance within a footprint that developers could manage on high-end workstations. For practitioners, this model represents the foundation of the open-source fine-tuning movement; it was the base for legendary early fine-tunes like Vicuna and Alpaca. Today, it serves as a benchmark for testing how far architectural efficiency has come, though it still holds its own in specific reasoning and logic tasks where its dense parameter count provides a "stability" that smaller, more modern models sometimes lack.
The architecture of LLaMA 65B is a standard decoder-only Transformer. However, it introduced several optimizations that have since become industry standards for efficient local inference. Unlike the Mixture of Experts (MoE) models that have recently gained popularity, LLaMA 65B is a dense model. This means every one of its 65 billion parameters is active during every inference pass.
From a hardware perspective, dense models like LLaMA 65B offer a predictable VRAM-to-performance ratio but require significant memory bandwidth to maintain acceptable speeds. The model utilizes rotary positional embeddings (RoPE), the SwiGLU activation function, and RMSNorm pre-normalization, design choices that carried forward into later Llama generations.
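The dense-versus-MoE distinction above can be sketched numerically. The snippet below compares LLaMA 65B against a hypothetical ~47B Mixture-of-Experts model with 8 experts and 2 active per token; the split between shared and expert weights is an illustrative approximation, not an exact layer-by-layer count.

```python
# Sketch: active parameters per generated token, dense vs. MoE.

def active_params_dense(total_params: float) -> float:
    """A dense model activates every parameter on every forward pass."""
    return total_params

def active_params_moe(total_params: float, shared_fraction: float,
                      num_experts: int, experts_per_token: int) -> float:
    """Rough MoE estimate: shared weights always run; only k of n experts do."""
    shared = total_params * shared_fraction
    expert_pool = total_params - shared
    return shared + expert_pool * experts_per_token / num_experts

llama_65b = active_params_dense(65e9)               # all 65B weights touched
moe_47b = active_params_moe(46.7e9, 0.035, 8, 2)    # roughly 13B active
print(f"{llama_65b / 1e9:.1f}B vs ~{moe_47b / 1e9:.1f}B active per token")
```

This is why a dense 65B model must stream far more weight data per token than an MoE of similar total size, which is exactly the bandwidth pressure described above.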
The primary limitation for modern practitioners is the 2,048 token context length. By 2025 standards, this is quite narrow. While there are techniques to extend this context (such as RoPE scaling), the base model is optimized for shorter-form interactions, code snippets, and logic-heavy tasks rather than long-document analysis.
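One common extension technique, linear RoPE position interpolation, can be sketched as follows: positions are compressed by `trained_ctx / target_ctx` before the rotation angles are computed, so positions beyond the trained window map back into the trained range. The function below is a simplified illustration, not the full attention implementation.

```python
# Sketch: linear RoPE position interpolation ("RoPE scaling").

def rope_angles(position: float, dim: int, scale: float = 1.0,
                base: float = 10000.0) -> list[float]:
    """Rotation angles for one position across a head dimension `dim`."""
    pos = position * scale  # linear interpolation squeezes positions
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

trained_ctx, target_ctx = 2048, 4096
scale = trained_ctx / target_ctx  # 0.5
# Compressed position 4095 lands where position 2047.5 did at training time:
assert rope_angles(4095, 128, scale) == rope_angles(2047.5, 128)
```

In practice this usually requires a short fine-tune at the extended length to recover quality, so the stock 65B weights remain best suited to the native 2k window.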
LLaMA 65B was trained on 1.4 trillion tokens, focusing heavily on publicly available data like CommonCrawl, C4, and GitHub. This makes it a robust general-purpose engine for text and code.
The 65B parameter count allows for a level of emergent reasoning that is often absent in 7B or 13B models. It is particularly effective for multi-step logic problems where smaller models might "hallucinate" a shortcut. If you are running LLaMA 65B locally for research purposes, you will find it handles complex instructions with a higher degree of adherence than its smaller first-gen siblings.
For those using LLaMA 65B for coding, the model provides strong support for Python, C++, Java, and JavaScript. Because it was trained on the GitHub dataset, it understands structural programming patterns and can assist in debugging or generating boilerplate code. However, due to the 2k context window, it is best suited for function-level tasks rather than refactoring entire repositories.
If your goal is to create a specialized local AI model with 65B parameters in 2025, the original LLaMA weights are often used as a "clean" baseline for research into alignment and instruction tuning. Its non-commercial license dictates that it remains a tool for practitioners and researchers rather than commercial product developers.
The biggest hurdle to running LLaMA 65B locally is the memory requirement. Because it is a dense model, the VRAM footprint is non-negotiable and scales directly with the precision (quantization) you choose.
To calculate your needs, use the estimates in the quantization table above; those figures cover the model weights alone, excluding context (KV-cache) overhead.
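The weight footprint can be estimated directly from the parameter count and the effective bits per weight. The bits-per-weight values below are approximations: quantized GGUF formats carry per-block scale metadata, so the effective width sits slightly above the nominal bit width.

```python
# Sketch: rough weights-only VRAM estimate (decimal GB), excluding KV cache.

def weights_gb(params: float, bits_per_weight: float) -> float:
    """Footprint of the model weights alone, in GB."""
    return params * bits_per_weight / 8 / 1e9

PARAMS = 65e9  # LLaMA 65B, dense
for name, bpw in [("Q4_K_M", 4.85), ("Q5_K_M", 5.69),
                  ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"{name}: ~{weights_gb(PARAMS, bpw):.1f} GB")
```

These estimates land close to the table above; add a few extra GB on top for context and runtime buffers.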
For a smooth experience, LLaMA 65B at Q4 typically calls for either a single card with 48 GB+ of VRAM or a multi-GPU setup.
For local inference with llama.cpp, we recommend the Q4_K_M GGUF format for most users. This quantization level reduces the model size to approximately 40 GB while retaining most of the accuracy of the FP16 original. If you have more headroom, Q5_K_M offers a slight bump in logic retention at the cost of roughly 5-7 GB of additional VRAM.
When considering LLaMA 65B tokens per second (t/s), your memory bandwidth is the bottleneck.
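Because every generated token must stream the full dense weight set from memory, an upper bound on decode speed is simply bandwidth divided by model size. The bandwidth figures below are published spec-sheet numbers, and the 0.8 efficiency factor is an assumed rule of thumb, not a measured constant.

```python
# Sketch: memory-bandwidth ceiling on decode (tokens/second) for a dense model.

def max_tokens_per_sec(bandwidth_gbs: float, model_gb: float,
                       efficiency: float = 0.8) -> float:
    """Each token streams the full weight set, so bandwidth / size bounds t/s."""
    return bandwidth_gbs / model_gb * efficiency

Q4_SIZE_GB = 39.3  # Q4_K_M footprint from the table above
for gpu, bw in [("A100 80GB SXM", 2039), ("H100 SXM5", 3350), ("H200", 4800)]:
    print(f"{gpu}: <= {max_tokens_per_sec(bw, Q4_SIZE_GB):.0f} tok/s")
```

Note how the estimates track the compatibility table above: doubling bandwidth roughly doubles throughput at the same quantization level.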
The quickest way to get started is Ollama. After installing Ollama, you can run the model with a simple command:
```shell
ollama run llama:65b
```
Ollama will automatically handle the quantization and memory offloading if you have multiple GPUs. For more granular control over VRAM offloading, use LM Studio or KoboldCPP, which allow you to specify exactly how many layers to put on the GPU versus system RAM.
When evaluating LLaMA 65B, it is essential to compare it against its direct successors and modern alternatives in the same weight class.
Llama 2 70B is the direct evolution of this model. Llama 2 was trained on 2 trillion tokens (40% more than 65B) and has a doubled context length of 4,096 tokens. In almost every benchmark, Llama 2 70B outperforms 65B, particularly in dialogue and safety. However, some practitioners still prefer the original 65B for specific fine-tuning tasks because it lacks some of the "over-refusal" behaviors seen in later Meta releases.
There is no contest here in terms of raw intelligence; Llama 3 70B is significantly more capable, trained on 15 trillion tokens with an 8k context window. However, LLaMA 65B performance remains relevant for those studying the evolution of model weights or those who require a model with the specific 2022 training cutoff for temporal consistency in research.
Mixtral 8x7B is a Mixture of Experts model. While it has a similar total parameter count, it only uses about 12B parameters per token. This makes Mixtral significantly faster to run than LLaMA 65B. If your priority is LLaMA 65B hardware requirements and you find them too steep for your current rig, Mixtral 8x7B offers a similar level of "intelligence" with much faster inference speeds, though LLaMA 65B’s dense architecture can sometimes feel more coherent in long-form logic.
In summary, LLaMA 65B is a legacy powerhouse. While newer models offer better efficiency and longer context, the 65B remains a staple for practitioners who want to experiment with a large, dense, foundational model on local multi-GPU hardware.