
Meta's mid-size Llama 2 model. Good balance of performance and hardware requirements. 4K context.
Copy and paste this command to start running the model locally:

`ollama run llama2:13b`
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 5.7 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 8.5 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 9.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 11.3 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 14.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 26.9 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
| :--- | :--- | :--- | :--- | :--- |
| Google Cloud TPU v5e | Google | SS | 77.9 tok/s | 8.5 GB |
| Intel Arc A770 16GB | Intel | SS | 53.2 tok/s | 8.5 GB |
| Intel Arc B580 | Intel | SS | 43.4 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 4070 | NVIDIA | SS | 47.9 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 5070 | NVIDIA | SS | 63.9 tok/s | 8.5 GB |
| NVIDIA L40S | NVIDIA | SS | 82.2 tok/s | 8.5 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | AA | 193.9 tok/s | 8.5 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | AA | 318.5 tok/s | 8.5 GB |
| Google Cloud TPU v5p | Google | AA | 262.9 tok/s | 8.5 GB |
Llama 2 13B Chat is the mid-tier entry in Meta’s second-generation family of large language models. Positioned between the lightweight 7B model and the resource-intensive 70B variant, the 13B model is frequently cited as the "sweet spot" for local deployment. It provides a significant step up in reasoning and nuance over 7B models while remaining small enough to run on high-end consumer hardware without requiring enterprise-grade A100 or H100 GPUs.
As a dense transformer model, Llama 2 13B Chat is optimized for dialogue and instruction-following. It was trained on 2 trillion tokens and refined using Reinforcement Learning from Human Feedback (RLHF) to ensure safer and more helpful interactions compared to the base foundation models. For practitioners, this model represents a stable, well-documented baseline for local RAG (Retrieval-Augmented Generation) applications and private chat interfaces where data sovereignty is a priority.
The architecture of Llama 2 13B Chat follows a standard dense transformer decoder-only structure. Unlike Mixture-of-Experts (MoE) models that only activate a fraction of their parameters during inference, this model utilizes all 13 billion parameters for every token generated. This results in highly predictable memory usage and compute requirements.
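The dense-vs-MoE distinction can be made concrete with back-of-the-envelope arithmetic. The Mixtral 8x7B figures below are approximate public numbers used purely for contrast; the point is that a dense model touches 100% of its weights on every token.

```python
def active_params_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of model weights that participate in each generated token."""
    return active_params_b / total_params_b

# Dense Llama 2 13B: every parameter is used in every forward pass.
dense = active_params_fraction(13, 13)

# A sparse MoE such as Mixtral 8x7B (approximate figures):
# ~47B total parameters, ~13B active per token (2 of 8 experts routed).
moe = active_params_fraction(47, 13)

print(f"Dense 13B: {dense:.0%} of weights active per token")
print(f"MoE 8x7B:  {moe:.0%} of weights active per token")
```

This is why the document's claim about predictability holds: a dense model's per-token compute and memory traffic are constant, while an MoE's effective footprint depends on routing.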
The 4,096-token context length is a defining characteristic of the Llama 2 era. While newer models like Llama 3.1 offer significantly larger windows (up to 128k), the 4k limit on Llama 2 13B Chat is sufficient for standard chat interactions, short-form summarization, and basic RAG pipelines. However, practitioners should be aware that performance may degrade as the context window fills, particularly in complex multi-turn conversations.
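To see what the 4k window implies for a basic RAG pipeline, here is a minimal sketch that budgets the context between a system prompt, retrieved chunks, and a reserve for the model's answer. The 4-characters-per-token estimate and the budget constants are rough assumptions for illustration, not the model's real tokenizer.

```python
CONTEXT_LIMIT = 4096        # Llama 2 context window, in tokens
RESERVE_FOR_OUTPUT = 512    # tokens kept free for the generated answer
SYSTEM_PROMPT_TOKENS = 200  # rough budget for instructions / persona

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def fit_chunks(chunks: list[str]) -> list[str]:
    """Greedily pack retrieved chunks (pre-sorted by relevance) into the
    remaining context budget, stopping at the first chunk that won't fit."""
    budget = CONTEXT_LIMIT - RESERVE_FOR_OUTPUT - SYSTEM_PROMPT_TOKENS
    selected = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    return selected
```

With only ~3,400 tokens left after overhead, a 4k-context model fits just a handful of document chunks per query, which is why newer long-context models ease RAG design considerably.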
Llama 2 13B Chat is specifically tuned for instruction-following and conversational flows. It excels in environments where the model needs to adhere to a specific persona or follow structured system prompts. Because it was trained with a heavy emphasis on safety and alignment, it is less likely to generate "hallucinated" toxic content compared to unaligned base models, though this can sometimes lead to overly cautious refusals.
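Persona and system-prompt adherence depend on feeding the model Meta's official Llama 2 Chat template, built around `[INST]` and `<<SYS>>` markers. A minimal single-turn formatter is sketched below; note that tools like Ollama apply this template for you automatically, so you only need it when calling the raw model.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Format a single-turn prompt using Llama 2 Chat's [INST] template."""
    return (
        f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
        f"{user} [/INST]"
    )

prompt = build_llama2_prompt(
    system="You are a concise assistant. Answer in one sentence.",
    user="What is quantization?",
)
```

Deviating from this template (for example, using ChatML-style `<|im_start|>` markers) noticeably degrades instruction-following, since the model only saw this exact format during fine-tuning.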
While it can handle basic coding tasks in Python or JavaScript, it is not a specialized coding model. If your primary goal is software development, a model like CodeLlama or DeepSeek-Coder would be more appropriate.
The primary reason to run Llama 2 13B Chat locally is the ability to achieve high-quality inference on consumer-grade hardware. To do this effectively, you must understand the relationship between parameter count, quantization, and VRAM.
In its uncompressed 16-bit (FP16) state, the model requires approximately 26GB of VRAM just to load the weights. This exceeds the capacity of the flagship NVIDIA RTX 4090 (24GB). Therefore, almost all local practitioners use quantization (compressing the weights) to fit the model into available memory.
| Quantization Level | VRAM Requirement (Approx.) | Recommended Hardware |
| :--- | :--- | :--- |
| Q8_0 (8-bit) | ~14-15 GB | RTX 3090, 4090, 4060 Ti 16GB |
| Q4_K_M (4-bit) | ~8-9 GB | RTX 3060 12GB, 4070, 4070 Super |
| Q2_K (2-bit) | ~5-6 GB | Mid-range laptops, older 8GB GPUs |
For the vast majority of users, the best quantization for Llama 2 13B Chat is Q4_K_M. This 4-bit implementation offers a massive reduction in memory usage with a negligible hit to perplexity (accuracy).
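The VRAM figures above can be approximated from first principles: weight memory is parameter count times bits per weight, plus overhead for the KV cache and runtime buffers. The effective bits-per-weight values below are approximate rates for llama.cpp's K-quants (which mix precisions within each block), and the 5% overhead multiplier is a rough assumption.

```python
PARAMS = 13e9  # Llama 2 13B parameter count

# Approximate effective bits per weight for common GGUF quantizations.
BITS_PER_WEIGHT = {
    "Q2_K": 3.35,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def estimate_vram_gb(quant: str, overhead: float = 1.05) -> float:
    """Weight memory in GB, with a rough multiplier for runtime overhead."""
    bytes_total = PARAMS * BITS_PER_WEIGHT[quant] / 8
    return bytes_total * overhead / 1e9
```

Running this for Q4_K_M gives roughly 8.3 GB, in line with the ~8-9 GB figure in the table; the same arithmetic explains why FP16 (~27 GB) overflows a 24 GB RTX 4090.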
To achieve acceptable Llama 2 13B Chat performance, your choice of hardware is critical. A GPU with enough VRAM to hold the entire quantized model is the ideal setup. CPU-only inference is possible via llama.cpp, though Llama 2 13B Chat tokens per second will drop significantly (typically 2-5 t/s).

Using a tool like Ollama is the fastest way to get started. By running `ollama run llama2:13b`, the software automatically handles the quantization and memory allocation. On an RTX 3080 or better, you can expect Llama 2 13B Chat tokens per second to range between 40 and 70, which is faster than the average human reading speed.
When evaluating Llama 2 13B Chat hardware requirements and performance, it is helpful to compare it against its closest competitors in the open-weights space.
Llama 3 8B is the successor to the Llama 2 line. Despite having fewer parameters, Llama 3 8B generally outperforms Llama 2 13B in benchmarks due to being trained on a much larger and higher-quality dataset (15 trillion tokens vs 2 trillion).
Mistral 7B is a perennial favorite for local hosting. Despite having roughly half the parameters, it matches or exceeds Llama 2 13B on most standard benchmarks while requiring far less VRAM, making it the stronger choice when memory is tight.
Google's Gemma 2 9B is a more modern alternative. It utilizes a different architecture (sliding window attention in some versions) that allows it to punch significantly above its weight class. In 2025, Gemma 2 9B is generally considered more capable for general-purpose reasoning, but Llama 2 13B remains the "old reliable" for those who need a model with a massive community ecosystem and guaranteed compatibility with every local LLM loader in existence.