Meta's updated 70B dense model claiming Llama 3.1 405B-level performance. Strong all-around for chat, code, and reasoning. 128K context.
Copy and paste this command to start running the model locally:

ollama run llama3.3:70b
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 98.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 112.8 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 119.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 128.2 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 145.7 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 212.2 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level. All figures below are for the Q4_K_M quantization (112.8 GB):

| Device | Generation Speed | VRAM Used |
|---|---|---|
| NVIDIA B200 | 57.1 tok/s | 112.8 GB |
| NVIDIA H200 SXM 141GB | 34.3 tok/s | 112.8 GB |
Llama 3.3 70B Instruct is Meta’s most efficient high-performance model to date, designed to deliver the capabilities of the massive Llama 3.1 405B model within a much more accessible 70B parameter footprint. Released in late 2024, this model represents the current state-of-the-art for the 70B class, specifically tuned for dialogue, reasoning, and complex instruction-following. For local practitioners, it sits in the "prosumer" sweet spot: it is too large for a single consumer GPU at high precision, but it is the primary target for dual-GPU workstations and high-end Mac Studio configurations.
Unlike its predecessors, Llama 3.3 70B Instruct is not just an incremental update. Meta has refined the training recipe to push performance to levels that previously required significantly more compute. It competes directly with proprietary models like GPT-4o and Claude 3.5 Sonnet on several benchmarks, making it a premier choice for developers who need GPT-4 class intelligence without the privacy risks or latency of a cloud API. If you are looking for a local AI model in the 70B class in 2025, this is the industry standard.
The Llama 3.3 70B Instruct architecture follows a standard dense Transformer decoder-only design. While the industry has seen a shift toward Mixture-of-Experts (MoE) architectures to reduce inference costs, Meta has stuck with a dense 70B parameter model here. This architectural choice ensures consistent performance across all tokens and avoids the "expert routing" complexities found in models like Mixtral.
The 128k context window is a critical feature for local deployments. It allows for massive RAG (Retrieval-Augmented Generation) pipelines where entire technical documentations or code repositories can be injected into the prompt. Because it uses GQA, the memory overhead for the KV (Key-Value) cache is significantly reduced compared to standard Multi-Head Attention, though at 128k context, the KV cache will still consume a substantial amount of VRAM.
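A back-of-the-envelope calculation makes the GQA savings concrete. The sketch below uses the published Llama 3 70B architecture values (80 layers, 64 query heads, 8 KV heads, head dimension 128) and assumes an FP16 KV cache; many runtimes quantize the cache, which shrinks these numbers further.

```python
# Estimate the KV-cache footprint of Llama 3.3 70B at full 128K context.
# Architecture constants are the published Llama 3 70B values; the FP16
# cache dtype (2 bytes/element) is an assumption.

def kv_cache_bytes(n_tokens, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed for the K and V caches at a given context length."""
    # Factor of 2: one cache for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

ctx = 128 * 1024  # 131,072 tokens

gqa = kv_cache_bytes(ctx)                 # GQA: 8 KV heads
mha = kv_cache_bytes(ctx, n_kv_heads=64)  # hypothetical MHA: one KV head per query head

print(f"GQA KV cache @128K: {gqa / 2**30:.1f} GiB")  # 40.0 GiB
print(f"MHA KV cache @128K: {mha / 2**30:.1f} GiB")  # 320.0 GiB
```

With GQA the full-context cache is roughly 40 GiB instead of the 320 GiB a standard multi-head design would need, which is exactly why long contexts are feasible at all on workstation hardware.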
Llama 3.3 70B Instruct is a general-purpose powerhouse. Its instruction-following is significantly more robust than the Llama 3.1 70B, with fewer instances of "refusal" on complex but safe prompts and better adherence to system prompts.
This model is a top-tier choice for local software development. It excels at boilerplate generation, debugging, and explaining complex architectural patterns. On reasoning-heavy benchmarks, it handles logical branching and multi-step coding tasks with a level of nuance that smaller 8B or 14B models lack. It supports all major programming languages, including Python, Rust, C++, and TypeScript, and is particularly effective when used with local IDE integrations like Continue or Aider.
With native support for function-calling, Llama 3.3 70B is built for "agentic" use cases. It can reliably output structured JSON and determine when to call external tools (like a web search or a database query). This makes it the ideal "brain" for a local AI agent that needs to interact with your local file system or private APIs.
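To illustrate the agentic workflow, here is a minimal sketch of the two pieces you need: a tool declared in the OpenAI-style function schema that Ollama's chat endpoint accepts, and a helper that extracts the model's tool calls from an assistant message. The `web_search` tool is a made-up placeholder, and the exact response shape can vary by runtime, so treat this as an assumption-laden sketch rather than a definitive client.

```python
import json

# Hypothetical tool definition in the OpenAI-style function schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",  # placeholder tool, not a real API
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def extract_tool_calls(message):
    """Pull (name, arguments) pairs out of an assistant chat message."""
    calls = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # some runtimes return arguments as a JSON string
            args = json.loads(args)
        calls.append((fn["name"], args))
    return calls

# Example assistant message of the shape the model might return:
reply = {
    "role": "assistant",
    "tool_calls": [{"function": {"name": "web_search",
                                 "arguments": {"query": "llama 3.3 release date"}}}],
}
print(extract_tool_calls(reply))  # [('web_search', {'query': 'llama 3.3 release date'})]
```

In a real agent loop you would execute each extracted call, append the result as a `tool` role message, and send the conversation back to the model for its final answer.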
The model is fine-tuned for high proficiency in English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. For practitioners handling multilingual datasets, the model maintains its reasoning capabilities across these languages. Its massive context window also makes it a superior tool for long-form summarization, capable of processing a 100-page PDF in a single pass.
To run Llama 3.3 70B Instruct locally, you must account for the sheer size of the weights. A 70B parameter model in 16-bit precision (FP16) requires approximately 140GB of VRAM just to load the weights, which is beyond the reach of any consumer setup. However, through quantization, this model becomes highly performant on enthusiast hardware.
The amount of VRAM you need depends entirely on your chosen quantization level (see the table above).
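The arithmetic behind these figures is simple. The sketch below estimates weights-only memory for a 70B model; the bits-per-weight values for the quantized formats are typical effective averages (an assumption, not exact file sizes), and real deployments need additional headroom for the KV cache and runtime overhead.

```python
# Rough weights-only memory estimate for a 70B-parameter model.
# Quantized bits-per-weight are approximate effective averages for
# llama.cpp-style formats, not exact on-disk sizes.

PARAMS = 70e9

for name, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{name}: ~{gb:.0f} GB (weights only)")
```

At 16 bits per weight this lands on the ~140 GB figure cited above, while a 4-bit quantization brings the weights down into dual-GPU territory.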
The fastest way to deploy is via Ollama. Once installed, you can run:
ollama run llama3.3
This will default to a 4-bit quantized version, which is the most balanced option for typical Llama 3.3 70B Instruct hardware setups.
When deciding whether to deploy Llama 3.3 70B Instruct, it is helpful to compare it against its closest rivals in the open-weight space.
Qwen 2.5 72B (from Alibaba) is perhaps the strongest competitor, with a nearly identical VRAM footprint and particularly strong coding and math performance. Mistral Large 2 is a significantly larger model (123B parameters), but Llama 3.3 70B punches surprisingly close to its weight class.
There is almost no reason to use the older 3.1 version unless you have a specific fine-tune that hasn't been ported yet. Llama 3.3 is essentially a "drop-in" replacement that offers higher intelligence and better benchmark scores for the exact same VRAM cost.
In summary, Llama 3.3 70B Instruct is the definitive choice for local practitioners who have moved beyond the limitations of 8B models and have the hardware to support a 48GB VRAM footprint. It offers a "no-compromise" local AI experience, providing the reasoning depth and context handling required for professional-grade AI applications.