
Technology Innovation Institute's largest model: 180B dense parameters, trained on 3.5T tokens from the RefinedWeb dataset. It was the top open model on the Hugging Face leaderboard at release.
The table below shows how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 70.0 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 107.8 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 125.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 147.4 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 192.4 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 363.4 GB | Full | Full 16-bit floating point: maximum quality, largest size |
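A quick sanity check on the table: dividing each footprint by the 180B parameter count recovers the approximate bits-per-weight of each format. A minimal sketch (the sizes are the VRAM figures above, so the results run slightly high of the nominal quantization width, since VRAM also includes KV cache and runtime overhead):

```python
# Estimate effective bits per weight for each quantization level.
# Assumption: 180e9 parameters; sizes taken from the VRAM column above.

PARAMS = 180e9

sizes_gb = {
    "Q2_K": 70.0,
    "Q4_K_M": 107.8,
    "Q5_K_M": 125.8,
    "Q6_K": 147.4,
    "Q8_0": 192.4,
    "FP16": 363.4,
}

def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Convert a total footprint in GB to average bits per parameter."""
    return size_gb * 1e9 * 8 / params

for fmt, gb in sizes_gb.items():
    print(f"{fmt:7s} ~{bits_per_weight(gb):.1f} bits/weight")
```

Q4_K_M works out to roughly 4.8 bits per weight, which is why it is the usual sweet spot: about 3.4x smaller than FP16 with only modest quality loss.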
See which devices can run this model and at what quality level.
All benchmarks below were run at the recommended Q4_K_M quantization (107.8 GB footprint). The NVIDIA B200 reaches 39.6–59.7 tok/s and the NVIDIA H200 SXM 141GB reaches 35.8 tok/s, while the remaining benchmarked configurations span roughly 2–7 tok/s.
Falcon 180B, developed by the Technology Innovation Institute (TII) in Abu Dhabi, represents one of the most significant milestones in open-access large language models (LLMs). At 180 billion parameters, it is a massive dense model trained on 3.5 trillion tokens from the RefinedWeb dataset. Upon its release, it claimed the top spot on the Hugging Face Open LLM Leaderboard, outperforming competitors like Llama 2 70B and rivaling proprietary models like GPT-3.5 in specific reasoning tasks.
For practitioners looking to run Falcon 180B locally, the primary challenge is the sheer scale of the weights. Unlike Mixture-of-Experts (MoE) architectures that only activate a fraction of their parameters per token, Falcon 180B is a dense model. This means every inference pass utilizes all 180 billion parameters, demanding significant compute and memory bandwidth. It is designed for high-end workstations and enterprise-grade local deployments where data privacy and model sovereignty are non-negotiable.
The architecture of Falcon 180B is a causal decoder-only transformer, optimized for massive-scale inference. It builds upon the foundations laid by the earlier Falcon 7B and 40B models but scales the depth and width significantly.
The use of Multi-Query Attention is a critical technical choice for a model of this size. By sharing key and value projections across all attention heads in a group, Falcon 180B sharply reduces the memory footprint of the KV (Key-Value) cache. This optimization allows larger batch sizes and faster inference than a standard Multi-Head Attention model of similar scale.
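The KV-cache saving is easy to quantify. The sketch below compares full Multi-Head Attention (one K/V pair per head) against the extreme Multi-Query case (one shared K/V pair); the layer and head counts are illustrative assumptions for a Falcon-180B-scale model, not the exact published config:

```python
# KV-cache size: multi-head vs. multi-query attention.
# Shapes below are assumptions for a 180B-scale dense model, fp16 cache.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per: int = 2) -> float:
    """Bytes for the K and V tensors across all layers, in GB."""
    # Factor of 2 at the front: one tensor for K, one for V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

layers, heads, head_dim = 80, 232, 64   # assumed model shapes
ctx, batch = 2048, 32                   # full context window, large batch

mha = kv_cache_gb(layers, n_kv_heads=heads, head_dim=head_dim, seq_len=ctx, batch=batch)
mqa = kv_cache_gb(layers, n_kv_heads=1, head_dim=head_dim, seq_len=ctx, batch=batch)
print(f"MHA: {mha:.1f} GB, MQA: {mqa:.2f} GB ({mha / mqa:.0f}x smaller)")
```

With these assumed shapes, a full MHA cache at batch 32 would exceed the model weights themselves, while the multi-query cache stays around a gigabyte; that gap is what makes high-throughput serving of a 180B dense model feasible at all.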
However, the 2,048-token context length is relatively short by 2025 standards. While this makes the model less suitable for long-document summarization or "chatting with your PDF" workflows, it remains highly effective for discrete tasks, code generation, and complex reasoning within that window.
Falcon 180B is a general-purpose model with strong capabilities in chat, coding, and multilingual tasks. Because it was trained on the RefinedWeb dataset—which prioritizes high-quality web data over sheer volume—it exhibits a high level of factual density and reasoning capability.
As a large dense option for local deployment, Falcon 180B excels at multi-step logic and nuanced instruction following. It is particularly effective as a backbone for internal corporate chatbots that require a high degree of "common sense" and a broad knowledge base without relying on external APIs.
The model demonstrates strong proficiency in Python, Java, C++, and JavaScript. While dedicated coding models like DeepSeek-Coder might outperform it in niche syntax, Falcon 180B’s advantage lies in its ability to understand the intent behind the code. It is an excellent choice for local code refactoring and generating documentation for complex legacy systems where data cannot leave the local network.
Falcon 180B was trained to be proficient in English, German, Spanish, and French, with additional capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish. This makes it a viable candidate for European enterprises requiring a local model that understands regional nuances and technical terminology in multiple languages.
Falcon 180B's hardware requirements are among the highest in the open-model ecosystem, surpassed only by Llama 3 405B. You cannot run this model on a standard consumer laptop or a single mid-range GPU.
To determine the best GPU for Falcon 180B, you must first decide on your quantization level. Running the model in full FP16 precision requires roughly 360GB of VRAM, which is only possible on multi-H100/A100 clusters. For local practitioners, quantization is mandatory.
If you want to run a 180B model on consumer GPUs, the answer is parallelization: splitting the quantized weights across multiple cards.
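A back-of-the-envelope way to size such a setup is to divide the quantized footprint by per-card VRAM, with some headroom for the KV cache and runtime buffers. A minimal sketch, where the 15% overhead factor is a rough assumption:

```python
import math

def gpus_needed(model_gb: float, vram_per_gpu_gb: float, overhead: float = 1.15) -> int:
    """Minimum GPU count to hold the weights plus ~15% headroom for
    KV cache and runtime buffers (the overhead factor is an assumption)."""
    return math.ceil(model_gb * overhead / vram_per_gpu_gb)

# Q4_K_M footprint from the quantization table, split across common cards:
for name, vram in [("RTX 4090 (24 GB)", 24), ("A100/H100 80GB", 80)]:
    print(f"{name}: {gpus_needed(107.8, vram)} cards")
```

By this estimate the Q4_K_M quant needs on the order of six 24 GB consumer cards, or two 80 GB datacenter GPUs, before accounting for tensor-parallel communication overhead.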
On a high-end Mac Studio (M2 Ultra), you can expect throughput to hover between 2 and 4 tokens per second. On a multi-GPU 4090 setup using llama.cpp or vLLM, you might see 4-8 tokens per second. While this is slow compared to 7B models, it is usable for asynchronous tasks or deep reasoning where quality is more important than speed.
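Those numbers are consistent with the usual rule of thumb for dense models: single-stream decoding is memory-bandwidth-bound, since every generated token streams all the weights from memory, so an upper bound on speed is bandwidth divided by model size. A sketch using public bandwidth specs (real throughput lands well under this ceiling):

```python
# Rough decoding ceiling for a dense model: bandwidth / weight bytes.
# Bandwidth figures are the devices' published specs; the 107.8 GB
# figure is the Q4_K_M footprint from the table above.

def max_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical upper bound on single-stream decode speed."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 107.8  # Q4_K_M
for device, bw in [("M2 Ultra (800 GB/s)", 800), ("RTX 4090 (1008 GB/s)", 1008)]:
    print(f"{device}: <= {max_tok_per_s(bw, MODEL_GB):.1f} tok/s theoretical")
```

The M2 Ultra's ~7.4 tok/s ceiling versus the observed 2-4 tok/s shows how much is lost to compute, cache misses, and framework overhead; multi-GPU rigs close some of that gap by adding aggregate bandwidth.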
The quickest way to get started is using Ollama. Once you have the necessary VRAM, you can simply run `ollama run falcon:180b` to pull the library's default quantization and begin testing.
When evaluating Falcon 180B, it is helpful to compare it against its closest competitors in the high-parameter space.
Llama 3 70B is a much newer model. Despite having fewer parameters, Llama 3 70B often matches or exceeds Falcon 180B's performance on modern benchmarks due to more advanced training techniques and a significantly larger training token count (15T vs 3.5T).
Grok-1 is a 314B parameter MoE model. While Grok-1 is "larger," its MoE architecture means it only uses about 86B active parameters per token.
Llama 3 405B is the current king of open weights. It is significantly more capable than Falcon 180B across all benchmarks but requires nearly double the VRAM (roughly 230GB for a 4-bit quant).
Falcon 180B remains a formidable choice for local deployment. Its density ensures that it utilizes its full parameter count for every query, providing a level of "brute force" intelligence that is still highly relevant for complex, local-first AI applications.