
TII's mid-size model. 40B dense, trained on 1T tokens of RefinedWeb data. One of the first truly capable open-weight models.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 16.0 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 24.4 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 28.4 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 33.2 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 43.2 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 81.2 GB | Full | Full 16-bit floating point — maximum quality, largest size |
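The figures above follow a simple rule of thumb: weights-only size is parameter count times bits-per-weight. The sketch below reproduces the table approximately; the bits-per-weight values are effective averages back-derived from the table rather than exact format constants, and real usage adds KV cache and activation overhead on top.

```python
# Rough weights-only VRAM estimate for a dense model at a given
# quantization level. Bits-per-weight values are approximate effective
# averages (assumptions derived from the table above, not exact spec
# constants); actual files add metadata and per-block scale overhead.
BITS_PER_WEIGHT = {
    "Q2_K": 3.2,
    "Q4_K_M": 4.9,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.6,
    "FP16": 16.2,
}

def estimate_vram_gb(n_params: float, fmt: str) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:7s} ~{estimate_vram_gb(40e9, fmt):5.1f} GB")
```

The same arithmetic explains why Q4_K_M is the sweet spot on 24 GB cards: it is the smallest level that keeps quality "Good" while the weights still nearly fit in VRAM.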
See which devices can run this model and at what quality level.
| Device | Tier | Throughput | VRAM Used |
|---|---|---|---|
| NVIDIA A100 SXM4 80GB | SS | 67.4 tok/s | 24.4 GB |
| NVIDIA H100 SXM5 80GB | SS | 110.7 tok/s | 24.4 GB |
| Google Cloud TPU v5p | SS | 91.4 tok/s | 24.4 GB |
| NVIDIA H200 SXM 141GB | SS | 158.7 tok/s | 24.4 GB |
| NVIDIA B200 | AA | 264.4 tok/s | 24.4 GB |
| NVIDIA L40S | AA | 28.6 tok/s | 24.4 GB |
Falcon 40B Instruct is a causal decoder-only model developed by the Technology Innovation Institute (TII) in Abu Dhabi. Built on a dense architecture with 40 billion parameters, it was trained on 1 trillion tokens of the RefinedWeb dataset, supplemented by curated corpora. As one of the first high-performance open-weight models released under the Apache 2.0 license, Falcon 40B Instruct established a middle ground for practitioners: it offers significantly more reasoning depth than 7B or 13B models while remaining more accessible for local deployment than 70B+ parameter giants.
In the current landscape of local AI, Falcon 40B Instruct serves as a robust choice for users who require a permissive license and a model capable of complex instruction-following without the massive hardware overhead of a 70B parameter model. While newer architectures have emerged, Falcon 40B’s dense structure and high-quality training data ensure it remains a reliable baseline for private, local chat and coding assistants.
Falcon 40B Instruct utilizes a dense transformer architecture. Unlike Mixture of Experts (MoE) models that only activate a fraction of their parameters during inference, Falcon 40B is a "dense" model, meaning all 40 billion parameters are active for every token generated. This results in high computational requirements but provides a level of consistency in reasoning that smaller dense models often lack.
A key technical feature of the Falcon family is its use of Multi-Query Attention (MQA). In standard multi-head attention, each head has its own key and value tensors. MQA shares a single key and value head across all query heads. For practitioners looking to run Falcon 40B Instruct locally, this architectural choice is significant because it drastically reduces the memory bandwidth required during the KV (Key-Value) cache lookup. This optimization allows for higher throughput and better scaling of inference speeds, even on hardware that might otherwise be bottlenecked by memory bus width.
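The bandwidth saving is easy to quantify: the KV cache scales with the number of key/value heads, so collapsing to a single shared head shrinks it by the full query-head count. The sketch below uses illustrative dimensions that approximate Falcon 40B's published config (60 layers, 128 heads of size 64) but should be treated as assumptions, not exact values.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # 2x for the key and value tensors, per layer, per cached position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative dimensions (assumed, close to Falcon 40B's config):
# 60 layers, 128 query heads of size 64, 2048-token context, fp16 cache.
mha = kv_cache_bytes(60, 128, 64, 2048)   # every head keeps its own K/V
mqa = kv_cache_bytes(60, 1, 64, 2048)     # one shared K/V head (multi-query)

print(f"MHA cache: {mha / 1e9:.2f} GB, MQA cache: {mqa / 1e6:.1f} MB, "
      f"reduction: {mha // mqa}x")
```

Under these assumptions a full-context MHA cache would cost about 4 GB, while the MQA cache stays in the tens of megabytes, which is why decode throughput scales so well on bandwidth-limited hardware.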
The model features a context length of 2,048 tokens. While this is shorter than the 8k or 32k windows seen in 2024 and 2025 releases, it is sufficient for standard chat interactions, single-file code generation, and short-form document summarization. The model's efficiency is rooted in the RefinedWeb dataset, which prioritized high-quality web data over sheer volume, leading to a model that "punches above its weight" in terms of raw parameter count.
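Because the window is only 2,048 tokens, a local wrapper should budget prompt length plus generation headroom. A minimal sketch, using a rough four-characters-per-token heuristic (an assumption; a real tokenizer gives exact counts):

```python
CONTEXT_WINDOW = 2048   # Falcon 40B's context length, in tokens
CHARS_PER_TOKEN = 4     # rough heuristic; use the model's tokenizer for exact counts

def fits_in_context(prompt: str, max_new_tokens: int = 256) -> bool:
    """Approximate check that the prompt plus generation budget fit the window."""
    approx_prompt_tokens = len(prompt) / CHARS_PER_TOKEN
    return approx_prompt_tokens + max_new_tokens <= CONTEXT_WINDOW

def truncate_to_fit(prompt: str, max_new_tokens: int = 256) -> str:
    """Keep the tail of the prompt (the most recent context) if it is too long."""
    budget_chars = (CONTEXT_WINDOW - max_new_tokens) * CHARS_PER_TOKEN
    return prompt[-budget_chars:]
```

Keeping the tail rather than the head preserves the most recent turns of a chat, which is usually what matters for a 2k-window model.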
Falcon 40B Instruct is fine-tuned specifically for assistant-style interactions. Its capabilities span three primary domains:
The model excels at nuanced dialogue and complex prompt execution. Because it was trained on a massive 1T-token dataset, it possesses a broad base of world knowledge, making it particularly effective for knowledge-heavy conversation, question answering, and short-form summarization.
Falcon 40B Instruct is a viable coding assistant for developers who need an offline pair programmer. While it may not match the specialized performance of a dedicated model like CodeLlama 70B, its general-purpose nature allows it to handle everyday programming tasks such as single-file code generation and code explanation.
TII designed Falcon to be proficient across several European languages, including English, German, Spanish, and French, with additional capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Hungarian. This makes it a strong candidate for local translation tasks or multilingual customer support simulations where cloud latency is unacceptable.
To successfully run Falcon 40B Instruct locally, hardware selection is the most critical factor. The primary bottleneck is Video RAM (VRAM). Because this is a 40B dense model, the weights alone require significant space before accounting for the KV cache and activation overhead.
The amount of VRAM needed depends entirely on the quantization level; we recommend the GGUF or EXL2 formats for local inference. The quantization table above lists the VRAM required at each level.
When choosing a GPU for Falcon 40B Instruct, your goal is to keep as many layers as possible on the GPU to maximize generation speed in tokens per second.
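In llama.cpp-based runners this is the `n_gpu_layers` (or `--n-gpu-layers`) setting. A minimal sketch of the budgeting arithmetic, with all sizes as illustrative assumptions:

```python
# Estimate how many of a model's layers fit on the GPU -- the number
# you would pass as n_gpu_layers in llama.cpp-based runners.
# Sizes below are assumptions for illustration.
def gpu_layer_budget(vram_gb: float, model_gb: float, n_layers: int,
                     reserve_gb: float = 2.0) -> int:
    """Layers that fit after reserving VRAM for KV cache and activations."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# A 24 GB card with 24.4 GB of Q4_K_M weights across 60 layers:
print(gpu_layer_budget(24.0, 24.4, 60))
```

Layers that do not fit fall back to system RAM, and every spilled layer costs throughput, which is why the jump from a 24 GB to a 48 GB card is so noticeable for this model.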
On a modern setup (e.g., RTX 4090 with 4-bit quantization), you can expect Falcon 40B Instruct performance to land between 8 and 15 tokens per second. On Apple Silicon (M2 Ultra), speeds often exceed 20 tokens per second due to high memory bandwidth.
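These numbers make sense once you treat decoding as memory-bandwidth-bound: each generated token must stream every active weight once, so tokens per second is capped at bandwidth divided by model size. A back-of-envelope sketch, where the bandwidth figures are approximate public specs (assumptions) and real-world results land well below the ceiling, especially when layers spill to system RAM:

```python
# Decode-speed ceiling for a dense model: every token reads all weights,
# so tok/s <= memory bandwidth / model size. Bandwidth figures are
# approximate published specs (assumptions); measured throughput is
# lower due to compute overhead and any CPU offload.
def max_tok_per_s(bandwidth_gbps: float, model_gb: float) -> float:
    return bandwidth_gbps / model_gb

MODEL_GB = 24.4  # Falcon 40B at Q4_K_M
for device, bw in [("RTX 4090", 1008), ("M2 Ultra", 800), ("H100 SXM5", 3350)]:
    print(f"{device:10s} ceiling ~{max_tok_per_s(bw, MODEL_GB):6.1f} tok/s")
```

The RTX 4090's 8 to 15 tok/s sits far below its ~41 tok/s ceiling because the 24.4 GB model does not fully fit in 24 GB of VRAM, while the M2 Ultra's unified memory holds the whole model and gets closer to its bound.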
The quickest way to deploy is via Ollama. Simply run:

```shell
ollama run falcon:40b
```

The instruct-tuned variant is also tagged separately as `falcon:40b-instruct` in the Ollama library. Ollama automatically handles quantization and hardware acceleration settings for your specific machine. For more granular control over VRAM allocation, use LM Studio or Text-Generation-WebUI.
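Beyond the CLI, Ollama also exposes a local REST API (default port 11434) that scripts can call. A minimal stdlib-only client sketch, assuming the `falcon:40b` tag pulled above and a running `ollama serve`:

```python
import json
from urllib import request

def build_payload(prompt: str, model: str = "falcon:40b") -> dict:
    # stream=False returns one JSON object instead of a line-delimited stream.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    """Send a prompt to Ollama's /api/generate route and return the text."""
    payload = json.dumps(build_payload(prompt)).encode()
    req = request.Request(f"{host}/api/generate", data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires `ollama serve` to be running
        return json.loads(resp.read())["response"]

# Example (with the server running):
# print(generate("Why is the sky blue? Answer in one sentence."))
```

This is handy for wiring the model into local tools without depending on any third-party client library.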
When evaluating Falcon 40B Instruct against other local models in the 40B-parameter class as of 2025, it occupies a specific niche.
Llama 3 70B is a more modern model with a much larger context window (8k+) and generally higher scores on logic benchmarks. However, Llama 3 70B requires significantly more VRAM (~40GB for 4-bit). If you are constrained to a single 24GB GPU or a 32GB Mac, Falcon 40B is often the largest "smart" model you can reasonably fit, whereas Llama 3 70B would require heavy 2-bit quantization that degrades performance too much.
Mixtral 8x7B is a Mixture of Experts model. While it has ~47B total parameters, it only uses ~13B per token, making it faster during inference than the dense Falcon 40B. Mixtral generally outperforms Falcon in coding and complex reasoning. However, Falcon 40B's dense architecture can sometimes feel more "stable" in its creative writing and follows strict formatting instructions more consistently in certain edge cases. Practitioners often choose Falcon 40B when their specific workload benefits from a dense parameter distribution rather than the sparse activation of an MoE.
A common question is why you would run a 40B model on a consumer GPU at all when smaller 8B models are so capable. While Llama 3 8B is faster and fits on almost any modern GPU, Falcon 40B Instruct maintains a clear advantage in world knowledge and the ability to handle more complex, multi-step instructions without hallucinating. If your task requires deep internal knowledge rather than just linguistic fluency, the 40B parameter count remains a significant asset.