
TII's mid-size model. 40B dense, trained on 1T tokens of RefinedWeb data. One of the first truly capable open-weight models.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 16.0 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 24.4 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 28.4 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 33.2 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 43.2 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 81.2 GB | Full | Full 16-bit floating point — maximum quality, largest size |
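The figures above follow a simple rule of thumb: weights-only size is parameter count times bits-per-weight. The sketch below reproduces the table approximately; the bits-per-weight values are effective averages back-derived from the table rather than exact format constants, and real usage adds KV cache and activation overhead on top.

```python
# Rough weights-only VRAM estimate for a dense model at a given
# quantization level. Bits-per-weight values are approximate effective
# averages (assumptions derived from the table above, not exact spec
# constants); actual files add metadata and per-block scale overhead.
BITS_PER_WEIGHT = {
    "Q2_K": 3.2,
    "Q4_K_M": 4.9,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.6,
    "FP16": 16.2,
}

def estimate_vram_gb(n_params: float, fmt: str) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:7s} ~{estimate_vram_gb(40e9, fmt):5.1f} GB")
```

The same arithmetic explains why Q4_K_M is the sweet spot on 24 GB cards: it is the smallest level that keeps quality "Good" while the weights still nearly fit in VRAM.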
See which devices can run this model and at what quality level.
| Device | Tier | Throughput | VRAM Used |
|---|---|---|---|
| NVIDIA A100 SXM4 80GB | SS | 67.4 tok/s | 24.4 GB |
| NVIDIA H100 SXM5 80GB | SS | 110.7 tok/s | 24.4 GB |
| Google Cloud TPU v5p | SS | 91.4 tok/s | 24.4 GB |
| NVIDIA H200 SXM 141GB | SS | 158.7 tok/s | 24.4 GB |
| NVIDIA B200 | AA | 264.4 tok/s | 24.4 GB |
| NVIDIA L40S | AA | 28.6 tok/s | 24.4 GB |
Falcon 40B Instruct is a causal decoder-only model developed by the Technology Innovation Institute (TII) in Abu Dhabi. Built on a dense architecture with 40 billion parameters, it was trained on 1 trillion tokens of the RefinedWeb dataset, supplemented by curated corpora. As one of the first high-performance open-weight models released under the Apache 2.0 license, Falcon 40B Instruct established a middle ground for practitioners: it offers significantly more reasoning depth than 7B or 13B models while remaining more accessible for local deployment than 70B+ parameter giants.
In the current landscape of local AI, Falcon 40B Instruct serves as a robust choice for users who require a permissive license and a model capable of complex instruction-following without the massive hardware overhead of a 70B parameter model. While newer architectures have emerged, Falcon 40B’s dense structure and high-quality training data ensure it remains a reliable baseline for private, local chat and coding assistants.
Falcon 40B Instruct utilizes a dense transformer architecture. Unlike Mixture of Experts (MoE) models that only activate a fraction of their parameters during inference, Falcon 40B is a "dense" model, meaning all 40 billion parameters are active for every token generated. This results in high computational requirements but provides a level of consistency in reasoning that smaller dense models often lack.
A key technical feature of the Falcon family is its use of Multi-Query Attention (MQA). In standard multi-head attention, each head has its own key and value tensors. MQA shares a single key and value head across all query heads. For practitioners looking to run Falcon 40B Instruct locally, this architectural choice is significant because it drastically reduces the memory bandwidth required during the KV (Key-Value) cache lookup. This optimization allows for higher throughput and better scaling of inference speeds, even on hardware that might otherwise be bottlenecked by memory bus width.
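The bandwidth saving is easy to quantify: the KV cache scales with the number of key/value heads, so collapsing to a single shared head shrinks it by the full query-head count. The sketch below uses illustrative dimensions that approximate Falcon 40B's published config (60 layers, 128 heads of size 64) but should be treated as assumptions, not exact values.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # 2x for the key and value tensors, per layer, per cached position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative dimensions (assumed, close to Falcon 40B's config):
# 60 layers, 128 query heads of size 64, 2048-token context, fp16 cache.
mha = kv_cache_bytes(60, 128, 64, 2048)   # every head keeps its own K/V
mqa = kv_cache_bytes(60, 1, 64, 2048)     # one shared K/V head (multi-query)

print(f"MHA cache: {mha / 1e9:.2f} GB, MQA cache: {mqa / 1e6:.1f} MB, "
      f"reduction: {mha // mqa}x")
```

Under these assumptions a full-context MHA cache would cost about 4 GB, while the MQA cache stays in the tens of megabytes, which is why decode throughput scales so well on bandwidth-limited hardware.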
The model features a context length of 2,048 tokens. While this is shorter than the 8k or 32k windows seen in 2024 and 2025 releases, it is sufficient for standard chat interactions, single-file code generation, and short-form document summarization. The model's efficiency is rooted in the RefinedWeb dataset, which prioritized high-quality web data over sheer volume, leading to a model that "punches above its weight" in terms of raw parameter count.
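Because the window is only 2,048 tokens, a local wrapper should budget prompt length plus generation headroom. A minimal sketch, using a rough four-characters-per-token heuristic (an assumption; a real tokenizer gives exact counts):

```python
CONTEXT_WINDOW = 2048   # Falcon 40B's context length, in tokens
CHARS_PER_TOKEN = 4     # rough heuristic; use the model's tokenizer for exact counts

def fits_in_context(prompt: str, max_new_tokens: int = 256) -> bool:
    """Approximate check that the prompt plus generation budget fit the window."""
    approx_prompt_tokens = len(prompt) / CHARS_PER_TOKEN
    return approx_prompt_tokens + max_new_tokens <= CONTEXT_WINDOW

def truncate_to_fit(prompt: str, max_new_tokens: int = 256) -> str:
    """Keep the tail of the prompt (the most recent context) if it is too long."""
    budget_chars = (CONTEXT_WINDOW - max_new_tokens) * CHARS_PER_TOKEN
    return prompt[-budget_chars:]
```

Keeping the tail rather than the head preserves the most recent turns of a chat, which is usually what matters for a 2k-window model.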
Falcon 40B Instruct is fine-tuned specifically for assistant-style interactions. Its capabilities span three primary domains:
The model excels at nuanced dialogue and complex prompt execution. Because it was trained on a massive 1T-token dataset, it possesses a broad base of world knowledge, making it particularly effective for knowledge-heavy conversation, question answering, and short-form summarization.
Falcon 40B Instruct is a viable coding assistant for developers who need an offline pair programmer. While it may not match the specialized performance of a dedicated model like CodeLlama 70B, its general-purpose nature allows it to handle everyday programming tasks such as single-file code generation and code explanation.
TII designed Falcon to be proficient across several European languages, including English, German, Spanish, and French, with additional capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Hungarian. This makes it a strong candidate for local translation tasks or multilingual customer support simulations where cloud latency is unacceptable.
To successfully run Falcon 40B Instruct locally, hardware selection is the most critical factor. The primary bottleneck is Video RAM (VRAM). Because this is a 40B dense model, the weights alone require significant space before accounting for the KV cache and activation overhead.
The amount of VRAM needed depends entirely on the quantization level; we recommend the GGUF or EXL2 formats for local inference. The quantization table above lists the VRAM required at each level.
When choosing a GPU for Falcon 40B Instruct, your goal is to keep as many layers as possible on the GPU to maximize generation speed in tokens per second.
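In llama.cpp-based runners this is the `n_gpu_layers` (or `--n-gpu-layers`) setting. A minimal sketch of the budgeting arithmetic, with all sizes as illustrative assumptions:

```python
# Estimate how many of a model's layers fit on the GPU -- the number
# you would pass as n_gpu_layers in llama.cpp-based runners.
# Sizes below are assumptions for illustration.
def gpu_layer_budget(vram_gb: float, model_gb: float, n_layers: int,
                     reserve_gb: float = 2.0) -> int:
    """Layers that fit after reserving VRAM for KV cache and activations."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# A 24 GB card with 24.4 GB of Q4_K_M weights across 60 layers:
print(gpu_layer_budget(24.0, 24.4, 60))
```

Layers that do not fit fall back to system RAM, and every spilled layer costs throughput, which is why the jump from a 24 GB to a 48 GB card is so noticeable for this model.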
On a modern setup (e.g., RTX 4090 with 4-bit quantization), you can expect Falcon 40B Instruct performance to land between 8 and 15 tokens per second. On Apple Silicon (M2 Ultra), speeds often exceed 20 tokens per second due to high memory bandwidth.
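These numbers make sense once you treat decoding as memory-bandwidth-bound: each generated token must stream every active weight once, so tokens per second is capped at bandwidth divided by model size. A back-of-envelope sketch, where the bandwidth figures are approximate public specs (assumptions) and real-world results land well below the ceiling, especially when layers spill to system RAM:

```python
# Decode-speed ceiling for a dense model: every token reads all weights,
# so tok/s <= memory bandwidth / model size. Bandwidth figures are
# approximate published specs (assumptions); measured throughput is
# lower due to compute overhead and any CPU offload.
def max_tok_per_s(bandwidth_gbps: float, model_gb: float) -> float:
    return bandwidth_gbps / model_gb

MODEL_GB = 24.4  # Falcon 40B at Q4_K_M
for device, bw in [("RTX 4090", 1008), ("M2 Ultra", 800), ("H100 SXM5", 3350)]:
    print(f"{device:10s} ceiling ~{max_tok_per_s(bw, MODEL_GB):6.1f} tok/s")
```

The RTX 4090's 8 to 15 tok/s sits far below its ~41 tok/s ceiling because the 24.4 GB model does not fully fit in 24 GB of VRAM, while the M2 Ultra's unified memory holds the whole model and gets closer to its bound.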
The quickest way to deploy is via Ollama. Simply run:

```shell
ollama run falcon:40b
```

The instruct-tuned variant is also tagged separately as `falcon:40b-instruct` in the Ollama library. Ollama automatically handles quantization and hardware acceleration settings for your specific machine. For more granular control over VRAM allocation, use LM Studio or Text-Generation-WebUI.
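Beyond the CLI, Ollama also exposes a local REST API (default port 11434) that scripts can call. A minimal stdlib-only client sketch, assuming the `falcon:40b` tag pulled above and a running `ollama serve`:

```python
import json
from urllib import request

def build_payload(prompt: str, model: str = "falcon:40b") -> dict:
    # stream=False returns one JSON object instead of a line-delimited stream.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    """Send a prompt to Ollama's /api/generate route and return the text."""
    payload = json.dumps(build_payload(prompt)).encode()
    req = request.Request(f"{host}/api/generate", data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires `ollama serve` to be running
        return json.loads(resp.read())["response"]

# Example (with the server running):
# print(generate("Why is the sky blue? Answer in one sentence."))
```

This is handy for wiring the model into local tools without depending on any third-party client library.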
When evaluating Falcon 40B Instruct against other local models in the 40B-parameter class as of 2025, it occupies a specific niche.
Llama 3 70B is a more modern model with a much larger context window (8k+) and generally higher scores on logic benchmarks. However, Llama 3 70B requires significantly more VRAM (~40GB for 4-bit). If you are constrained to a single 24GB GPU or a 32GB Mac, Falcon 40B is often the largest "smart" model you can reasonably fit, whereas Llama 3 70B would require heavy 2-bit quantization that degrades performance too much.
Mixtral 8x7B is a Mixture of Experts model. While it has ~47B total parameters, it only uses ~13B per token, making it faster during inference than the dense Falcon 40B. Mixtral generally outperforms Falcon in coding and complex reasoning. However, Falcon 40B's dense architecture can sometimes feel more "stable" in its creative writing and follows strict formatting instructions more consistently in certain edge cases. Practitioners often choose Falcon 40B when their specific workload benefits from a dense parameter distribution rather than the sparse activation of an MoE.
A common question is why you would run a 40B model on a consumer GPU at all when smaller 8B models are so capable. While Llama 3 8B is faster and fits on almost any modern GPU, Falcon 40B Instruct maintains a clear advantage in world knowledge and the ability to handle more complex, multi-step instructions without hallucinating. If your task requires deep internal knowledge rather than just linguistic fluency, the 40B parameter count remains a significant asset.