Liquid AI

LFM2.5-8B-A1B

Liquid AI's on-device Mixture-of-Experts model with 8.3B total parameters and about 1.5B active per token, built on the LFM2.5 hybrid architecture of gated short-convolution and grouped-query attention blocks (24 layers in total). It is a text-only reasoning model that writes an explicit chain of thought before its final answer, supports a 128K-token context, and was pretrained on 38 trillion tokens. Liquid positions it for agentic tool use and private on-device assistants, citing 91.84 on IFEval, 88.76 on MATH500, and 88.07 on Tau² Telecom. The open-weight model runs fully on phones, laptops, and PCs and ships under the LFM Open License v1.0.

8.3B paramsMoE128K ctx

View on Hugging Face Official Page

Our Take

Best for: Strongest at IFBench in its size class

A workable 8.3B-parameter MoE language model from Liquid AI. Pulls ahead on IFBench (56/100), so reach for it when that's the dimension that matters.

Run this onAMD Radeon RX 7600 8GBCheapest card in our directory with comfortable headroom (8 GB) for this model at Q4 (~2.9 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Reasoning

Function Calling

Math

Multilingual

Instruction Following

Model Specifications

Parameters8.3B

Active Params1.5B

ArchitectureMoE

Context Length128K tokens

ModalityText Only

ProviderLiquid AI

Download Size17.0 GB

Community

Monthly Downloads200.9K

Likes638

Last Updated17 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

LFM Open License v1.0View Full License

Performance & Scoring

Benchmarks

GPQA

51.3

HLE

6.9

AA Intelligence Index

8.3

7.8

4.5

55.6

0.0

MBA Open Score

46.1CC

Benchmark40%

19.2

Popularity25%

38.3

Efficiency20%

91.5

Versatility15%

70.0

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	2.6 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	2.9 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	3.1 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	3.2 GB	Excellent	Near-lossless quality with manageable size
Q8_0	3.6 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	5.0 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


AMD Radeon RX 7600 8GBAMD	SS	79.8 tok/s	2.9 GB
NVIDIA GeForce RTX 4060NVIDIA	SS	75.3 tok/s	2.9 GB
NVIDIA GeForce RTX 5060 Ti 8GBNVIDIA	SS	124.1 tok/s	2.9 GB
AMD Radeon RX 7700 XTAMD	SS	119.7 tok/s	2.9 GB
Intel Arc B580Intel	SS	126.3 tok/s	2.9 GB
NVIDIA GeForce RTX 4070NVIDIA	SS	139.6 tok/s	2.9 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	SS	139.6 tok/s	2.9 GB
NVIDIA GeForce RTX 5070NVIDIA	SS	186.2 tok/s	2.9 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	AA	141.8 tok/s	2.9 GB
AMD Radeon RX 7800 XTAMD	AA	172.9 tok/s	2.9 GB
AMD Radeon RX 9070AMD	AA	177.3 tok/s	2.9 GB
AMD Radeon RX 9070 XTAMD	AA	177.3 tok/s	2.9 GB
Google Cloud TPU v5eGoogle	AA	226.9 tok/s	2.9 GB
Intel Arc A770 16GBIntel	AA	155.1 tok/s	2.9 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	AA	265.9 tok/s	2.9 GB
NVIDIA GeForce RTX 4060 Ti 16GBNVIDIA	AA	79.8 tok/s	2.9 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	AA	186.2 tok/s	2.9 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	AA	203.9 tok/s	2.9 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	AA	124.1 tok/s	2.9 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	AA	248.2 tok/s	2.9 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	AA	265.9 tok/s	2.9 GB
AMD Radeon RX 7900 XTAMD	AA	221.6 tok/s	2.9 GB
AMD Radeon RX 7900 XTXAMD	AA	265.9 tok/s	2.9 GB
NVIDIA GeForce RTX 3090NVIDIA	AA	259.3 tok/s	2.9 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	AA	279.2 tok/s	2.9 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~80 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)LFM2.5-8B-A1B on AMD Radeon RX 7600 8GB · ~80 tok/s · 165W	$0.069
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 3 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

LFM2.5-8B-A1B is Liquid AI's second-generation on-device Mixture-of-Experts model, designed explicitly for local inference on consumer hardware. With 8.3 billion total parameters and only 1.5 billion active per token, it occupies a unique position: it delivers quality competitive with dense 7B-8B models while requiring a fraction of the compute budget.

Liquid built this model for agentic workflows — tool calling, multi-step reasoning, and instruction following on devices ranging from phones to laptops. It is a text-only reasoning model that produces an explicit chain of thought before its final answer, a design choice that leverages the MoE architecture's compute-bound nature to add quality without sacrificing speed.

The model was pretrained on 38 trillion tokens, up from 12 trillion in its predecessor LFM2-8B-A1B, and underwent large-scale reinforcement learning. The result is a model that scores 91.84 on IFEval, 88.76 on MATH500, and 88.07 on Tau² Telecom — numbers that put it in striking distance of much larger models while running entirely on-device.

Architecture & Technical Details

LFM2.5-8B-A1B uses a hybrid architecture combining gated short-convolution blocks with grouped-query attention (GQA). The model has 24 layers total: 18 double-gated LIV convolution blocks and 6 GQA blocks. This design, inherited from LFM2, balances local pattern extraction with global attention.

The MoE configuration is what makes this model practical for local use. Out of 8.3B total parameters, only 1.5B are active for any given token. This means inference speed is closer to what you'd expect from a 1.5B dense model, while the model retains the representational capacity of an 8B model. For practitioners, this translates directly to lower VRAM requirements and higher tokens per second compared to dense models of similar quality.

Key specs:

Total parameters: 8.3B
Active parameters: 1.5B
Architecture: MoE with hybrid gated short-convolution + GQA
Context length: 128,000 tokens
Vocabulary size: 128,000 tokens
Training data: 38 trillion tokens
Modality: Text-only

The expanded 128K context window (up from 32K in LFM2-8B-A1B) enables processing of long documents, multi-turn conversations, and extended reasoning chains. The vocabulary was doubled to 128K tokens to improve tokenization efficiency for non-Latin scripts — Liquid reports strong compression gains for Hindi, Thai, Vietnamese, Indonesian, and Arabic.

Unlike its predecessor, LFM2.5-8B-A1B is a reasoning-only model. It generates an explicit chain of thought before its final answer. This is not a gimmick: MoE models are typically compute-bound, and the small active parameter count makes each reasoning token cheap. The quality improvement from reasoning is substantial — the model's AA-Omniscience Index jumped from -78.42 to -24.70, with the non-hallucination rate improving from 7.46% to 63.47%.

Capabilities & Use Cases

LFM2.5-8B-A1B is a general-purpose text model, but it has clear strengths and weaknesses.

Strengths:

Tool calling and function-calling: This is the primary use case. The model scores 64.36 on BFCLv3 and 48.50 on BFCLv4. It can chain multiple tool calls and follow complex instruction sequences.
Instruction following: 91.84 on IFEval, 79.93 on Multi-IF. It handles constrained generation and structured outputs reliably.
Math and reasoning: 88.76 on MATH500, 42.53 on AIME25. The explicit chain-of-thought helps with multi-step problems.
Multilingual support: English, Arabic, Chinese, French, German, Italian, Japanese, Korean, Portuguese, Spanish. The expanded tokenizer makes it more efficient for non-Latin scripts than its predecessor.
On-device agents: Designed for personal assistants that run locally, with day-one support for llama.cpp, MLX, vLLM, and SGLang.

Weaknesses:

Not ideal for heavy programming or knowledge-intensive QA without retrieval augmentation.
The reasoning requirement means every response includes a chain of thought, which adds latency for simple queries.
Being a reasoning model, it may over-think straightforward questions.

Concrete use cases:

Local agent that calls APIs, reads files, and executes commands based on natural language instructions
Multilingual customer-facing assistant running on a laptop or tablet
Document analysis with retrieval-augmented generation (RAG) using the 128K context window
Privacy-sensitive applications where data cannot leave the device

Running LFM2.5-8B-A1B Locally

This is where LFM2.5-8B-A1B justifies its existence. With only 1.5B active parameters, it runs on hardware that would struggle with dense 7B models.

Hardware Requirements

Minimum (quantized, CPU inference):

8GB system RAM
Any modern CPU with AVX2 support
Expect 1-3 tokens/second on CPU with Q4_K_M quantization

Recommended (GPU inference):

NVIDIA RTX 3060 12GB or better (12GB VRAM minimum for FP16)
Apple Silicon Mac with 16GB+ unified memory (M2 Pro or better)
AMD Radeon RX 6800 XT or better (16GB VRAM)

Ideal:

RTX 4090 (24GB VRAM): runs full FP16 with room to spare
M4 Max (64GB+ unified memory): runs FP16 comfortably
Any GPU with 16GB+ VRAM for Q4_K_M at high throughput

Quantization and VRAM

Quantization	VRAM Required	Quality Retention	Recommended Use
FP16	~16GB	100%	GPU with 24GB+ VRAM
Q8_0	~8.5GB	~99%	GPU with 12-16GB VRAM
Q4_K_M	~5.5GB	~95%	Best balance for most users
Q3_K_M	~4.5GB	~90%	8GB VRAM GPUs
Q2_K	~3.5GB	~85%	6GB VRAM or CPU inference

For most practitioners, Q4_K_M is the sweet spot. It retains 95% of FP16 quality while using only 5.5GB VRAM, leaving room for the KV cache and context. At this quantization, you can run the model on an RTX 3060 12GB, RTX 4060 Ti 16GB, or any Apple Silicon Mac with 16GB+ unified memory.

Performance Expectations

On an RTX 4090 with Q4_K_M:

50-80 tokens/second (prompt processing)
30-50 tokens/second (generation)

On an M4 Max (64GB) with MLX Q8:

40-60 tokens/second

On CPU with Q4_K_M (16-core modern CPU):

3-6 tokens/second

Quick Start

The fastest way to get up and running is via Ollama:

1ollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF:Q4_K_M

For llama.cpp directly:

1llama-cli -hf LiquidAI/LFM2.5-8B-A1B-GGUF -c 4096 --color -i --temp 0.2 --top-k 80 --repeat-penalty 1.05

For Apple Silicon, use MLX:

1pip install mlx-lm
2mlx_lm.generate --model LiquidAI/LFM2.5-8B-A1B-MLX-8bit --max-tokens 512

For production inference servers, vLLM and SGLang both support the model with OpenAI-compatible APIs.

Important note: If you downloaded the model before commit feb5e04, re-download the tokenizer files. A tokenizer update fixed tool-calling issues in llama.cpp.

How It Compares

vs. Qwen2.5-7B-Instruct (dense 7B)

Qwen2.5-7B is a dense model with 7.6B active parameters. It requires ~15GB VRAM at FP16 and ~5GB at Q4_K_M.

LFM2.5-8B-A1B advantages:

Higher throughput due to 1.5B active parameters (2-3x faster generation)
Better tool calling and function-calling scores
More efficient multilingual support with 128K vocabulary
Explicit chain-of-thought reasoning

Qwen2.5-7B advantages:

Broader knowledge base (more training data on general topics)
Stronger programming capabilities
No forced chain-of-thought (faster for simple queries)
Larger ecosystem and community support

Choose LFM2.5-8B-A1B when: You need a local agent that makes many tool calls, runs on limited hardware, or requires multilingual support. Choose Qwen2.5-7B when you need general-purpose chat, coding, or knowledge tasks without the overhead of reasoning tokens.

vs. Phi-3.5-MoE-instruct (MoE 6.6B total / ~3.8B active)

Phi-3.5-MoE has 6.6B total parameters with ~3.8B active — more than double LFM2.5's active parameters.

LFM2.5-8B-A1B advantages:

Lower active parameters (1.5B vs 3.8B) means higher throughput
128K context vs 32K
Better reasoning benchmarks (88.76 MATH500 vs ~76 for Phi-3.5-MoE)
Better multilingual support

Phi-3.5-MoE advantages:

Smaller total size (6.6B vs 8.3B) means lower disk and RAM requirements
No forced chain-of-thought
More mature ecosystem (released earlier, more community quantizations)

Choose LFM2.5-8B-A1B when: You need the combination of small active parameters, long context, and strong reasoning. Choose Phi-3.5-MoE when you need the smallest possible footprint and don't need long context or multilingual support.

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

8.3B

Liquid AI

LFM2.5-8B-A1B

8.3B paramsMoE128K ctx

View on Hugging Face Official Page

Our Take

Best for: Strongest at IFBench in its size class

A workable 8.3B-parameter MoE language model from Liquid AI. Pulls ahead on IFBench (56/100), so reach for it when that's the dimension that matters.

Run this onAMD Radeon RX 7600 8GBCheapest card in our directory with comfortable headroom (8 GB) for this model at Q4 (~2.9 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Reasoning

Function Calling

Math

Multilingual

Instruction Following

Model Specifications

Parameters8.3B

Active Params1.5B

ArchitectureMoE

Context Length128K tokens

ModalityText Only

ProviderLiquid AI

Download Size17.0 GB

Community

Monthly Downloads200.9K

Likes638

Last Updated17 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

LFM Open License v1.0View Full License

Performance & Scoring

Benchmarks

GPQA

51.3

HLE

6.9

AA Intelligence Index

8.3

7.8

4.5

55.6

0.0

MBA Open Score

46.1CC

Benchmark40%

19.2

Popularity25%

38.3

Efficiency20%

91.5

Versatility15%

70.0

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	2.6 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	2.9 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	3.1 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	3.2 GB	Excellent	Near-lossless quality with manageable size
Q8_0	3.6 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	5.0 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


AMD Radeon RX 7600 8GBAMD	SS	79.8 tok/s	2.9 GB
NVIDIA GeForce RTX 4060NVIDIA	SS	75.3 tok/s	2.9 GB
NVIDIA GeForce RTX 5060 Ti 8GBNVIDIA	SS	124.1 tok/s	2.9 GB
AMD Radeon RX 7700 XTAMD	SS	119.7 tok/s	2.9 GB
Intel Arc B580Intel	SS	126.3 tok/s	2.9 GB
NVIDIA GeForce RTX 4070NVIDIA	SS	139.6 tok/s	2.9 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	SS	139.6 tok/s	2.9 GB
NVIDIA GeForce RTX 5070NVIDIA	SS	186.2 tok/s	2.9 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	AA	141.8 tok/s	2.9 GB
AMD Radeon RX 7800 XTAMD	AA	172.9 tok/s	2.9 GB
AMD Radeon RX 9070AMD	AA	177.3 tok/s	2.9 GB
AMD Radeon RX 9070 XTAMD	AA	177.3 tok/s	2.9 GB
Google Cloud TPU v5eGoogle	AA	226.9 tok/s	2.9 GB
Intel Arc A770 16GBIntel	AA	155.1 tok/s	2.9 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	AA	265.9 tok/s	2.9 GB
NVIDIA GeForce RTX 4060 Ti 16GBNVIDIA	AA	79.8 tok/s	2.9 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	AA	186.2 tok/s	2.9 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	AA	203.9 tok/s	2.9 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	AA	124.1 tok/s	2.9 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	AA	248.2 tok/s	2.9 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	AA	265.9 tok/s	2.9 GB
AMD Radeon RX 7900 XTAMD	AA	221.6 tok/s	2.9 GB
AMD Radeon RX 7900 XTXAMD	AA	265.9 tok/s	2.9 GB
NVIDIA GeForce RTX 3090NVIDIA	AA	259.3 tok/s	2.9 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	AA	279.2 tok/s	2.9 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~80 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)LFM2.5-8B-A1B on AMD Radeon RX 7600 8GB · ~80 tok/s · 165W	$0.069
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 3 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

Architecture & Technical Details

Key specs:

Total parameters: 8.3B
Active parameters: 1.5B
Architecture: MoE with hybrid gated short-convolution + GQA
Context length: 128,000 tokens
Vocabulary size: 128,000 tokens
Training data: 38 trillion tokens
Modality: Text-only

Capabilities & Use Cases

LFM2.5-8B-A1B is a general-purpose text model, but it has clear strengths and weaknesses.

Strengths:

Tool calling and function-calling: This is the primary use case. The model scores 64.36 on BFCLv3 and 48.50 on BFCLv4. It can chain multiple tool calls and follow complex instruction sequences.
Instruction following: 91.84 on IFEval, 79.93 on Multi-IF. It handles constrained generation and structured outputs reliably.
Math and reasoning: 88.76 on MATH500, 42.53 on AIME25. The explicit chain-of-thought helps with multi-step problems.
Multilingual support: English, Arabic, Chinese, French, German, Italian, Japanese, Korean, Portuguese, Spanish. The expanded tokenizer makes it more efficient for non-Latin scripts than its predecessor.
On-device agents: Designed for personal assistants that run locally, with day-one support for llama.cpp, MLX, vLLM, and SGLang.

Weaknesses:

Not ideal for heavy programming or knowledge-intensive QA without retrieval augmentation.
The reasoning requirement means every response includes a chain of thought, which adds latency for simple queries.
Being a reasoning model, it may over-think straightforward questions.

Concrete use cases:

Local agent that calls APIs, reads files, and executes commands based on natural language instructions
Multilingual customer-facing assistant running on a laptop or tablet
Document analysis with retrieval-augmented generation (RAG) using the 128K context window
Privacy-sensitive applications where data cannot leave the device

Running LFM2.5-8B-A1B Locally

This is where LFM2.5-8B-A1B justifies its existence. With only 1.5B active parameters, it runs on hardware that would struggle with dense 7B models.

Hardware Requirements

Minimum (quantized, CPU inference):

8GB system RAM
Any modern CPU with AVX2 support
Expect 1-3 tokens/second on CPU with Q4_K_M quantization

Recommended (GPU inference):

NVIDIA RTX 3060 12GB or better (12GB VRAM minimum for FP16)
Apple Silicon Mac with 16GB+ unified memory (M2 Pro or better)
AMD Radeon RX 6800 XT or better (16GB VRAM)

Ideal:

RTX 4090 (24GB VRAM): runs full FP16 with room to spare
M4 Max (64GB+ unified memory): runs FP16 comfortably
Any GPU with 16GB+ VRAM for Q4_K_M at high throughput

Quantization and VRAM

Quantization	VRAM Required	Quality Retention	Recommended Use
FP16	~16GB	100%	GPU with 24GB+ VRAM
Q8_0	~8.5GB	~99%	GPU with 12-16GB VRAM
Q4_K_M	~5.5GB	~95%	Best balance for most users
Q3_K_M	~4.5GB	~90%	8GB VRAM GPUs
Q2_K	~3.5GB	~85%	6GB VRAM or CPU inference

Performance Expectations

On an RTX 4090 with Q4_K_M:

50-80 tokens/second (prompt processing)
30-50 tokens/second (generation)

On an M4 Max (64GB) with MLX Q8:

40-60 tokens/second

On CPU with Q4_K_M (16-core modern CPU):

3-6 tokens/second

Quick Start

The fastest way to get up and running is via Ollama:

1ollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF:Q4_K_M

For llama.cpp directly:

1llama-cli -hf LiquidAI/LFM2.5-8B-A1B-GGUF -c 4096 --color -i --temp 0.2 --top-k 80 --repeat-penalty 1.05

For Apple Silicon, use MLX:

1pip install mlx-lm
2mlx_lm.generate --model LiquidAI/LFM2.5-8B-A1B-MLX-8bit --max-tokens 512

For production inference servers, vLLM and SGLang both support the model with OpenAI-compatible APIs.

Important note: If you downloaded the model before commit feb5e04, re-download the tokenizer files. A tokenizer update fixed tool-calling issues in llama.cpp.

How It Compares

vs. Qwen2.5-7B-Instruct (dense 7B)

Qwen2.5-7B is a dense model with 7.6B active parameters. It requires ~15GB VRAM at FP16 and ~5GB at Q4_K_M.

LFM2.5-8B-A1B advantages:

Higher throughput due to 1.5B active parameters (2-3x faster generation)
Better tool calling and function-calling scores
More efficient multilingual support with 128K vocabulary
Explicit chain-of-thought reasoning

Qwen2.5-7B advantages:

Broader knowledge base (more training data on general topics)
Stronger programming capabilities
No forced chain-of-thought (faster for simple queries)
Larger ecosystem and community support

vs. Phi-3.5-MoE-instruct (MoE 6.6B total / ~3.8B active)

Phi-3.5-MoE has 6.6B total parameters with ~3.8B active — more than double LFM2.5's active parameters.

LFM2.5-8B-A1B advantages:

Lower active parameters (1.5B vs 3.8B) means higher throughput
128K context vs 32K
Better reasoning benchmarks (88.76 MATH500 vs ~76 for Phi-3.5-MoE)
Better multilingual support

Phi-3.5-MoE advantages:

Smaller total size (6.6B vs 8.3B) means lower disk and RAM requirements
No forced chain-of-thought
More mature ecosystem (released earlier, more community quantizations)

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.