Liquid AI's on-device Mixture-of-Experts model with 8.3B total parameters and about 1.5B active per token, built on the LFM2.5 hybrid architecture of gated short-convolution and grouped-query attention blocks (24 layers in total). It is a text-only reasoning model that writes an explicit chain of thought before its final answer, supports a 128K-token context, and was pretrained on 38 trillion tokens. Liquid positions it for agentic tool use and private on-device assistants, citing 91.84 on IFEval, 88.76 on MATH500, and 88.07 on Tau² Telecom. The open-weight model runs fully on phones, laptops, and PCs and ships under the LFM Open License v1.0.
A workable 8.3B-parameter MoE language model from Liquid AI. Pulls ahead on IFBench (56/100), so reach for it when that's the dimension that matters.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | |
|---|---|---|---|
| Q2_K | 2.6 GB | Low | |
| Q4_K_MRecommended | 2.9 GB | Good | |
| Q5_K_M | 3.1 GB | Very Good | |
| Q6_K | 3.2 GB | Excellent | |
| Q8_0 | 3.6 GB | Near Perfect | |
| FP16 | 5.0 GB | Full |
See which devices can run this model and at what quality level.
| SS | 79.8 tok/s | 2.9 GB | ||
NVIDIA GeForce RTX 4060NVIDIA | SS | 75.3 tok/s | 2.9 GB | |
| SS | 124.1 tok/s | 2.9 GB | ||
| SS | 119.7 tok/s | 2.9 GB | ||
Intel Arc B580Intel | SS | 126.3 tok/s | 2.9 GB | |
NVIDIA GeForce RTX 4070NVIDIA | SS | 139.6 tok/s | 2.9 GB | |
| SS | 139.6 tok/s | 2.9 GB | ||
NVIDIA GeForce RTX 5070NVIDIA | SS | 186.2 tok/s | 2.9 GB | |
| AA | 141.8 tok/s | 2.9 GB | ||
| AA | 172.9 tok/s | 2.9 GB | ||
| AA | 177.3 tok/s | 2.9 GB | ||
| AA | 177.3 tok/s | 2.9 GB | ||
Google Cloud TPU v5eGoogle | AA | 226.9 tok/s | 2.9 GB | |
Intel Arc A770 16GBIntel | AA | 155.1 tok/s | 2.9 GB | |
| AA | 265.9 tok/s | 2.9 GB | ||
| AA | 79.8 tok/s | 2.9 GB | ||
| AA | 186.2 tok/s | 2.9 GB | ||
| AA | 203.9 tok/s | 2.9 GB | ||
| AA | 124.1 tok/s | 2.9 GB | ||
| AA | 248.2 tok/s | 2.9 GB | ||
| AA | 265.9 tok/s | 2.9 GB | ||
| AA | 221.6 tok/s | 2.9 GB | ||
| AA | 265.9 tok/s | 2.9 GB | ||
NVIDIA GeForce RTX 3090NVIDIA | AA | 259.3 tok/s | 2.9 GB | |
| AA | 279.2 tok/s | 2.9 GB |
Energy cost on AMD Radeon RX 7600 8GB (~80 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
Local (energy only)LFM2.5-8B-A1B on AMD Radeon RX 7600 8GB · ~80 tok/s · 165W | $0.069 |
GPT-5.5OpenAI · in $5.00 · out $30.00 | $12.50 |
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00 | $11.00 |
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00 | $3.75 |
Grok 4.3xAI · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
Cheapest current cloud rentals with at least 3 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM | $0.03 |
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM | $0.03 |
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
LFM2.5-8B-A1B is Liquid AI's second-generation on-device Mixture-of-Experts model, designed explicitly for local inference on consumer hardware. With 8.3 billion total parameters and only 1.5 billion active per token, it occupies a unique position: it delivers quality competitive with dense 7B-8B models while requiring a fraction of the compute budget.
Liquid built this model for agentic workflows — tool calling, multi-step reasoning, and instruction following on devices ranging from phones to laptops. It is a text-only reasoning model that produces an explicit chain of thought before its final answer, a design choice that leverages the MoE architecture's compute-bound nature to add quality without sacrificing speed.
The model was pretrained on 38 trillion tokens, up from 12 trillion in its predecessor LFM2-8B-A1B, and underwent large-scale reinforcement learning. The result is a model that scores 91.84 on IFEval, 88.76 on MATH500, and 88.07 on Tau² Telecom — numbers that put it in striking distance of much larger models while running entirely on-device.
LFM2.5-8B-A1B uses a hybrid architecture combining gated short-convolution blocks with grouped-query attention (GQA). The model has 24 layers total: 18 double-gated LIV convolution blocks and 6 GQA blocks. This design, inherited from LFM2, balances local pattern extraction with global attention.
The MoE configuration is what makes this model practical for local use. Out of 8.3B total parameters, only 1.5B are active for any given token. This means inference speed is closer to what you'd expect from a 1.5B dense model, while the model retains the representational capacity of an 8B model. For practitioners, this translates directly to lower VRAM requirements and higher tokens per second compared to dense models of similar quality.
Key specs:
The expanded 128K context window (up from 32K in LFM2-8B-A1B) enables processing of long documents, multi-turn conversations, and extended reasoning chains. The vocabulary was doubled to 128K tokens to improve tokenization efficiency for non-Latin scripts — Liquid reports strong compression gains for Hindi, Thai, Vietnamese, Indonesian, and Arabic.
Unlike its predecessor, LFM2.5-8B-A1B is a reasoning-only model. It generates an explicit chain of thought before its final answer. This is not a gimmick: MoE models are typically compute-bound, and the small active parameter count makes each reasoning token cheap. The quality improvement from reasoning is substantial — the model's AA-Omniscience Index jumped from -78.42 to -24.70, with the non-hallucination rate improving from 7.46% to 63.47%.
LFM2.5-8B-A1B is a general-purpose text model, but it has clear strengths and weaknesses.
Strengths:
Weaknesses:
Concrete use cases:
This is where LFM2.5-8B-A1B justifies its existence. With only 1.5B active parameters, it runs on hardware that would struggle with dense 7B models.
Minimum (quantized, CPU inference):
Recommended (GPU inference):
Ideal:
| Quantization | VRAM Required | Quality Retention | Recommended Use |
|---|---|---|---|
| FP16 | ~16GB | 100% | GPU with 24GB+ VRAM |
| Q8_0 | ~8.5GB | ~99% | GPU with 12-16GB VRAM |
| Q4_K_M | ~5.5GB | ~95% | Best balance for most users |
| Q3_K_M | ~4.5GB | ~90% | 8GB VRAM GPUs |
| Q2_K | ~3.5GB | ~85% | 6GB VRAM or CPU inference |
For most practitioners, Q4_K_M is the sweet spot. It retains 95% of FP16 quality while using only 5.5GB VRAM, leaving room for the KV cache and context. At this quantization, you can run the model on an RTX 3060 12GB, RTX 4060 Ti 16GB, or any Apple Silicon Mac with 16GB+ unified memory.
On an RTX 4090 with Q4_K_M:
On an M4 Max (64GB) with MLX Q8:
On CPU with Q4_K_M (16-core modern CPU):
The fastest way to get up and running is via Ollama:
1ollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF:Q4_K_M
For llama.cpp directly:
1llama-cli -hf LiquidAI/LFM2.5-8B-A1B-GGUF -c 4096 --color -i --temp 0.2 --top-k 80 --repeat-penalty 1.05
For Apple Silicon, use MLX:
1pip install mlx-lm2mlx_lm.generate --model LiquidAI/LFM2.5-8B-A1B-MLX-8bit --max-tokens 512
For production inference servers, vLLM and SGLang both support the model with OpenAI-compatible APIs.
Important note: If you downloaded the model before commit feb5e04, re-download the tokenizer files. A tokenizer update fixed tool-calling issues in llama.cpp.
Qwen2.5-7B is a dense model with 7.6B active parameters. It requires ~15GB VRAM at FP16 and ~5GB at Q4_K_M.
LFM2.5-8B-A1B advantages:
Qwen2.5-7B advantages:
Choose LFM2.5-8B-A1B when: You need a local agent that makes many tool calls, runs on limited hardware, or requires multilingual support. Choose Qwen2.5-7B when you need general-purpose chat, coding, or knowledge tasks without the overhead of reasoning tokens.
Phi-3.5-MoE has 6.6B total parameters with ~3.8B active — more than double LFM2.5's active parameters.
LFM2.5-8B-A1B advantages:
Phi-3.5-MoE advantages:
Choose LFM2.5-8B-A1B when: You need the combination of small active parameters, long context, and strong reasoning. Choose Phi-3.5-MoE when you need the smallest possible footprint and don't need long context or multilingual support.