WeiboAI

VibeThinker-3B

A 3B-parameter dense reasoning model from WeiboAI, fine-tuned from Qwen2.5-Coder-3B and released under the MIT license. It uses a Spectrum-to-Signal post-training pipeline and targets verifiable reasoning in math, coding, and STEM rather than general chat or tool use. It scores 94.3 on AIME 2026, 89.3 on HMMT 2025, 70.2 on GPQA-Diamond, and 80.2 Pass@1 on LiveCodeBench v6. With its Claim-Level Reliability test-time scaling the AIME 2026 score rises to 97.1, which the authors report rivals much larger frontier models.

3B paramsDense66K ctx

View on Hugging Face Source Code Official Page

Our Take

Best for: Local inference under 16 GB VRAM

A workable 3B-parameter dense language model from WeiboAI. A pragmatic middle-ground choice when you need open weights without a flagship-sized footprint. Newly released, so production-readiness is still being shaken out.

Run this onAMD Radeon RX 7600 8GBCheapest card in our directory with comfortable headroom (8 GB) for this model at Q4 (~3.8 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Reasoning

Math

Code Generation

Model Specifications

Parameters3B

ArchitectureDense

Context Length66K tokens

ModalityText Only

ProviderWeiboAI

Download Size6.2 GB

Community

Monthly Downloads54.6K

Likes719

Last Updated8 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

MITView Full License

Performance & Scoring

Benchmarks

No benchmark data available for this model yet.

MBA Open Score

51.0CC

Benchmark40%

45.0

Popularity25%

20.7

Efficiency20%

95.8

Versatility15%

57.5

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	3.2 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	3.8 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	4.1 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	4.5 GB	Excellent	Near-lossless quality with manageable size
Q8_0	5.2 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	8.1 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


AMD Radeon RX 7600 8GBAMD	AA	60.8 tok/s	3.8 GB
NVIDIA GeForce RTX 4060NVIDIA	AA	57.4 tok/s	3.8 GB
NVIDIA GeForce RTX 5060 Ti 8GBNVIDIA	AA	94.6 tok/s	3.8 GB
AMD Radeon RX 7700 XTAMD	AA	91.2 tok/s	3.8 GB
Intel Arc B580Intel	AA	96.3 tok/s	3.8 GB
NVIDIA GeForce RTX 4070NVIDIA	AA	106.4 tok/s	3.8 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	AA	106.4 tok/s	3.8 GB
NVIDIA GeForce RTX 5070NVIDIA	AA	141.9 tok/s	3.8 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	AA	108.1 tok/s	3.8 GB
AMD Radeon RX 7800 XTAMD	AA	131.7 tok/s	3.8 GB
AMD Radeon RX 9070AMD	AA	135.1 tok/s	3.8 GB
AMD Radeon RX 9070 XTAMD	AA	135.1 tok/s	3.8 GB
Google Cloud TPU v5eGoogle	AA	172.9 tok/s	3.8 GB
Intel Arc A770 16GBIntel	AA	118.2 tok/s	3.8 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	AA	202.7 tok/s	3.8 GB
NVIDIA GeForce RTX 4060 Ti 16GBNVIDIA	AA	60.8 tok/s	3.8 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	AA	141.9 tok/s	3.8 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	AA	155.4 tok/s	3.8 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	AA	94.6 tok/s	3.8 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	AA	189.2 tok/s	3.8 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	AA	202.7 tok/s	3.8 GB
AMD Radeon RX 7900 XTAMD	AA	168.9 tok/s	3.8 GB
AMD Radeon RX 7900 XTXAMD	AA	202.7 tok/s	3.8 GB
NVIDIA GeForce RTX 3090NVIDIA	AA	197.6 tok/s	3.8 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	AA	212.8 tok/s	3.8 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~61 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)VibeThinker-3B on AMD Radeon RX 7600 8GB · ~61 tok/s · 165W	$0.090
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 4 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

VibeThinker-3B is a 3-billion-parameter dense reasoning model from WeiboAI, fine-tuned from Qwen2.5-Coder-3B and released under the MIT license. It is not a general-purpose chatbot or agent framework — it is a specialized verifiable-reasoning engine designed for math, competitive programming, and STEM problems where answers can be checked precisely. This focus is what makes it stand out: the model’s post-training pipeline (Spectrum-to-Signal) optimizes for tasks with clear correctness signals, not open-ended dialogue or tool use.

At 3B parameters, VibeThinker-3B competes with other small-to-medium reasoning models (e.g., DeepSeek-R1-Distill-Qwen-1.5B, Qwen2.5-Coder-3B-Instruct) but punches far above its weight on benchmarks. It scores 94.3 on AIME 2026, 89.3 on HMMT 2025, 70.2 on GPQA-Diamond, and 80.2 Pass@1 on LiveCodeBench v6. With claim-level reliability (CLR) test-time scaling, the AIME 2026 score jumps to 97.1 — rivaling frontier models orders of magnitude larger. This makes it a compelling option for practitioners who need strong reasoning in a package small enough to run on consumer hardware.

Architecture & Technical Details

VibeThinker-3B is a dense transformer with 3B parameters — no mixture-of-experts, no sparse activation. This means all 3B parameters are active for every forward pass, making inference predictable in terms of VRAM and throughput. The architecture is based on Qwen2.5-Coder-3B, which uses a standard decoder-only causal attention design with RoPE (Rotary Position Embedding) and SwiGLU feedforward layers.

Context length is 65,536 tokens — unusually long for a 3B model. This is critical for multi-step reasoning problems (e.g., solving a 10-page math proof or debugging a 2,000-line code snippet). When using the model on hard benchmarks like AMOBench, WeiboAI recommends setting max_tokens to 60K–100K to allow the model enough space to think through extreme difficulty problems.

The model uses bfloat16 precision natively; 4-bit quantization (e.g., Q4_K_M via llama.cpp) reduces VRAM to roughly 2.5–3 GB while retaining most of the reasoning quality. No MoE means no routing overhead — latency is purely a function of prompt and generation length.

Capabilities & Use Cases

VibeThinker-3B excels where the answer can be verified against a ground truth — what WeiboAI calls “verifiable reasoning.” This includes:

Mathematical reasoning: Olympiad-level competition problems (AIME, HMMT, IMO AnswerBench). The model can handle multi-step algebra, geometry, number theory, and combinatorics.
Competitive programming: LeetCode-style algorithmic problems, especially hard (H) and extreme (E) difficulty. It scores 80.2% Pass@1 on LiveCodeBench v6 and 96.1% acceptance rate on recent unseen LeetCode contests.
STEM reasoning: Physics, chemistry, and engineering problems that require step-by-step derivation and constraint satisfaction.
Instruction following with explicit constraints: IFEval score of 93.4 confirms it respects formatting, length, and content restrictions — useful for generating structured reports.

What it is not for: tool-calling, autonomous coding agents, API orchestration, or general-purpose conversation. The model was intentionally not trained on function-calling or agent data. Use it for batch evaluation of challenging problems, automated grading, or as a reasoning engine in a pipeline that provides verification signals.

Running VibeThinker-3B Locally

This is where VibeThinker-3B shines for practitioners. The 3B dense architecture fits comfortably on consumer GPUs and even some laptops.

VRAM requirements (inference):

Quantization	VRAM (approx)	Quality impact
bfloat16 (native)	6–7 GB	Full precision
Q4_K_M (llama.cpp)	2.5–3 GB	Minimal loss
Q3_K_M	2 GB	Slight degradation
Q2_K	1.5 GB	Noticeable drop

Recommended hardware:

Minimum: NVIDIA RTX 3060 (12 GB) — can run bfloat16 comfortably with headroom for context.
Recommended: RTX 4090 (24 GB) — allows full 65K context at bfloat16 with fast generation.
Apple Silicon: M4 Max (128 GB unified memory) — runs at full precision easily. M3 Pro (18 GB) can handle Q4_K_M with moderate context.
CPU-only: Possible with llama.cpp on a modern CPU (e.g., AMD Ryzen 9) at ~1–2 tokens/sec with Q4_K_M and 64 GB RAM.

Expected tokens per second (single batch, 4090 at bfloat16): ~80–100 t/s for generation, ~150–200 t/s ingest. With Q4_K_M, expect 120–150 t/s on the same GPU.

Quickest way to start: Use Ollama. Pull the model (once it’s added to the official library) or run it directly with llama.cpp using a GGUF conversion. Example command:

1ollama run vibethinker-3b

For maximum control, download the HuggingFace transformers checkpoint and load it in your Python script:

1from transformers import AutoModelForCausalLM, AutoTokenizer
2
3model = AutoModelForCausalLM.from_pretrained("WeiboAI/VibeThinker-3B", torch_dtype="bfloat16").to("cuda")
4tokenizer = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-3B")

How It Compares

vs. Qwen2.5-Coder-3B-Instruct (base model):

VibeThinker-3B is significantly stronger on math and reasoning benchmarks but weaker on general coding tasks (e.g., function calling, agentic workflows). If your workload is strictly competitive programming or math, VibeThinker is the better pick. If you need a balanced coding assistant, Qwen2.5-Coder is more versatile.

vs. DeepSeek-R1-Distill-Qwen-1.5B (similar small reasoning model):

VibeThinker-3B has 2x parameters and dramatically higher benchmark scores (e.g., AIME 94.3 vs. ~30 for the 1.5B distill). It also supports a much longer context (65K vs. 32K). The tradeoff is VRAM: VibeThinker-3B requires ~3 GB at Q4 vs. ~1.5 GB for the 1.5B model. For those with at least an RTX 3060, VibeThinker-3B offers frontier-level reasoning in a small package.

Verifiable reasoning specialty: VibeThinker-3B deliberately sacrifices broad knowledge and conversational ability to maximize performance on tasks with clear answers. If you need a model for open-ended Q&A or creative writing, look elsewhere. If you need a reasoning copilot that can solve IMO-level problems and LeetCode hards on your local machine, this is the best option at its size.

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

WeiboAI

VibeThinker-3B

3B paramsDense66K ctx

View on Hugging Face Source Code Official Page

Our Take

Best for: Local inference under 16 GB VRAM

Run this onAMD Radeon RX 7600 8GBCheapest card in our directory with comfortable headroom (8 GB) for this model at Q4 (~3.8 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Reasoning

Math

Code Generation

Model Specifications

Parameters3B

ArchitectureDense

Context Length66K tokens

ModalityText Only

ProviderWeiboAI

Download Size6.2 GB

Community

Monthly Downloads54.6K

Likes719

Last Updated8 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

MITView Full License

Performance & Scoring

Benchmarks

No benchmark data available for this model yet.

MBA Open Score

51.0CC

Benchmark40%

45.0

Popularity25%

20.7

Efficiency20%

95.8

Versatility15%

57.5

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	3.2 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	3.8 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	4.1 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	4.5 GB	Excellent	Near-lossless quality with manageable size
Q8_0	5.2 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	8.1 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


AMD Radeon RX 7600 8GBAMD	AA	60.8 tok/s	3.8 GB
NVIDIA GeForce RTX 4060NVIDIA	AA	57.4 tok/s	3.8 GB
NVIDIA GeForce RTX 5060 Ti 8GBNVIDIA	AA	94.6 tok/s	3.8 GB
AMD Radeon RX 7700 XTAMD	AA	91.2 tok/s	3.8 GB
Intel Arc B580Intel	AA	96.3 tok/s	3.8 GB
NVIDIA GeForce RTX 4070NVIDIA	AA	106.4 tok/s	3.8 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	AA	106.4 tok/s	3.8 GB
NVIDIA GeForce RTX 5070NVIDIA	AA	141.9 tok/s	3.8 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	AA	108.1 tok/s	3.8 GB
AMD Radeon RX 7800 XTAMD	AA	131.7 tok/s	3.8 GB
AMD Radeon RX 9070AMD	AA	135.1 tok/s	3.8 GB
AMD Radeon RX 9070 XTAMD	AA	135.1 tok/s	3.8 GB
Google Cloud TPU v5eGoogle	AA	172.9 tok/s	3.8 GB
Intel Arc A770 16GBIntel	AA	118.2 tok/s	3.8 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	AA	202.7 tok/s	3.8 GB
NVIDIA GeForce RTX 4060 Ti 16GBNVIDIA	AA	60.8 tok/s	3.8 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	AA	141.9 tok/s	3.8 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	AA	155.4 tok/s	3.8 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	AA	94.6 tok/s	3.8 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	AA	189.2 tok/s	3.8 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	AA	202.7 tok/s	3.8 GB
AMD Radeon RX 7900 XTAMD	AA	168.9 tok/s	3.8 GB
AMD Radeon RX 7900 XTXAMD	AA	202.7 tok/s	3.8 GB
NVIDIA GeForce RTX 3090NVIDIA	AA	197.6 tok/s	3.8 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	AA	212.8 tok/s	3.8 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~61 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)VibeThinker-3B on AMD Radeon RX 7600 8GB · ~61 tok/s · 165W	$0.090
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 4 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

Architecture & Technical Details

Capabilities & Use Cases

VibeThinker-3B excels where the answer can be verified against a ground truth — what WeiboAI calls “verifiable reasoning.” This includes:

Mathematical reasoning: Olympiad-level competition problems (AIME, HMMT, IMO AnswerBench). The model can handle multi-step algebra, geometry, number theory, and combinatorics.
Competitive programming: LeetCode-style algorithmic problems, especially hard (H) and extreme (E) difficulty. It scores 80.2% Pass@1 on LiveCodeBench v6 and 96.1% acceptance rate on recent unseen LeetCode contests.
STEM reasoning: Physics, chemistry, and engineering problems that require step-by-step derivation and constraint satisfaction.
Instruction following with explicit constraints: IFEval score of 93.4 confirms it respects formatting, length, and content restrictions — useful for generating structured reports.

Running VibeThinker-3B Locally

This is where VibeThinker-3B shines for practitioners. The 3B dense architecture fits comfortably on consumer GPUs and even some laptops.

VRAM requirements (inference):

Quantization	VRAM (approx)	Quality impact
bfloat16 (native)	6–7 GB	Full precision
Q4_K_M (llama.cpp)	2.5–3 GB	Minimal loss
Q3_K_M	2 GB	Slight degradation
Q2_K	1.5 GB	Noticeable drop

Recommended hardware:

Minimum: NVIDIA RTX 3060 (12 GB) — can run bfloat16 comfortably with headroom for context.
Recommended: RTX 4090 (24 GB) — allows full 65K context at bfloat16 with fast generation.
Apple Silicon: M4 Max (128 GB unified memory) — runs at full precision easily. M3 Pro (18 GB) can handle Q4_K_M with moderate context.
CPU-only: Possible with llama.cpp on a modern CPU (e.g., AMD Ryzen 9) at ~1–2 tokens/sec with Q4_K_M and 64 GB RAM.

Expected tokens per second (single batch, 4090 at bfloat16): ~80–100 t/s for generation, ~150–200 t/s ingest. With Q4_K_M, expect 120–150 t/s on the same GPU.

Quickest way to start: Use Ollama. Pull the model (once it’s added to the official library) or run it directly with llama.cpp using a GGUF conversion. Example command:

1ollama run vibethinker-3b

For maximum control, download the HuggingFace transformers checkpoint and load it in your Python script:

1from transformers import AutoModelForCausalLM, AutoTokenizer
2
3model = AutoModelForCausalLM.from_pretrained("WeiboAI/VibeThinker-3B", torch_dtype="bfloat16").to("cuda")
4tokenizer = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-3B")

How It Compares

vs. Qwen2.5-Coder-3B-Instruct (base model):

VibeThinker-3B is significantly stronger on math and reasoning benchmarks but weaker on general coding tasks (e.g., function calling, agentic workflows). If your workload is strictly competitive programming or math, VibeThinker is the better pick. If you need a balanced coding assistant, Qwen2.5-Coder is more versatile.

vs. DeepSeek-R1-Distill-Qwen-1.5B (similar small reasoning model):

VibeThinker-3B has 2x parameters and dramatically higher benchmark scores (e.g., AIME 94.3 vs. ~30 for the 1.5B distill). It also supports a much longer context (65K vs. 32K). The tradeoff is VRAM: VibeThinker-3B requires ~3 GB at Q4 vs. ~1.5 GB for the 1.5B model. For those with at least an RTX 3060, VibeThinker-3B offers frontier-level reasoning in a small package.

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.