A 3B-parameter dense reasoning model from WeiboAI, fine-tuned from Qwen2.5-Coder-3B and released under the MIT license. It uses a Spectrum-to-Signal post-training pipeline and targets verifiable reasoning in math, coding, and STEM rather than general chat or tool use. It scores 94.3 on AIME 2026, 89.3 on HMMT 2025, 70.2 on GPQA-Diamond, and 80.2 Pass@1 on LiveCodeBench v6. With its Claim-Level Reliability test-time scaling the AIME 2026 score rises to 97.1, which the authors report rivals much larger frontier models.
A workable 3B-parameter dense language model from WeiboAI. A pragmatic middle-ground choice when you need open weights without a flagship-sized footprint. Newly released, so production-readiness is still being shaken out.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
No benchmark data available for this model yet.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | |
|---|---|---|---|
| Q2_K | 3.2 GB | Low | |
| Q4_K_MRecommended | 3.8 GB | Good | |
| Q5_K_M | 4.1 GB | Very Good | |
| Q6_K | 4.5 GB | Excellent | |
| Q8_0 | 5.2 GB | Near Perfect | |
| FP16 | 8.1 GB | Full |
See which devices can run this model and at what quality level.
| AA | 60.8 tok/s | 3.8 GB | ||
NVIDIA GeForce RTX 4060NVIDIA | AA | 57.4 tok/s | 3.8 GB | |
| AA | 94.6 tok/s | 3.8 GB | ||
| AA | 91.2 tok/s | 3.8 GB | ||
Intel Arc B580Intel | AA | 96.3 tok/s | 3.8 GB | |
NVIDIA GeForce RTX 4070NVIDIA | AA | 106.4 tok/s | 3.8 GB | |
| AA | 106.4 tok/s | 3.8 GB | ||
NVIDIA GeForce RTX 5070NVIDIA | AA | 141.9 tok/s | 3.8 GB | |
| AA | 108.1 tok/s | 3.8 GB | ||
| AA | 131.7 tok/s | 3.8 GB | ||
| AA | 135.1 tok/s | 3.8 GB | ||
| AA | 135.1 tok/s | 3.8 GB | ||
Google Cloud TPU v5eGoogle | AA | 172.9 tok/s | 3.8 GB | |
Intel Arc A770 16GBIntel | AA | 118.2 tok/s | 3.8 GB | |
| AA | 202.7 tok/s | 3.8 GB | ||
| AA | 60.8 tok/s | 3.8 GB | ||
| AA | 141.9 tok/s | 3.8 GB | ||
| AA | 155.4 tok/s | 3.8 GB | ||
| AA | 94.6 tok/s | 3.8 GB | ||
| AA | 189.2 tok/s | 3.8 GB | ||
| AA | 202.7 tok/s | 3.8 GB | ||
| AA | 168.9 tok/s | 3.8 GB | ||
| AA | 202.7 tok/s | 3.8 GB | ||
NVIDIA GeForce RTX 3090NVIDIA | AA | 197.6 tok/s | 3.8 GB | |
| AA | 212.8 tok/s | 3.8 GB |
Energy cost on AMD Radeon RX 7600 8GB (~61 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
Local (energy only)VibeThinker-3B on AMD Radeon RX 7600 8GB · ~61 tok/s · 165W | $0.090 |
GPT-5.5OpenAI · in $5.00 · out $30.00 | $12.50 |
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00 | $11.00 |
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00 | $3.75 |
Grok 4.3xAI · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
Cheapest current cloud rentals with at least 4 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM | $0.03 |
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM | $0.03 |
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
VibeThinker-3B is a 3-billion-parameter dense reasoning model from WeiboAI, fine-tuned from Qwen2.5-Coder-3B and released under the MIT license. It is not a general-purpose chatbot or agent framework — it is a specialized verifiable-reasoning engine designed for math, competitive programming, and STEM problems where answers can be checked precisely. This focus is what makes it stand out: the model’s post-training pipeline (Spectrum-to-Signal) optimizes for tasks with clear correctness signals, not open-ended dialogue or tool use.
At 3B parameters, VibeThinker-3B competes with other small-to-medium reasoning models (e.g., DeepSeek-R1-Distill-Qwen-1.5B, Qwen2.5-Coder-3B-Instruct) but punches far above its weight on benchmarks. It scores 94.3 on AIME 2026, 89.3 on HMMT 2025, 70.2 on GPQA-Diamond, and 80.2 Pass@1 on LiveCodeBench v6. With claim-level reliability (CLR) test-time scaling, the AIME 2026 score jumps to 97.1 — rivaling frontier models orders of magnitude larger. This makes it a compelling option for practitioners who need strong reasoning in a package small enough to run on consumer hardware.
VibeThinker-3B is a dense transformer with 3B parameters — no mixture-of-experts, no sparse activation. This means all 3B parameters are active for every forward pass, making inference predictable in terms of VRAM and throughput. The architecture is based on Qwen2.5-Coder-3B, which uses a standard decoder-only causal attention design with RoPE (Rotary Position Embedding) and SwiGLU feedforward layers.
Context length is 65,536 tokens — unusually long for a 3B model. This is critical for multi-step reasoning problems (e.g., solving a 10-page math proof or debugging a 2,000-line code snippet). When using the model on hard benchmarks like AMOBench, WeiboAI recommends setting max_tokens to 60K–100K to allow the model enough space to think through extreme difficulty problems.
The model uses bfloat16 precision natively; 4-bit quantization (e.g., Q4_K_M via llama.cpp) reduces VRAM to roughly 2.5–3 GB while retaining most of the reasoning quality. No MoE means no routing overhead — latency is purely a function of prompt and generation length.
VibeThinker-3B excels where the answer can be verified against a ground truth — what WeiboAI calls “verifiable reasoning.” This includes:
What it is not for: tool-calling, autonomous coding agents, API orchestration, or general-purpose conversation. The model was intentionally not trained on function-calling or agent data. Use it for batch evaluation of challenging problems, automated grading, or as a reasoning engine in a pipeline that provides verification signals.
This is where VibeThinker-3B shines for practitioners. The 3B dense architecture fits comfortably on consumer GPUs and even some laptops.
VRAM requirements (inference):
| Quantization | VRAM (approx) | Quality impact |
|---|---|---|
| bfloat16 (native) | 6–7 GB | Full precision |
| Q4_K_M (llama.cpp) | 2.5–3 GB | Minimal loss |
| Q3_K_M | 2 GB | Slight degradation |
| Q2_K | 1.5 GB | Noticeable drop |
Recommended hardware:
Expected tokens per second (single batch, 4090 at bfloat16): ~80–100 t/s for generation, ~150–200 t/s ingest. With Q4_K_M, expect 120–150 t/s on the same GPU.
Quickest way to start: Use Ollama. Pull the model (once it’s added to the official library) or run it directly with llama.cpp using a GGUF conversion. Example command:
1ollama run vibethinker-3b
For maximum control, download the HuggingFace transformers checkpoint and load it in your Python script:
1from transformers import AutoModelForCausalLM, AutoTokenizer23model = AutoModelForCausalLM.from_pretrained("WeiboAI/VibeThinker-3B", torch_dtype="bfloat16").to("cuda")4tokenizer = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-3B")
vs. Qwen2.5-Coder-3B-Instruct (base model):
vs. DeepSeek-R1-Distill-Qwen-1.5B (similar small reasoning model):
Verifiable reasoning specialty: VibeThinker-3B deliberately sacrifices broad knowledge and conversational ability to maximize performance on tasks with clear answers. If you need a model for open-ended Q&A or creative writing, look elsewhere. If you need a reasoning copilot that can solve IMO-level problems and LeetCode hards on your local machine, this is the best option at its size.