A 1-trillion-parameter scale thinking MoE model (with 63B active parameters) by inclusionAI (Ant Group), optimized for agentic workflows, coding, and long-horizon task execution with adaptive reasoning modes.
A workable 1000B-parameter MoE language model from inclusionAI. Pulls ahead on competition math (AIME 2026) (96/100), so reach for it when that's the dimension that matters. Newly released, so production-readiness is still being shaken out.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | |
|---|---|---|---|
| Q2_K | 155.9 GB | Low | |
| Q4_K_MRecommended | 169.2 GB | Good | |
| Q5_K_M | 175.5 GB | Very Good | |
| Q6_K | 183.0 GB | Excellent | |
| Q8_0 | 198.8 GB | Near Perfect | |
| FP16 | 258.6 GB | Full |
See which devices can run this model and at what quality level.
| SS | 38.1 tok/s | 169.2 GB | ||
| SS | 28.6 tok/s | 169.2 GB | ||
NVIDIA B200 GPUNVIDIA | SS | 38.1 tok/s | 169.2 GB | |
| SS | 33.8 tok/s | 169.2 GB | ||
| SS | 33.8 tok/s | 169.2 GB | ||
| SS | 33.8 tok/s | 169.2 GB | ||
| SS | 33.8 tok/s | 169.2 GB | ||
SuperMicro Super AI StationSuperMicro | SS | 33.8 tok/s | 169.2 GB | |
Gigabyte W775-V10-L01Gigabyte | SS | 33.8 tok/s | 169.2 GB | |
Google TPU v7 (Ironwood)Google | AA | 35.1 tok/s | 169.2 GB | |
| AA | 25.2 tok/s | 169.2 GB | ||
| BB | 3.9 tok/s | 169.2 GB | ||
| BB | 3.9 tok/s | 169.2 GB | ||
| BB | 3.8 tok/s | 169.2 GB | ||
| FF | 2.4 tok/s | 169.2 GB | ||
| FF | 1.3 tok/s | 169.2 GB | ||
| FF | 1.4 tok/s | 169.2 GB | ||
| FF | 2.1 tok/s | 169.2 GB | ||
| FF | 3.0 tok/s | 169.2 GB | ||
| FF | 3.8 tok/s | 169.2 GB | ||
| FF | 4.6 tok/s | 169.2 GB | ||
| FF | 3.0 tok/s | 169.2 GB | ||
| FF | 3.0 tok/s | 169.2 GB | ||
Apple M4Apple | FF | 0.6 tok/s | 169.2 GB | |
| FF | 2.6 tok/s | 169.2 GB |
Cheapest current cloud rentals with at least 169 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
AMD Instinct MI300XRunPod · Community · 192 GB VRAM | $0.50 |
AMD Instinct MI300XRunPod · Secure · 192 GB VRAM | $1.99 |
Ring-2.6-1T is a 1-trillion-parameter Mixture-of-Experts (MoE) reasoning model from inclusionAI, the open‑source arm of Ant Group. With 63 billion active parameters out of 1 trillion total, it delivers the reasoning depth of a massive model while keeping per‑token compute within the range of a moderately large dense model. This architecture targets the most demanding real‑world workloads: agentic workflows, complex tool‑use pipelines, long‑horizon task execution, and advanced coding.
The model is released under the MIT license, making it suitable for commercial deployment and modification. It competes directly with other large open‑source MoE models such as DeepSeek‑V3 (671B total, 37B active) and dense giants like Llama 3.1 405B. Where those models excel in general chat and instruction following, Ring‑2.6‑1T is purpose‑built for environments that require sustained reasoning, multi‑step planning, and reliable tool invocation.
Ring‑2.6‑1T uses a standard MoE transformer layout. Of the 1 trillion total parameters, exactly 63 billion are activated for each forward pass. This sparsity decouples memory footprint from compute cost: you still need to load all expert weights into VRAM, but the inference FLOPs are comparable to a 63B dense model.
high and xhigh. high uses a fixed reasoning budget; xhigh dynamically allocates more tokens for deep chain‑of‑thought, useful for math, logic, and multi‑step tool orchestration. You control this via the chat template parameter reasoning_effort.For local inference, the critical spec is total parameter count. At FP16, storing the full 1T weights requires about 2 TB of VRAM. With 4‑bit quantization (Q4_K_M), this drops to ~500 GB. Even at the lowest practical quantization, you are looking at multi‑GPU deployments—there is no single‑consumer‑GPU path for this model.
The model is optimized for four interconnected domains:
Agentic workflows – Ring‑2.6‑1T scores 87.60 on PinchBench (vs. GPT‑5.4 xHigh at ~84) and 63.82 on ClawEval, placing it among the top open‑source agents. It can decompose a user request into sub‑tasks, call tools (APIs, databases, file systems), handle errors, and revisit earlier steps.
Coding agents – Designed for multi‑file patches, repository‑level refactoring, and autonomous bug fixing. It understands diff formats, git history, and CI/CD triggers. On Tau2‑Bench (Telecom scenario) it scored 95.32, meaning it rarely fails in realistic software engineering loops.
Long‑horizon reasoning – Tasks that take minutes or hours—like scientific simulation planning, legal document analysis, or supply chain optimization—benefit from the 262K context and stable RL‑trained memory. The model maintains coherence across hundreds of tool calls or reasoning steps.
Function‑calling & instruction following – The chat template natively supports tool definitions in XML and multi‑step tool invocations. You can define a set of functions and the model will call them sequentially, re‑evaluating state after each return.
| Quantization | Total VRAM required | Minimum GPU configuration | Recommended GPU configuration |
|---|---|---|---|
| FP16 | ~2000 GB | 8× A100 80GB (NVLink) | 16× A100 80GB or 8× H100 94GB |
| Q8 | ~1000 GB | 8× A100 80GB | 8× H100 94GB |
| Q4_K_M | ~500 GB | 4× A100 80GB | 8× A100 80GB (for headroom) |
| Q2_K (experimental) | ~250 GB | 2× A100 80GB (tight) | 4× A100 80GB |
No consumer GPU (RTX 4090, M4 Max, etc.) can run this model even at Q2 because 250 GB exceeds single‑card memory. Use cases on consumer hardware are limited to CPU offloading (impractical at this scale) or cloud‑based inference.
The quickest path for multi‑GPU systems is vLLM with tensor parallelism:
1docker run --gpus all -v /path/to/model:/model vllm/vllm \2 --model /model/Ring-2.6-1T \3 --tensor-parallel-size 8 \4 --dtype bfloat16 \5 --quantization fp8 # or use --quantization awq for 4-bit
For llama.cpp (with Q4_K_M):
1./llama-server -m Ring-2.6-1T-Q4_K_M.gguf --parallel 1 --ctx-size 262144 --ngl 99 --n-gpu-layers 80
Ollama support is not yet official as of mid‑2026, but you can create a custom Modelfile using the GGUF quantization.
| Task type | Tokens/second (input) | Tokens/second (output) |
|---|---|---|
| Short prompts (1-2K tokens) | ~15 | ~12 |
| Long context (100K tokens) | ~4 | ~3 |
| Agent loops (streaming) | ~8 | ~6 |
Faster on H100s or with FP8 quantization (if supported by your inference engine).
Ring‑2.6‑1T vs DeepSeek‑V3 (671B total, 37B active)
Ring‑2.6‑1T vs Llama 3.1 405B (dense)
When to choose Ring‑2.6‑1T: You need a unified model for agentic automation, high‑quality code generation, and multimodal-free reasoning over very long contexts. You have access to a multi‑GPU server (8× A100 or better) and accept the hardware cost in exchange for top‑tier open‑source agent performance.
When to choose an alternative: Your hardware is limited to 2‑4 GPUs, your tasks are mostly short‑format Q&A, or you need to deploy on consumer GPUs today.
| $1.99 |
NVIDIA B200Vast.ai · Spot · 192 GB VRAM | $2.69 |
NVIDIA B200Vast.ai · On-Demand · 192 GB VRAM | $3.78 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Energy cost on AMD Instinct MI300X (~25 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
Local (energy only)Ring-2.6-1T on AMD Instinct MI300X · ~25 tok/s · 750W | $0.991 |
GPT-5.5OpenAI · in $5.00 · out $30.00 | $12.50 |
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00 | $11.00 |
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00 | $3.75 |
Grok 4.3xAI · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.