inclusionAI

Ring-2.6-1T

A 1-trillion-parameter scale thinking MoE model (with 63B active parameters) by inclusionAI (Ant Group), optimized for agentic workflows, coding, and long-horizon task execution with adaptive reasoning modes.

1000B paramsMoE262K ctx

View on Hugging Face Official Page

Our Take

Best for: Strongest at competition math (AIME 2026) in its size class

A situational 1000B-parameter MoE language model from inclusionAI. Pulls ahead on competition math (AIME 2026) (96/100), so reach for it when that's the dimension that matters.

Run this onAMD Instinct MI325XCheapest card in our directory with comfortable headroom (256 GB) for this model at Q4 (~169.2 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Reasoning

Function Calling

Instruction Following

Model Specifications

Parameters1000B

Active Params63B

ArchitectureMoE

Context Length262K tokens

ModalityText Only

ProviderinclusionAI

Download Size1.0 TB

Community

Monthly Downloads823

Likes103

Last Updated20 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

MITView Full License

Performance & Scoring

Benchmarks

85.7

74.0

18.3

95.8

AA Intelligence Index

30.6

42.4

28.8

44.6

64.3

MBA Open Score

37.9DD

Benchmark40%

53.8

Popularity25%

14.6

Efficiency20%

16.9

Versatility15%

62.5

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	155.9 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	169.2 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	175.5 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	183.0 GB	Excellent	Near-lossless quality with manageable size
Q8_0	198.8 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	258.6 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


AMD Instinct MI355XAMD	SS	38.1 tok/s	169.2 GB
AMD Instinct MI325XAMD	SS	28.6 tok/s	169.2 GB
NVIDIA B200 GPUNVIDIA	SS	38.1 tok/s	169.2 GB
ASUS ExpertCenter Pro ET900N G3ASUS	SS	33.8 tok/s	169.2 GB
Dell Pro Max with GB300Dell	SS	33.8 tok/s	169.2 GB
HP ZGX Fury AI StationHP	SS	33.8 tok/s	169.2 GB
MSI XpertStation WS300MSI	SS	33.8 tok/s	169.2 GB
SuperMicro Super AI StationSuperMicro	SS	33.8 tok/s	169.2 GB
Gigabyte W775-V10-L01Gigabyte	SS	33.8 tok/s	169.2 GB
Google TPU v7 (Ironwood)Google	AA	35.1 tok/s	169.2 GB
AMD Instinct MI300XAMD	AA	25.2 tok/s	169.2 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	BB	3.9 tok/s	169.2 GB
Apple Mac Studio (M3 Ultra, 2025)Apple	BB	3.9 tok/s	169.2 GB
Apple Mac Studio (M2 Ultra, 2023)Apple	BB	3.8 tok/s	169.2 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	FF	2.4 tok/s	169.2 GB
Acer Veriton GN100 AI MiniAcer	FF	1.3 tok/s	169.2 GB
AMD Radeon RX 7600 8GBAMD	FF	1.4 tok/s	169.2 GB
AMD Radeon RX 7700 XTAMD	FF	2.1 tok/s	169.2 GB
AMD Radeon RX 7800 XTAMD	FF	3.0 tok/s	169.2 GB
AMD Radeon RX 7900 XTAMD	FF	3.8 tok/s	169.2 GB
AMD Radeon RX 7900 XTXAMD	FF	4.6 tok/s	169.2 GB
AMD Radeon RX 9070AMD	FF	3.0 tok/s	169.2 GB
AMD Radeon RX 9070 XTAMD	FF	3.0 tok/s	169.2 GB
Apple M4Apple	FF	0.6 tok/s	169.2 GB
Apple M4 Max (40-core GPU)Apple	FF	2.6 tok/s	169.2 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Instinct MI300X (~25 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Ring-2.6-1T on AMD Instinct MI300X · ~25 tok/s · 750W	$0.991
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 169 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
AMD Instinct MI300XRunPod · Community · 192 GB VRAM	$0.50
AMD Instinct MI300XRunPod · Spot · 192 GB VRAM	$1.99
AMD Instinct MI300XRunPod · Secure · 192 GB VRAM	$2.19
NVIDIA B200RunPod · Spot · 192 GB VRAM	$5.49
NVIDIA B200RunPod · Secure · 192 GB VRAM	$5.89

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Ring-2.6-1T is a 1-trillion-parameter Mixture-of-Experts (MoE) reasoning model from inclusionAI, the open‑source arm of Ant Group. With 63 billion active parameters out of 1 trillion total, it delivers the reasoning depth of a massive model while keeping per‑token compute within the range of a moderately large dense model. This architecture targets the most demanding real‑world workloads: agentic workflows, complex tool‑use pipelines, long‑horizon task execution, and advanced coding.

The model is released under the MIT license, making it suitable for commercial deployment and modification. It competes directly with other large open‑source MoE models such as DeepSeek‑V3 (671B total, 37B active) and dense giants like Llama 3.1 405B. Where those models excel in general chat and instruction following, Ring‑2.6‑1T is purpose‑built for environments that require sustained reasoning, multi‑step planning, and reliable tool invocation.

Architecture & Technical Details

Ring‑2.6‑1T uses a standard MoE transformer layout. Of the 1 trillion total parameters, exactly 63 billion are activated for each forward pass. This sparsity decouples memory footprint from compute cost: you still need to load all expert weights into VRAM, but the inference FLOPs are comparable to a 63B dense model.

Context length: 262,144 tokens (extended from 128K via YaRN positional interpolation). This enables processing of entire codebases, long conversation histories, or multi‑turn agent traces without truncation.
Reasoning effort mechanism: Two modes – high and xhigh. high uses a fixed reasoning budget; xhigh dynamically allocates more tokens for deep chain‑of‑thought, useful for math, logic, and multi‑step tool orchestration. You control this via the chat template parameter reasoning_effort.
Training paradigm: Ant Group employed an asynchronous reinforcement learning scheme (Async RL) with the IcePop algorithm to stabilize long‑horizon training at the trillion‑parameter scale. This is why the model performs consistently across long agent sessions without degradation.

For local inference, the critical spec is total parameter count. At FP16, storing the full 1T weights requires about 2 TB of VRAM. With 4‑bit quantization (Q4_K_M), this drops to ~500 GB. Even at the lowest practical quantization, you are looking at multi‑GPU deployments—there is no single‑consumer‑GPU path for this model.

Capabilities & Use Cases

The model is optimized for four interconnected domains:

Agentic workflows – Ring‑2.6‑1T scores 87.60 on PinchBench (vs. GPT‑5.4 xHigh at ~84) and 63.82 on ClawEval, placing it among the top open‑source agents. It can decompose a user request into sub‑tasks, call tools (APIs, databases, file systems), handle errors, and revisit earlier steps.

Coding agents – Designed for multi‑file patches, repository‑level refactoring, and autonomous bug fixing. It understands diff formats, git history, and CI/CD triggers. On Tau2‑Bench (Telecom scenario) it scored 95.32, meaning it rarely fails in realistic software engineering loops.

Long‑horizon reasoning – Tasks that take minutes or hours—like scientific simulation planning, legal document analysis, or supply chain optimization—benefit from the 262K context and stable RL‑trained memory. The model maintains coherence across hundreds of tool calls or reasoning steps.

Function‑calling & instruction following – The chat template natively supports tool definitions in XML and multi‑step tool invocations. You can define a set of functions and the model will call them sequentially, re‑evaluating state after each return.

Running Ring-2.6-1T Locally

Hardware requirements

Quantization	Total VRAM required	Minimum GPU configuration	Recommended GPU configuration
FP16	~2000 GB	8× A100 80GB (NVLink)	16× A100 80GB or 8× H100 94GB
Q8	~1000 GB	8× A100 80GB	8× H100 94GB
Q4_K_M	~500 GB	4× A100 80GB	8× A100 80GB (for headroom)
Q2_K (experimental)	~250 GB	2× A100 80GB (tight)	4× A100 80GB

No consumer GPU (RTX 4090, M4 Max, etc.) can run this model even at Q2 because 250 GB exceeds single‑card memory. Use cases on consumer hardware are limited to CPU offloading (impractical at this scale) or cloud‑based inference.

Software setup

The quickest path for multi‑GPU systems is vLLM with tensor parallelism:

1docker run --gpus all -v /path/to/model:/model vllm/vllm \
2  --model /model/Ring-2.6-1T \
3  --tensor-parallel-size 8 \
4  --dtype bfloat16 \
5  --quantization fp8  # or use --quantization awq for 4-bit

For llama.cpp (with Q4_K_M):

1./llama-server -m Ring-2.6-1T-Q4_K_M.gguf --parallel 1 --ctx-size 262144 --ngl 99 --n-gpu-layers 80

Ollama support is not yet official as of mid‑2026, but you can create a custom Modelfile using the GGUF quantization.

Expected performance (8× A100 80GB, Q4_K_M)

Task type	Tokens/second (input)	Tokens/second (output)
Short prompts (1-2K tokens)	~15	~12
Long context (100K tokens)	~4	~3
Agent loops (streaming)	~8	~6

Faster on H100s or with FP8 quantization (if supported by your inference engine).

How It Compares

Ring‑2.6‑1T vs DeepSeek‑V3 (671B total, 37B active)

DeepSeek‑V3 has a smaller total model, so lower VRAM requirements (roughly half at the same quantization). For teams with 4‑ or 8‑GPU nodes, DeepSeek‑V3 is easier to deploy.
Ring‑2.6‑1T has more active parameters (63B vs 37B) and significantly longer context (262K vs 128K). It also includes built‑in reasoning effort modes, while DeepSeek‑V3 requires manual prompting.
On agent benchmarks (PinchBench, ClawEval), Ring‑2.6‑1T outperforms DeepSeek‑V3 by 5‑10%. For general chat and code completion, the two are comparable.

Ring‑2.6‑1T vs Llama 3.1 405B (dense)

Llama 3.1 405B requires ~800 GB at FP16, so it fits on 4× A100 80GB—a more accessible setup. Ring‑2.6‑1T requires significantly more VRAM.
Llama 3.1 405B is a dense model with no MoE active‑parameter savings, so per‑token compute is higher. Ring‑2.6‑1T’s MoE gives lower latency relative to parameter count.
For agent tasks and long‑context reasoning, Ring‑2.6‑1T is the clear winner. For straightforward instruction following and multilingual tasks, Llama 3.1 405B remains a strong option.

When to choose Ring‑2.6‑1T: You need a unified model for agentic automation, high‑quality code generation, and multimodal-free reasoning over very long contexts. You have access to a multi‑GPU server (8× A100 or better) and accept the hardware cost in exchange for top‑tier open‑source agent performance.

When to choose an alternative: Your hardware is limited to 2‑4 GPUs, your tasks are mostly short‑format Q&A, or you need to deploy on consumer GPUs today.

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

inclusionAI

Ring-2.6-1T

1000B paramsMoE262K ctx

View on Hugging Face Official Page

Our Take

Best for: Strongest at competition math (AIME 2026) in its size class

A situational 1000B-parameter MoE language model from inclusionAI. Pulls ahead on competition math (AIME 2026) (96/100), so reach for it when that's the dimension that matters.

Run this onAMD Instinct MI325XCheapest card in our directory with comfortable headroom (256 GB) for this model at Q4 (~169.2 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Reasoning

Function Calling

Instruction Following

Model Specifications

Parameters1000B

Active Params63B

ArchitectureMoE

Context Length262K tokens

ModalityText Only

ProviderinclusionAI

Download Size1.0 TB

Community

Monthly Downloads823

Likes103

Last Updated20 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

MITView Full License

Performance & Scoring

Benchmarks

85.7

74.0

18.3

95.8

AA Intelligence Index

30.6

42.4

28.8

44.6

64.3

MBA Open Score

37.9DD

Benchmark40%

53.8

Popularity25%

14.6

Efficiency20%

16.9

Versatility15%

62.5

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	155.9 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	169.2 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	175.5 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	183.0 GB	Excellent	Near-lossless quality with manageable size
Q8_0	198.8 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	258.6 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


AMD Instinct MI355XAMD	SS	38.1 tok/s	169.2 GB
AMD Instinct MI325XAMD	SS	28.6 tok/s	169.2 GB
NVIDIA B200 GPUNVIDIA	SS	38.1 tok/s	169.2 GB
ASUS ExpertCenter Pro ET900N G3ASUS	SS	33.8 tok/s	169.2 GB
Dell Pro Max with GB300Dell	SS	33.8 tok/s	169.2 GB
HP ZGX Fury AI StationHP	SS	33.8 tok/s	169.2 GB
MSI XpertStation WS300MSI	SS	33.8 tok/s	169.2 GB
SuperMicro Super AI StationSuperMicro	SS	33.8 tok/s	169.2 GB
Gigabyte W775-V10-L01Gigabyte	SS	33.8 tok/s	169.2 GB
Google TPU v7 (Ironwood)Google	AA	35.1 tok/s	169.2 GB
AMD Instinct MI300XAMD	AA	25.2 tok/s	169.2 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	BB	3.9 tok/s	169.2 GB
Apple Mac Studio (M3 Ultra, 2025)Apple	BB	3.9 tok/s	169.2 GB
Apple Mac Studio (M2 Ultra, 2023)Apple	BB	3.8 tok/s	169.2 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	FF	2.4 tok/s	169.2 GB
Acer Veriton GN100 AI MiniAcer	FF	1.3 tok/s	169.2 GB
AMD Radeon RX 7600 8GBAMD	FF	1.4 tok/s	169.2 GB
AMD Radeon RX 7700 XTAMD	FF	2.1 tok/s	169.2 GB
AMD Radeon RX 7800 XTAMD	FF	3.0 tok/s	169.2 GB
AMD Radeon RX 7900 XTAMD	FF	3.8 tok/s	169.2 GB
AMD Radeon RX 7900 XTXAMD	FF	4.6 tok/s	169.2 GB
AMD Radeon RX 9070AMD	FF	3.0 tok/s	169.2 GB
AMD Radeon RX 9070 XTAMD	FF	3.0 tok/s	169.2 GB
Apple M4Apple	FF	0.6 tok/s	169.2 GB
Apple M4 Max (40-core GPU)Apple	FF	2.6 tok/s	169.2 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Instinct MI300X (~25 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Ring-2.6-1T on AMD Instinct MI300X · ~25 tok/s · 750W	$0.991
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 169 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
AMD Instinct MI300XRunPod · Community · 192 GB VRAM	$0.50
AMD Instinct MI300XRunPod · Spot · 192 GB VRAM	$1.99
AMD Instinct MI300XRunPod · Secure · 192 GB VRAM	$2.19
NVIDIA B200RunPod · Spot · 192 GB VRAM	$5.49
NVIDIA B200RunPod · Secure · 192 GB VRAM	$5.89

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Architecture & Technical Details

Context length: 262,144 tokens (extended from 128K via YaRN positional interpolation). This enables processing of entire codebases, long conversation histories, or multi‑turn agent traces without truncation.
Reasoning effort mechanism: Two modes – high and xhigh. high uses a fixed reasoning budget; xhigh dynamically allocates more tokens for deep chain‑of‑thought, useful for math, logic, and multi‑step tool orchestration. You control this via the chat template parameter reasoning_effort.
Training paradigm: Ant Group employed an asynchronous reinforcement learning scheme (Async RL) with the IcePop algorithm to stabilize long‑horizon training at the trillion‑parameter scale. This is why the model performs consistently across long agent sessions without degradation.

Capabilities & Use Cases

The model is optimized for four interconnected domains:

Running Ring-2.6-1T Locally

Hardware requirements

Quantization	Total VRAM required	Minimum GPU configuration	Recommended GPU configuration
FP16	~2000 GB	8× A100 80GB (NVLink)	16× A100 80GB or 8× H100 94GB
Q8	~1000 GB	8× A100 80GB	8× H100 94GB
Q4_K_M	~500 GB	4× A100 80GB	8× A100 80GB (for headroom)
Q2_K (experimental)	~250 GB	2× A100 80GB (tight)	4× A100 80GB

Software setup

The quickest path for multi‑GPU systems is vLLM with tensor parallelism:

1docker run --gpus all -v /path/to/model:/model vllm/vllm \
2  --model /model/Ring-2.6-1T \
3  --tensor-parallel-size 8 \
4  --dtype bfloat16 \
5  --quantization fp8  # or use --quantization awq for 4-bit

For llama.cpp (with Q4_K_M):

1./llama-server -m Ring-2.6-1T-Q4_K_M.gguf --parallel 1 --ctx-size 262144 --ngl 99 --n-gpu-layers 80

Ollama support is not yet official as of mid‑2026, but you can create a custom Modelfile using the GGUF quantization.

Expected performance (8× A100 80GB, Q4_K_M)

Task type	Tokens/second (input)	Tokens/second (output)
Short prompts (1-2K tokens)	~15	~12
Long context (100K tokens)	~4	~3
Agent loops (streaming)	~8	~6

Faster on H100s or with FP8 quantization (if supported by your inference engine).

How It Compares

Ring‑2.6‑1T vs DeepSeek‑V3 (671B total, 37B active)

DeepSeek‑V3 has a smaller total model, so lower VRAM requirements (roughly half at the same quantization). For teams with 4‑ or 8‑GPU nodes, DeepSeek‑V3 is easier to deploy.
Ring‑2.6‑1T has more active parameters (63B vs 37B) and significantly longer context (262K vs 128K). It also includes built‑in reasoning effort modes, while DeepSeek‑V3 requires manual prompting.
On agent benchmarks (PinchBench, ClawEval), Ring‑2.6‑1T outperforms DeepSeek‑V3 by 5‑10%. For general chat and code completion, the two are comparable.

Ring‑2.6‑1T vs Llama 3.1 405B (dense)

Llama 3.1 405B requires ~800 GB at FP16, so it fits on 4× A100 80GB—a more accessible setup. Ring‑2.6‑1T requires significantly more VRAM.
Llama 3.1 405B is a dense model with no MoE active‑parameter savings, so per‑token compute is higher. Ring‑2.6‑1T’s MoE gives lower latency relative to parameter count.
For agent tasks and long‑context reasoning, Ring‑2.6‑1T is the clear winner. For straightforward instruction following and multilingual tasks, Llama 3.1 405B remains a strong option.

When to choose an alternative: Your hardware is limited to 2‑4 GPUs, your tasks are mostly short‑format Q&A, or you need to deploy on consumer GPUs today.

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.