
Alibaba's large-medium MoE model with 122B total / 10B active parameters. Leads on agentic benchmarks including BFCL-V4 and BrowseComp. Natively multimodal with 262K context.
Copy and paste this command to start running the model locally:

ollama run qwen3.5:122b
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | ~45 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | ~78 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | ~83 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | ~100 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | ~130 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | ~245 GB | Full | Full 16-bit floating point: maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Grade | Speed |
| :--- | :--- | :--- |
| NVIDIA A100 SXM4 80GB | SS | 60.2 tok/s |
| NVIDIA H100 SXM5 80GB | SS | 98.9 tok/s |
| Google Cloud TPU v5p | SS | 81.6 tok/s |
| NVIDIA H200 SXM 141GB | SS | 141.7 tok/s |
| NVIDIA B200 | SS | 236.1 tok/s |
| NVIDIA L40S | AA | 25.5 tok/s |
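The spread in these speeds tracks memory bandwidth more than raw compute. A rough roofline sketch gives a theoretical per-token ceiling; the bandwidth figures are the vendors' published specs, while the 4.84 bits/weight is a typical Q4_K_M average (an assumption), and real deployments land well below the ceiling:

```python
# Back-of-envelope decode ceiling: each generated token must read every
# active parameter from memory once, so speed is bounded by bandwidth.
# 4.84 bits/weight is a typical Q4_K_M average (assumption).

ACTIVE_PARAMS = 10e9                                  # "A10B": 10B active
BITS_PER_WEIGHT = 4.84
active_bytes = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8    # ~6.1 GB read per token

# Published HBM bandwidths in TB/s
for gpu, tb_s in {"A100": 2.04, "H100": 3.35, "H200": 4.8, "B200": 8.0}.items():
    ceiling = tb_s * 1e12 / active_bytes
    print(f"{gpu}: <= {ceiling:,.0f} tok/s theoretical ceiling")
```

The gap between this ceiling and measured throughput comes from attention, KV-cache reads, kernel overhead, and imperfect bandwidth utilization.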
Qwen3.5-122B-A10B is Alibaba Cloud’s flagship Mixture of Experts (MoE) model designed to bridge the gap between mid-sized dense models and massive frontier models. With 122 billion total parameters and 10 billion active parameters per token, it strikes a specific balance: the reasoning depth of a 100B+ parameter model with the inference speed of a much smaller 10B parameter architecture.
Released in early 2025, this model is specifically optimized for agentic workflows. It currently leads on key industry benchmarks including BFCL-V4 (Berkeley Function Calling Leaderboard) and BrowseComp, making it a primary choice for developers building autonomous agents that need to navigate the web or interact with complex APIs. Unlike many of its predecessors, Qwen3.5-122B-A10B is natively multimodal and features a massive 262,144-token context window, allowing for the ingestion of entire codebases or lengthy technical documents in a single prompt.
For practitioners looking to run Qwen3.5-122B-A10B locally, the primary hurdle is VRAM capacity rather than raw compute power. Because it is an MoE model, it requires enough memory to house all 122B parameters, but it executes with the efficiency of a much smaller model, resulting in high tokens-per-second throughput even on prosumer hardware.
The "A10B" in the model name signifies that only 10 billion parameters are activated during the forward pass for any given token. This MoE efficiency is what makes the model viable for local deployment. While a dense 122B model would be agonizingly slow on anything but a multi-H100 cluster, the MoE architecture allows the model to "route" queries to the most relevant specialized sub-networks (experts).
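The routing step can be sketched in a few lines. This is a generic top-k softmax router in the style common to MoE transformers, not Qwen's actual implementation; the expert count, gating details, and layer layout here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    logits = gate_w @ x                          # router score per expert
    top = np.argsort(logits)[-top_k:]            # indices of best-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # softmax over the chosen experts
    # Only top_k expert networks execute, so per-token compute tracks
    # the active parameter count, not the total.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# 8 toy "experts" (plain linear maps here; a real model uses gated FFNs)
experts = [lambda v, W=rng.standard_normal((4, 4)): W @ v for _ in range(8)]
gate_w = rng.standard_normal((8, 4))
out = moe_forward(rng.standard_normal(4), gate_w, experts)
print(out.shape)  # (4,)
```

Because the gate weights sum to 1, the output is a convex combination of the selected experts' outputs, which keeps activations well-scaled regardless of which experts fire.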
When evaluating Qwen3.5-122B-A10B hardware requirements, you must distinguish between memory capacity and memory bandwidth: capacity determines whether all 122B parameters fit in memory at all, while bandwidth governs generation speed, since only the ~10B active parameters are read for each token.
The 262k context length is supported by an advanced RoPE (Rotary Positional Embedding) scaling implementation, ensuring that the model maintains coherence even at the tail end of a massive prompt. Being natively multimodal, the model does not rely on a separate vision encoder "bolted on" to the LLM; the vision and text capabilities are interleaved, which improves performance in tasks like OCR, chart reasoning, and visual document understanding.
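A minimal sketch of rotary embeddings and interpolation-style position scaling follows. The exact long-context scheme Qwen uses is not specified here, so the scale factor and dimensions are purely illustrative:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate channel pairs of x (seq, dim) by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    ang = positions[:, None] * inv_freq[None, :]   # (seq, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(1).standard_normal((5, 8))
pos = np.arange(5, dtype=float)
scale = 8.0                        # hypothetical context-extension factor
q_scaled = rope(q, pos / scale)    # position interpolation: compress positions
```

Since RoPE is a pure rotation, it preserves vector norms; scaling schemes only remap the position axis so that trained rotation frequencies cover a longer sequence.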
Qwen3.5-122B-A10B is not a general-purpose "chatbot" in the traditional sense; it is a high-reasoning engine built for production-grade tasks.
On the Qwen3.5-122B-A10B reasoning benchmark suites, the model excels at multi-step problem solving. This makes it ideal for:

- Long-context document analysis and summarization
- Code generation and debugging across large codebases
- Planning loops in autonomous agent frameworks
This is the model's standout feature. Because it leads on BFCL-V4, it is highly reliable at:

- Selecting the correct tool from a large catalog of available functions
- Emitting well-formed, schema-conformant JSON arguments
- Chaining parallel and multi-step function calls without losing state
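As a concrete example, Ollama's `/api/chat` endpoint accepts OpenAI-style tool schemas. The `get_weather` tool below is hypothetical, and the response handling assumes a local Ollama server is running:

```python
import json
import urllib.request
import urllib.error

# OpenAI-style tool schema as accepted by Ollama's /api/chat endpoint.
# The "get_weather" tool is a hypothetical example.
payload = {
    "model": "qwen3.5:122b",
    "messages": [{"role": "user", "content": "What's the weather in Hangzhou?"}],
    "stream": False,
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=120) as resp:
        message = json.load(resp)["message"]
    # A tool-capable model returns structured calls instead of free text
    for call in message.get("tool_calls", []):
        print(call["function"]["name"], call["function"]["arguments"])
except urllib.error.URLError:
    print("Ollama is not reachable on localhost:11434")
```

Your application executes the requested function, appends its result as a `tool` role message, and calls the endpoint again so the model can compose the final answer.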
Alibaba has trained the Qwen series on a massive corpus of non-English data. It remains one of the best models for Chinese, Japanese, Korean, and various European languages, maintaining high nuance and cultural context that Western-centric models often miss.
To run Qwen3.5-122B-A10B locally, your primary concern is the 122B parameter count. Even though it's efficient to run, it is "heavy" to store.
The Qwen3.5-122B-A10B VRAM requirements vary wildly based on your choice of quantization.
| Precision | VRAM Required | Recommended Hardware |
| :--- | :--- | :--- |
| FP16 | ~245 GB | 4x A100 (80GB) or 8x RTX 6000 Ada |
| Q8_0 | ~130 GB | 2x Mac Studio M2/M3 Ultra (192GB) or 6x RTX 3090/4090 |
| Q4_K_M | ~78 GB | 1x Mac Studio (128GB+) or 4x RTX 3090/4090 (24GB) |
| IQ3_M | ~55 GB | 3x RTX 3090/4090 or Mac Studio (64GB+) |
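These figures can be sanity-checked with simple bits-per-weight arithmetic. The per-format averages below are typical llama.cpp values (assumptions), and real GGUF files carry a few percent of extra overhead for metadata and embeddings:

```python
# Weight size ~= total_params * bits_per_weight / 8.
# Bits-per-weight averages are typical llama.cpp figures (assumptions).

BITS_PER_WEIGHT = {"IQ3_M": 3.66, "Q4_K_M": 4.84, "Q8_0": 8.5, "FP16": 16.0}

def weight_gb(total_params: float, quant: str) -> float:
    """Approximate weight footprint in GB for a given quantization."""
    return total_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{weight_gb(122e9, quant):6.1f} GB")
```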
The best quantization for Qwen3.5-122B-A10B for most practitioners is Q4_K_M. It incurs a nearly indistinguishable perplexity loss relative to FP16 while fitting into the memory footprint of common multi-GPU or high-memory Mac setups.
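Beyond the weights themselves, long-context work adds KV-cache memory, which scales linearly with sequence length. A sketch with hypothetical layer and head counts (the real Qwen3.5-122B-A10B config is not given here, so these are placeholders):

```python
# KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_elt.
# The layer/head counts below are hypothetical placeholders, not the
# actual Qwen3.5-122B-A10B configuration.

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """KV-cache size in GB; bytes_per_elt=2 assumes an FP16 cache."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elt / 1e9

# e.g. 60 layers, 8 KV heads (grouped-query attention), head_dim 128,
# the full 262,144-token context:
print(f"{kv_cache_gb(60, 8, 128, 262144):.1f} GB")
```

This is why grouped-query attention and quantized (e.g. 8-bit) KV caches matter so much at 262K context: without them, the cache can rival the quantized weights in size.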
For GPU rigs, llama.cpp or vLLM is the gold standard. A 4x RTX 3090/4090 setup provides 96 GB of VRAM, leaving enough room for the 78 GB Q4_K_M weights plus a sizeable KV cache for long-context tasks.

The fastest way to test this model is via Ollama. Once you have the necessary RAM/VRAM, you can run:
ollama run qwen3.5:122b
This will default to a quantized version suitable for your hardware.
When choosing a local model in the ~122B-parameter class in 2025, you are likely weighing a few specific competitors.
Llama 3.3 70B is a dense model. While it has fewer total parameters, every token activates all 70B of them, which demands significantly more compute per token than Qwen's 10B active parameters.
DeepSeek-V3 is a larger MoE (671B total / 37B active). It offers a higher capability ceiling, but its total parameter count pushes even aggressive quantizations beyond typical workstation memory.
Qwen3.5-122B-A10B represents the current ceiling for what a high-end consumer workstation can reasonably handle while maintaining frontier-level performance in coding and tool use. If your workflow involves long-context document analysis or autonomous agents, and you have the VRAM to support it, this is currently the most capable model in the 100B-150B MoE class.