Alibaba

Qwen3.5-35B-A3B

A compact MoE model with only 3B active parameters that outperforms the much larger Qwen3-235B-A22B across most benchmarks. Uses hybrid Gated DeltaNet + MoE architecture. Runs on 8GB+ VRAM GPUs.

35B paramsMoE262K ctxMultimodal

View on Hugging Face

Run with Ollama Source Code Official Page

Our Take

Best for: Strongest at competition math (AIME 2026) in its size class

A strong 35B-parameter MoE language model from Alibaba. Pulls ahead on competition math (AIME 2026) (93/100), so reach for it when that's the dimension that matters.

Run this onAMD Radeon RX 7700 XTCheapest card in our directory with comfortable headroom (12 GB) for this model at Q4 (~8.5 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Vision

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters35B

Active Params3B

ArchitectureMoE

Context Length262K tokens

ModalityMultimodal

Training Cutoff2025

ProviderAlibaba

Download Size71.9 GB

Community

Monthly Downloads3.1M

Likes1.4K

Last Updated28 days ago

Quick Start

Run with Ollama

Copy and paste this command to start running the model locally.

ollama run qwen3.5:35b

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Apache 2.0View Full License

Performance & Scoring

Benchmarks

84.2

69.2

22.4

93.3

40.5

81.8

84.6

Overall Score

75.7AA

Benchmark40%

68.0

Popularity25%

72.5

Efficiency20%

85.2

Versatility15%

89.0

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	7.9 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	8.5 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	8.8 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	9.2 GB	Excellent	Near-lossless quality with manageable size
Q8_0	9.9 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	12.8 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	48.3 tok/s	8.5 GB
AMD Radeon RX 7700 XTAMD	SS	40.8 tok/s	8.5 GB
AMD Radeon RX 7800 XTAMD	SS	58.9 tok/s	8.5 GB
AMD Radeon RX 9070AMD	SS	60.4 tok/s	8.5 GB
AMD Radeon RX 9070 XTAMD	SS	60.4 tok/s	8.5 GB
Google Cloud TPU v5eGoogle	SS	77.3 tok/s	8.5 GB
Intel Arc A770 16GBIntel	SS	52.8 tok/s	8.5 GB
Intel Arc B580Intel	SS	43.0 tok/s	8.5 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	SS	90.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070NVIDIA	SS	47.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	SS	47.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	SS	63.4 tok/s	8.5 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	SS	69.4 tok/s	8.5 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	SS	42.3 tok/s	8.5 GB
NVIDIA GeForce RTX 5070NVIDIA	SS	63.4 tok/s	8.5 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	SS	84.5 tok/s	8.5 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	SS	90.6 tok/s	8.5 GB
AMD Radeon RX 7900 XTAMD	SS	75.5 tok/s	8.5 GB
AMD Radeon RX 7900 XTXAMD	SS	90.6 tok/s	8.5 GB
NVIDIA GeForce RTX 3090NVIDIA	SS	88.3 tok/s	8.5 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	SS	95.1 tok/s	8.5 GB
Google Cloud TPU v6e (Trillium)Google	SS	154.7 tok/s	8.5 GB
NVIDIA GeForce RTX 5090 Founders EditionNVIDIA	SS	169.1 tok/s	8.5 GB
Origin PC M-CLASS v2Origin PC	SS	169.1 tok/s	8.5 GB
NVIDIA L40SNVIDIA	SS	81.5 tok/s	8.5 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~27 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Qwen3.5-35B-A3B on AMD Radeon RX 7600 8GB · ~27 tok/s · 165W	$0.202
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 9 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM	$0.04
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM	$0.05
NVIDIA GeForce RTX 3090Vast.ai · Spot · 24 GB VRAM	$0.05
NVIDIA GeForce RTX 3090Vast.ai · On-Demand · 24 GB VRAM	$0.07
NVIDIA GeForce RTX 5070Vast.ai · Spot · 12 GB VRAM	$0.08

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Qwen3.5-35B-A3B represents a significant shift in the efficiency-to-performance ratio for local LLM deployment. Developed by Alibaba Cloud (Qwen) and released in early 2025, this model utilizes a sophisticated Mixture of Experts (MoE) architecture combined with a hybrid Gated DeltaNet framework. Despite having a total parameter count of 35 billion, it only activates 3 billion parameters during any single inference step. This design allows it to deliver reasoning capabilities that exceed much larger dense models while maintaining the inference speed of a small-scale model.

For practitioners running models locally, Qwen3.5-35B-A3B is positioned as a high-utility "daily driver." It effectively bridges the gap between the lightweight 7B-8B parameter models, which often struggle with complex reasoning, and the massive 70B+ models that require dual-GPU setups or significant VRAM offloading. Because it outperforms the much larger Qwen3-235B-A22B across most standard benchmarks, it is currently the most efficient local AI model 35B parameters 2025 has to offer for those prioritizing performance per watt and tokens per second.

Architecture & Technical Details

The core of the Qwen3.5-35B-A3B architecture is a hybrid Gated DeltaNet + MoE design. Unlike standard dense models where every parameter is calculated for every token, this MoE setup routes inputs to specific "experts." With only 3B active parameters out of 35B total, the model achieves a massive reduction in compute requirements without sacrificing the "knowledge base" stored in its total weights.

Active vs. Total Parameters

In local inference, it is critical to distinguish between VRAM capacity and compute overhead. To run Qwen3.5-35B-A3B locally, your GPU must have enough VRAM to hold the 35B total parameters (determined by your quantization level). However, because only 3B parameters are active, the Qwen3.5-35B-A3B tokens per second will be significantly higher than a dense 35B model, behaving more like a 3B-7B model once the weights are loaded into memory.

Context and Modality

The model features a massive context length of 262,144 tokens. This makes it a premier choice for local RAG (Retrieval-Augmented Generation) applications and long-document analysis. Furthermore, it is natively multimodal. It handles vision-language tasks with high precision, allowing for visual reasoning, OCR, and document understanding alongside standard text-based instruction following. Licensed under Apache 2.0, it remains one of the most permissive high-performance models available for commercial and private local use.

Capabilities & Use Cases

Qwen3.5-35B-A3B is a generalist model with specific optimizations for technical workloads. Its 2025 training cutoff ensures it has a modern understanding of software libraries and current events.

Local Coding and Development

For developers, Qwen3.5-35B-A3B for coding is a standout use case. It excels at:

Complex Refactoring: Handling large codebases within its 262k context window.
Polyglot Programming: High proficiency in Python, Rust, C++, and TypeScript, as well as specialized knowledge of Alibaba Cloud's ecosystem and Go.
Function Calling: It is highly reliable for agentic workflows where the model must output structured JSON to interact with local tools or APIs.

Vision and Document Intelligence

The multimodal capabilities are not an afterthought. Practitioners can use this model for:

Automated Data Entry: Extracting structured data from complex forms and receipts via vision.
Chart-to-Code: Converting screenshots of UI or data visualizations into functional code or Markdown tables.
Visual Reasoning: Identifying components in architectural diagrams or circuit boards.

Multilingual Reasoning and Math

Qwen models have historically dominated multilingual benchmarks. Qwen3.5-35B-A3B continues this trend, showing high performance in non-English languages, particularly across CJK (Chinese, Japanese, Korean) and European languages. On the Qwen3.5-35B-A3B reasoning benchmark, it consistently scores in the top tier for GSM8K and MATH, making it suitable for local tutoring or complex logical verification tasks.

Running Qwen3.5-35B-A3B Locally

The primary bottleneck for this model is VRAM capacity, not compute power. Because of the MoE architecture, even mid-range GPUs can achieve high throughput if the model fits in memory.

Qwen3.5-35B-A3B VRAM Requirements

To calculate your Qwen3.5-35B-A3B hardware requirements, use the following general guidelines for VRAM:

Q4_K_M (Recommended): ~20-22 GB VRAM. This is the "sweet spot" for quality and performance.
Q8_0: ~36-38 GB VRAM. Requires a multi-GPU setup (e.g., 2x RTX 3090/4090) or a Mac with high Unified Memory.
IQ2_XS / Q2_K: ~10-12 GB VRAM. This allows the model to run on 12GB cards like the RTX 3060 or 4070, though with noticeable degradation in reasoning logic.

Best Hardware for Qwen3.5-35B-A3B

NVIDIA RTX 4090 (24GB): The best GPU for Qwen3.5-35B-A3B. It allows for 4-bit quantization (Q4_K_M) with enough headroom for a decent context buffer.
Apple M3/M4 Max (64GB+ Unified Memory): Ideal for running the model at higher bitrates (Q6 or Q8) or utilizing the full 262k context window, which consumes significant memory as the KV cache grows.
NVIDIA RTX 4060 Ti (16GB): Can run the model at 3-bit quantization (Q3_K_S) or 2-bit, though performance is better suited for 8GB+ VRAM GPUs if you utilize GGUF offloading to system RAM (at the cost of speed).

Getting Started with Ollama

The fastest way to run Qwen3.5-35B-A3B locally is via Ollama. Once installed, you can pull the model directly:

ollama run qwen3.5:35b

By default, Ollama will likely serve a 4-bit quantized version, which provides a balanced experience for users with 24GB of VRAM.

How It Compares

When evaluating Qwen3.5-35B-A3B, it is most frequently compared to Mistral's Mixtral 8x7B and Llama 3.1 70B.

Qwen3.5-35B-A3B vs. Mixtral 8x7B

While Mixtral 8x7B was the pioneer of open MoE models, Qwen3.5-35B-A3B represents a significant generational leap. Qwen offers:

Better Reasoning: Higher scores on math and coding benchmarks.
Larger Context: 262k vs. Mixtral’s 32k.
Multimodality: Native vision support which Mixtral 8x7B lacks.

Qwen3.5-35B-A3B vs. Llama 3.1 70B

The comparison with Llama 3.1 70B is a question of "efficiency vs. raw power." Llama 3.1 70B is a dense model, meaning it is significantly slower and requires 40GB+ of VRAM for 4-bit quantization. In contrast, Qwen3.5-35B-A3B MoE efficiency allows it to match or beat Llama 70B in coding and multilingual tasks while running at nearly triple the tokens per second on the same hardware. However, for pure English-language creative writing or extremely nuanced prose, Llama 70B may still hold a slight edge in "vibe" and stylistic consistency.

Qwen3.5-35B-A3B vs. Qwen3-235B-A22B

Surprisingly, the 35B-A3B variant often outperforms its 235B predecessor. This is due to the refined training data and the Gated DeltaNet architecture introduced in the 3.5 series. For local practitioners, the 35B-A3B is the clear winner, as the 235B model requires enterprise-grade hardware (A100/H100 clusters) to run effectively, whereas the 35B model is perfectly at home on a single high-end consumer workstation.

Related Models

Alibaba

Explore the Provider

See all Alibaba models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Alibaba model we track.

Open Alibaba

Explore the Family

See every Qwen release

The full Qwen family leaderboard with sizes, benchmark scores, and a release timeline.

Open Qwen

Find the Best Hardware for This Model

Use our hardware calculator to find the optimal device for running this model.

Alibaba

Qwen3.5-35B-A3B

A compact MoE model with only 3B active parameters that outperforms the much larger Qwen3-235B-A22B across most benchmarks. Uses hybrid Gated DeltaNet + MoE architecture. Runs on 8GB+ VRAM GPUs.

35B paramsMoE262K ctxMultimodal

View on Hugging Face

Run with Ollama Source Code Official Page

Our Take

Best for: Strongest at competition math (AIME 2026) in its size class

A strong 35B-parameter MoE language model from Alibaba. Pulls ahead on competition math (AIME 2026) (93/100), so reach for it when that's the dimension that matters.

Run this onAMD Radeon RX 7700 XTCheapest card in our directory with comfortable headroom (12 GB) for this model at Q4 (~8.5 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Vision

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters35B

Active Params3B

ArchitectureMoE

Context Length262K tokens

ModalityMultimodal

Training Cutoff2025

ProviderAlibaba

Download Size71.9 GB

Community

Monthly Downloads3.1M

Likes1.4K

Last Updated28 days ago

Quick Start

Run with Ollama

Copy and paste this command to start running the model locally.

ollama run qwen3.5:35b

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Apache 2.0View Full License

Performance & Scoring

Benchmarks

84.2

69.2

22.4

93.3

40.5

81.8

84.6

Overall Score

75.7AA

Benchmark40%

68.0

Popularity25%

72.5

Efficiency20%

85.2

Versatility15%

89.0

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	7.9 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	8.5 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	8.8 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	9.2 GB	Excellent	Near-lossless quality with manageable size
Q8_0	9.9 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	12.8 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	48.3 tok/s	8.5 GB
AMD Radeon RX 7700 XTAMD	SS	40.8 tok/s	8.5 GB
AMD Radeon RX 7800 XTAMD	SS	58.9 tok/s	8.5 GB
AMD Radeon RX 9070AMD	SS	60.4 tok/s	8.5 GB
AMD Radeon RX 9070 XTAMD	SS	60.4 tok/s	8.5 GB
Google Cloud TPU v5eGoogle	SS	77.3 tok/s	8.5 GB
Intel Arc A770 16GBIntel	SS	52.8 tok/s	8.5 GB
Intel Arc B580Intel	SS	43.0 tok/s	8.5 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	SS	90.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070NVIDIA	SS	47.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	SS	47.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	SS	63.4 tok/s	8.5 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	SS	69.4 tok/s	8.5 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	SS	42.3 tok/s	8.5 GB
NVIDIA GeForce RTX 5070NVIDIA	SS	63.4 tok/s	8.5 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	SS	84.5 tok/s	8.5 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	SS	90.6 tok/s	8.5 GB
AMD Radeon RX 7900 XTAMD	SS	75.5 tok/s	8.5 GB
AMD Radeon RX 7900 XTXAMD	SS	90.6 tok/s	8.5 GB
NVIDIA GeForce RTX 3090NVIDIA	SS	88.3 tok/s	8.5 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	SS	95.1 tok/s	8.5 GB
Google Cloud TPU v6e (Trillium)Google	SS	154.7 tok/s	8.5 GB
NVIDIA GeForce RTX 5090 Founders EditionNVIDIA	SS	169.1 tok/s	8.5 GB
Origin PC M-CLASS v2Origin PC	SS	169.1 tok/s	8.5 GB
NVIDIA L40SNVIDIA	SS	81.5 tok/s	8.5 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~27 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Qwen3.5-35B-A3B on AMD Radeon RX 7600 8GB · ~27 tok/s · 165W	$0.202
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 9 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM	$0.04
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM	$0.05
NVIDIA GeForce RTX 3090Vast.ai · Spot · 24 GB VRAM	$0.05
NVIDIA GeForce RTX 3090Vast.ai · On-Demand · 24 GB VRAM	$0.07
NVIDIA GeForce RTX 5070Vast.ai · Spot · 12 GB VRAM	$0.08

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Architecture & Technical Details

Active vs. Total Parameters

Context and Modality

Capabilities & Use Cases

Qwen3.5-35B-A3B is a generalist model with specific optimizations for technical workloads. Its 2025 training cutoff ensures it has a modern understanding of software libraries and current events.

Local Coding and Development

For developers, Qwen3.5-35B-A3B for coding is a standout use case. It excels at:

Complex Refactoring: Handling large codebases within its 262k context window.
Polyglot Programming: High proficiency in Python, Rust, C++, and TypeScript, as well as specialized knowledge of Alibaba Cloud's ecosystem and Go.
Function Calling: It is highly reliable for agentic workflows where the model must output structured JSON to interact with local tools or APIs.

Vision and Document Intelligence

The multimodal capabilities are not an afterthought. Practitioners can use this model for:

Automated Data Entry: Extracting structured data from complex forms and receipts via vision.
Chart-to-Code: Converting screenshots of UI or data visualizations into functional code or Markdown tables.
Visual Reasoning: Identifying components in architectural diagrams or circuit boards.

Multilingual Reasoning and Math

Running Qwen3.5-35B-A3B Locally

The primary bottleneck for this model is VRAM capacity, not compute power. Because of the MoE architecture, even mid-range GPUs can achieve high throughput if the model fits in memory.

Qwen3.5-35B-A3B VRAM Requirements

To calculate your Qwen3.5-35B-A3B hardware requirements, use the following general guidelines for VRAM:

Q4_K_M (Recommended): ~20-22 GB VRAM. This is the "sweet spot" for quality and performance.
Q8_0: ~36-38 GB VRAM. Requires a multi-GPU setup (e.g., 2x RTX 3090/4090) or a Mac with high Unified Memory.
IQ2_XS / Q2_K: ~10-12 GB VRAM. This allows the model to run on 12GB cards like the RTX 3060 or 4070, though with noticeable degradation in reasoning logic.

Best Hardware for Qwen3.5-35B-A3B

NVIDIA RTX 4090 (24GB): The best GPU for Qwen3.5-35B-A3B. It allows for 4-bit quantization (Q4_K_M) with enough headroom for a decent context buffer.
Apple M3/M4 Max (64GB+ Unified Memory): Ideal for running the model at higher bitrates (Q6 or Q8) or utilizing the full 262k context window, which consumes significant memory as the KV cache grows.
NVIDIA RTX 4060 Ti (16GB): Can run the model at 3-bit quantization (Q3_K_S) or 2-bit, though performance is better suited for 8GB+ VRAM GPUs if you utilize GGUF offloading to system RAM (at the cost of speed).

Getting Started with Ollama

The fastest way to run Qwen3.5-35B-A3B locally is via Ollama. Once installed, you can pull the model directly:

ollama run qwen3.5:35b

By default, Ollama will likely serve a 4-bit quantized version, which provides a balanced experience for users with 24GB of VRAM.

How It Compares

When evaluating Qwen3.5-35B-A3B, it is most frequently compared to Mistral's Mixtral 8x7B and Llama 3.1 70B.

Qwen3.5-35B-A3B vs. Mixtral 8x7B

While Mixtral 8x7B was the pioneer of open MoE models, Qwen3.5-35B-A3B represents a significant generational leap. Qwen offers:

Better Reasoning: Higher scores on math and coding benchmarks.
Larger Context: 262k vs. Mixtral’s 32k.
Multimodality: Native vision support which Mixtral 8x7B lacks.

Qwen3.5-35B-A3B vs. Llama 3.1 70B

Qwen3.5-35B-A3B vs. Qwen3-235B-A22B

Related Models

Alibaba

Qwen 3.5 Omni

397BMoE

Alibaba

Qwen3.5-397B-A17B

397BMoE

Alibaba

Qwen3-235B-A22B

235BMoE

Alibaba

Qwen3.5-122B-A10B

122BMoE

Explore the Provider

See all Alibaba models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Alibaba model we track.

Open Alibaba

Explore the Family

See every Qwen release

The full Qwen family leaderboard with sizes, benchmark scores, and a release timeline.

Open Qwen

Find the Best Hardware for This Model

Use our hardware calculator to find the optimal device for running this model.