
A highly efficient open-source MoE model activating only 3 billion parameters per pass, rivaling frontier models for local agentic deployment.
Copy and paste this command to start running the model locally:

```bash
ollama run qwen3.6
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 7.9 GB | Low |
| Q4_K_M (Recommended) | 8.5 GB | Good |
| Q5_K_M | 8.8 GB | Very Good |
| Q6_K | 9.2 GB | Excellent |
| Q8_0 | 9.9 GB | Near Perfect |
| FP16 | 12.8 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Quality | Speed | VRAM Used |
|---|---|---|---|---|
| Google Cloud TPU v5e | Google | SS | 77.3 tok/s | 8.5 GB |
| Intel Arc A770 16GB | Intel | SS | 52.8 tok/s | 8.5 GB |
| Intel Arc B580 | Intel | SS | 43.0 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 4070 | NVIDIA | SS | 47.6 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 5070 | NVIDIA | SS | 63.4 tok/s | 8.5 GB |
| NVIDIA L40S | NVIDIA | SS | 81.5 tok/s | 8.5 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 192.4 tok/s | 8.5 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 316.1 tok/s | 8.5 GB |
| Google Cloud TPU v5p | Google | SS | 260.9 tok/s | 8.5 GB |
Qwen3.6 35B-A3B is Alibaba Cloud’s latest advancement in the sparse Mixture-of-Experts (MoE) category. By utilizing a 35B total parameter architecture while activating only 3B parameters per forward pass, this model delivers the reasoning depth of a mid-sized model with the inference speed and efficiency typically associated with much smaller 7B-class models. Released under the Apache 2.0 license, it is specifically optimized for local deployment in agentic workflows, long-context repository analysis, and multimodal visual reasoning.
While previous iterations focused on general chat, Qwen3.6 35B-A3B is positioned as a primary engine for local AI agents. It introduces "thinking preservation," allowing the model to retain internal reasoning chains across historical messages. This makes it a formidable competitor to models like Gemma 2 27B and Llama 3.1 8B, offering a superior "intelligence-per-watt" ratio for users running models on edge hardware.
The defining characteristic of Qwen3.6 35B-A3B is its sparse MoE architecture. Unlike dense models where every parameter is calculated for every token, this model routes inputs to a subset of specialists.
The Qwen3.6 35B-A3B MoE efficiency comes from its ability to maintain a massive knowledge base (35B parameters) while requiring the FLOPs of a 3B model during generation. This translates to significantly higher tokens per second (TPS) compared to dense 30B+ models. Furthermore, the model features a 256K context length, which is extensible up to 1 million tokens, making it suitable for analyzing large codebases or long documents without the performance degradation often seen in smaller-context models.
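To make the routing concrete, here is a minimal, illustrative sketch of top-k expert routing in plain NumPy. The dimensions, router weights, and expert count are toy values for demonstration, not Qwen's actual configuration:

```python
import numpy as np

# Illustrative top-k MoE routing sketch (toy values, not Qwen's actual design).
# A router scores every expert per token, but only the top-k experts run,
# so compute scales with the active parameters rather than the full pool.

def moe_layer(x, experts, router_w, k=2):
    """x: (d,) token vector; experts: list of (d, d) weight matrices; router_w: (d, n_experts)."""
    logits = x @ router_w                       # router score for each expert
    top = np.argsort(logits)[-k:]               # indices of the k highest-scoring experts
    gates = np.exp(logits[top])                 # softmax over the selected experts only
    gates /= gates.sum()
    # Only the chosen experts are evaluated; the rest stay idle (the "sparse" part).
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 64, 8
x = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))
print(moe_layer(x, experts, router_w).shape)    # (64,)
```

Because only k of the n experts multiply against each token, the layer's FLOPs track the active parameters (3B here) while the weights for all experts (35B) must still sit in memory.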
This model is a multimodal powerhouse designed for "agentic" tasks—workflows where the AI must plan, use tools, and reason through multi-step problems.
Qwen3.6 35B-A3B excels in agentic coding, specifically handling frontend workflows and repository-level reasoning. It is designed to work natively with tools like OpenClaw, Claude Code, and Qwen Studio. Its function-calling capabilities are robust, allowing it to interface with local file systems and APIs to execute complex developer tasks.
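As a sketch of how such function calls might be wired up locally, the snippet below goes through Ollama's OpenAI-compatible endpoint. The `read_file` tool schema and the model tag are hypothetical stand-ins, not a documented Qwen interface:

```python
from openai import OpenAI

# Hedged sketch: tool calling via Ollama's OpenAI-compatible endpoint.
# The read_file tool and model tag are illustrative assumptions.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical local-filesystem tool
        "description": "Read a text file from the local repository.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.6:35b",
    messages=[{"role": "user", "content": "Summarize src/router.py"}],
    tools=tools,
)
# If the model decides to call the tool, the arguments arrive as JSON text.
print(resp.choices[0].message.tool_calls)
```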
As a text-and-vision model, it performs exceptionally well in visual perception. With a RefCOCO score of 92.0, it can accurately identify and locate objects within images, making it useful for UI/UX automation, document parsing, and spatial reasoning tasks.
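A hedged sketch of a visual-grounding request is shown below, using Ollama's REST API, which accepts base64-encoded images. The filename and prompt are illustrative, and it assumes the local build ships the model's vision tower:

```python
import base64
import requests

# Illustrative vision request against a local Ollama server; assumes the
# pulled model variant includes the vision tower.
with open("screenshot.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.6:35b",
        "prompt": "Locate the 'Submit' button and describe its position.",
        "images": [img],     # Ollama accepts a list of base64-encoded images
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```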
The model supports an internal chain-of-thought (thinking) mechanism. In iterative development, the model can preserve the context of its reasoning from previous turns, reducing the "hallucination" rate when debugging code or solving complex logic puzzles.
To run Qwen3.6 35B-A3B locally, your primary constraint is VRAM. Because it is an MoE model, you must fit the total parameter count (35B) into memory, even though only 3B are active during compute.
To run this model comfortably, you should target a 24GB VRAM buffer for quantized versions or 64GB+ of unified memory for FP16/BF16.
For most practitioners, Q4_K_M GGUF is the recommended quantization. It offers negligible perplexity loss compared to FP16 while reducing the memory footprint by over 50%. If you prioritize speed and have limited VRAM, IQ4_XS or Q3_K_L can be used, though reasoning accuracy in complex coding tasks may decline slightly.
On an RTX 4090 using Ollama or vLLM, expect:
The quickest way to deploy is via Ollama:
```bash
ollama run qwen3.6:35b
```
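Once the model is running, you can script it; here is a minimal non-streaming chat request against Ollama's local REST API (default port 11434):

```python
import requests

# Minimal sketch: one non-streaming chat turn against a local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3.6:35b",
        "messages": [{"role": "user", "content": "Explain MoE routing in two sentences."}],
        "stream": False,  # return the whole reply as a single JSON object
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```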
When evaluating Qwen3.6 35B-A3B against its competitors, the primary trade-off is memory footprint versus inference speed.
Qwen3.6 35B-A3B represents the current "Goldilocks" zone for local AI: it is large enough to handle professional-grade coding and reasoning, yet efficient enough to run on a single flagship consumer GPU.