
A highly efficient open-source MoE model activating only 3 billion parameters per pass, rivaling frontier models for local agentic deployment.
Copy and paste this command to start running the model locally:

```bash
ollama run qwen3.6
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 7.9 GB | Low |
| Q4_K_M (Recommended) | 8.5 GB | Good |
| Q5_K_M | 8.8 GB | Very Good |
| Q6_K | 9.2 GB | Excellent |
| Q8_0 | 9.9 GB | Near Perfect |
| FP16 | 12.8 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Quality | Speed | VRAM Used |
|---|---|---|---|---|
| Google Cloud TPU v5e | Google | SS | 77.3 tok/s | 8.5 GB |
| Intel Arc A770 16GB | Intel | SS | 52.8 tok/s | 8.5 GB |
| Intel Arc B580 | Intel | SS | 43.0 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 4070 | NVIDIA | SS | 47.6 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 5070 | NVIDIA | SS | 63.4 tok/s | 8.5 GB |
| NVIDIA L40S | NVIDIA | SS | 81.5 tok/s | 8.5 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 192.4 tok/s | 8.5 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 316.1 tok/s | 8.5 GB |
| Google Cloud TPU v5p | Google | SS | 260.9 tok/s | 8.5 GB |
Qwen3.6 35B-A3B is Alibaba Cloud’s latest advancement in the sparse Mixture-of-Experts (MoE) category. By utilizing a 35B total parameter architecture while activating only 3B parameters per forward pass, this model delivers the reasoning depth of a mid-sized model with the inference speed and efficiency typically associated with much smaller 7B-class models. Released under the Apache 2.0 license, it is specifically optimized for local deployment in agentic workflows, long-context repository analysis, and multimodal visual reasoning.
While previous iterations focused on general chat, Qwen3.6 35B-A3B is positioned as a primary engine for local AI agents. It introduces "thinking preservation," allowing the model to retain internal reasoning chains across historical messages. This makes it a formidable competitor to models like Gemma 2 27B and Llama 3.1 8B, offering a superior "intelligence-per-watt" ratio for users running models on edge hardware.
The defining characteristic of Qwen3.6 35B-A3B is its sparse MoE architecture. Unlike dense models where every parameter is calculated for every token, this model routes inputs to a subset of specialists.
The Qwen3.6 35B-A3B MoE efficiency comes from its ability to maintain a massive knowledge base (35B parameters) while requiring the FLOPs of a 3B model during generation. This translates to significantly higher tokens per second (TPS) compared to dense 30B+ models. Furthermore, the model features a 256K context length, which is extensible up to 1 million tokens, making it suitable for analyzing large codebases or long documents without the performance degradation often seen in smaller-context models.
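To make the routing concrete, here is a minimal, illustrative sketch of top-k expert routing in plain NumPy. The dimensions, router weights, and expert count are toy values for demonstration, not Qwen's actual configuration:

```python
import numpy as np

# Illustrative top-k MoE routing sketch (toy values, not Qwen's actual design).
# A router scores every expert per token, but only the top-k experts run,
# so compute scales with the active parameters rather than the full pool.

def moe_layer(x, experts, router_w, k=2):
    """x: (d,) token vector; experts: list of (d, d) weight matrices; router_w: (d, n_experts)."""
    logits = x @ router_w                       # router score for each expert
    top = np.argsort(logits)[-k:]               # indices of the k highest-scoring experts
    gates = np.exp(logits[top])                 # softmax over the selected experts only
    gates /= gates.sum()
    # Only the chosen experts are evaluated; the rest stay idle (the "sparse" part).
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 64, 8
x = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))
print(moe_layer(x, experts, router_w).shape)    # (64,)
```

Because only k of the n experts multiply against each token, the layer's FLOPs track the active parameters (3B here) while the weights for all experts (35B) must still sit in memory.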
This model is a multimodal powerhouse designed for "agentic" tasks—workflows where the AI must plan, use tools, and reason through multi-step problems.
Qwen3.6 35B-A3B excels in agentic coding, specifically handling frontend workflows and repository-level reasoning. It is designed to work natively with tools like OpenClaw, Claude Code, and Qwen Studio. Its function-calling capabilities are robust, allowing it to interface with local file systems and APIs to execute complex developer tasks.
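As a sketch of how such function calls might be wired up locally, the snippet below goes through Ollama's OpenAI-compatible endpoint. The `read_file` tool schema and the model tag are hypothetical stand-ins, not a documented Qwen interface:

```python
from openai import OpenAI

# Hedged sketch: tool calling via Ollama's OpenAI-compatible endpoint.
# The read_file tool and model tag are illustrative assumptions.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical local-filesystem tool
        "description": "Read a text file from the local repository.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.6:35b",
    messages=[{"role": "user", "content": "Summarize src/router.py"}],
    tools=tools,
)
# If the model decides to call the tool, the arguments arrive as JSON text.
print(resp.choices[0].message.tool_calls)
```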
As a text-and-vision model, it performs exceptionally well in visual perception. With a RefCOCO score of 92.0, it can accurately identify and locate objects within images, making it useful for UI/UX automation, document parsing, and spatial reasoning tasks.
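A hedged sketch of a visual-grounding request is shown below, using Ollama's REST API, which accepts base64-encoded images. The filename and prompt are illustrative, and it assumes the local build ships the model's vision tower:

```python
import base64
import requests

# Illustrative vision request against a local Ollama server; assumes the
# pulled model variant includes the vision tower.
with open("screenshot.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.6:35b",
        "prompt": "Locate the 'Submit' button and describe its position.",
        "images": [img],     # Ollama accepts a list of base64-encoded images
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```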
The model supports an internal chain-of-thought (thinking) mechanism. In iterative development, the model can preserve the context of its reasoning from previous turns, reducing the "hallucination" rate when debugging code or solving complex logic puzzles.
To run Qwen3.6 35B-A3B locally, your primary constraint is VRAM. Because it is an MoE model, you must fit the total parameter count (35B) into memory, even though only 3B are active during compute.
To run this model comfortably, you should target a 24GB VRAM buffer for quantized versions or 64GB+ of unified memory for FP16/BF16.
For most practitioners, Q4_K_M GGUF is the recommended quantization. It offers negligible perplexity loss compared to FP16 while reducing the memory footprint by over 50%. If you prioritize speed and have limited VRAM, IQ4_XS or Q3_K_L can be used, though reasoning accuracy in complex coding tasks may decline slightly.
On an RTX 4090 using Ollama or vLLM, expect:
The quickest way to deploy is via Ollama:
```bash
ollama run qwen3.6:35b
```
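Once the model is running, you can script it; here is a minimal non-streaming chat request against Ollama's local REST API (default port 11434):

```python
import requests

# Minimal sketch: one non-streaming chat turn against a local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3.6:35b",
        "messages": [{"role": "user", "content": "Explain MoE routing in two sentences."}],
        "stream": False,  # return the whole reply as a single JSON object
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```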
When evaluating Qwen3.6 35B-A3B against its competitors, the primary trade-off is memory footprint versus inference speed.
Qwen3.6 35B-A3B represents the current "Goldilocks" zone for local AI: it is large enough to handle professional-grade coding and reasoning, yet efficient enough to run on a single flagship consumer GPU.