
A compact MoE model with only 3B active parameters that outperforms the much larger Qwen3-235B-A22B across most benchmarks. Uses hybrid Gated DeltaNet + MoE architecture. Runs on 8GB+ VRAM GPUs.
Copy and paste this command to start running the model locally:

```
ollama run qwen3.5:35b
```

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 7.9 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 8.5 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 8.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 9.2 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 9.9 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 12.8 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Vendor | Speed | VRAM Used |
|---|---|---|---|
| Google Cloud TPU v5e | Google | 77.3 tok/s | 8.5 GB |
| Intel Arc A770 16GB | Intel | 52.8 tok/s | 8.5 GB |
| Intel Arc B580 | Intel | 43.0 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 4070 | NVIDIA | 47.6 tok/s | 8.5 GB |
| NVIDIA GeForce RTX 5070 | NVIDIA | 63.4 tok/s | 8.5 GB |
| NVIDIA L40S | NVIDIA | 81.5 tok/s | 8.5 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | 192.4 tok/s | 8.5 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | 316.1 tok/s | 8.5 GB |
| Google Cloud TPU v5p | Google | 260.9 tok/s | 8.5 GB |
Qwen3.5-35B-A3B represents a significant shift in the efficiency-to-performance ratio for local LLM deployment. Developed by Alibaba Cloud (Qwen) and released in early 2025, this model utilizes a sophisticated Mixture of Experts (MoE) architecture combined with a hybrid Gated DeltaNet framework. Despite having a total parameter count of 35 billion, it only activates 3 billion parameters during any single inference step. This design allows it to deliver reasoning capabilities that exceed much larger dense models while maintaining the inference speed of a small-scale model.
For practitioners running models locally, Qwen3.5-35B-A3B is positioned as a high-utility "daily driver." It effectively bridges the gap between the lightweight 7B-8B parameter models, which often struggle with complex reasoning, and the massive 70B+ models that require dual-GPU setups or significant VRAM offloading. Because it outperforms the much larger Qwen3-235B-A22B across most standard benchmarks, it is currently one of the most efficient local models in its class for those prioritizing performance per watt and tokens per second.
The core of the Qwen3.5-35B-A3B architecture is a hybrid Gated DeltaNet + MoE design. Unlike standard dense models where every parameter is calculated for every token, this MoE setup routes inputs to specific "experts." With only 3B active parameters out of 35B total, the model achieves a massive reduction in compute requirements without sacrificing the "knowledge base" stored in its total weights.
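The routing idea can be sketched in a few lines. This is an illustrative NumPy toy (the function names, shapes, and expert count are invented for the example), not the model's actual implementation:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Minimal sketch of top-k MoE routing: only k experts run per token."""
    logits = x @ gate_w                      # router score for each expert
    top_k = np.argsort(logits)[-k:]          # pick the k highest-scoring experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                 # softmax over the selected experts
    # Only the chosen experts are evaluated, which is why a 35B-total model
    # can cost roughly as much compute per token as a small dense model.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Toy setup: 4 experts, each a random linear map; only 2 run per token.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(d, n_experts))
experts = [
    (lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
    for _ in range(n_experts)
]
y = moe_forward(rng.normal(size=d), gate_w, experts, k=2)
```

The total weights (all experts) still occupy memory, but per-token compute scales with `k`, not with the expert count.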
In local inference, it is critical to distinguish between VRAM capacity and compute overhead. To run Qwen3.5-35B-A3B locally, your GPU must have enough VRAM to hold the 35B total parameters (determined by your quantization level). However, because only 3B parameters are active, the Qwen3.5-35B-A3B tokens per second will be significantly higher than a dense 35B model, behaving more like a 3B-7B model once the weights are loaded into memory.
The model features a massive context length of 262,144 tokens. This makes it a premier choice for local RAG (Retrieval-Augmented Generation) applications and long-document analysis. Furthermore, it is natively multimodal. It handles vision-language tasks with high precision, allowing for visual reasoning, OCR, and document understanding alongside standard text-based instruction following. Licensed under Apache 2.0, it remains one of the most permissive high-performance models available for commercial and private local use.
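For a local RAG pipeline, the long context lets you pack many retrieved chunks into a single prompt. A minimal sketch of the retrieval step, using bag-of-words cosine similarity as a stand-in for a real embedding model (function names are invented for the example):

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(document, query, chunk_words=200, k=3):
    """Split a long document into fixed-size chunks and return the k most
    query-relevant ones, to be packed into the model's context window."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    q = Counter(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: cosine(Counter(c.lower().split()), q),
                    reverse=True)
    return scored[:k]
```

With 262K tokens of context, `k` and `chunk_words` can be far larger than with a typical 8K-context model, so less aggressive filtering is needed.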
Qwen3.5-35B-A3B is a generalist model with specific optimizations for technical workloads. Its 2025 training cutoff ensures it has a modern understanding of software libraries and current events.
For developers, coding is a standout use case for Qwen3.5-35B-A3B, helped by its up-to-date knowledge of modern software libraries.
The multimodal capabilities are not an afterthought: practitioners can use the model for visual reasoning, OCR, and document understanding entirely on local hardware.
Qwen models have historically dominated multilingual benchmarks. Qwen3.5-35B-A3B continues this trend, showing high performance in non-English languages, particularly across CJK (Chinese, Japanese, Korean) and European languages. On reasoning benchmarks, it consistently scores in the top tier for GSM8K and MATH, making it suitable for local tutoring or complex logical verification tasks.
The primary bottleneck for this model is VRAM capacity, not compute power. Because of the MoE architecture, even mid-range GPUs can achieve high throughput if the model fits in memory.
To estimate your Qwen3.5-35B-A3B hardware requirements, use the quantization table above as a guide: roughly 8.5 GB of VRAM for the recommended Q4_K_M build, scaling up to about 12.8 GB for full FP16 precision.
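As a general rule of thumb, the weight footprint is total parameters times effective bits per weight, plus headroom for the KV cache and activations. A back-of-envelope helper, assuming a hypothetical fixed overhead; the measured figures in the quantization table above should take precedence for this specific model:

```python
def estimate_vram_gb(total_params_billion, bits_per_weight, overhead_gb=1.5):
    """Back-of-envelope VRAM estimate for a quantized model.

    total_params_billion: ALL parameters must fit in memory, even though
                          only a few billion are active per token in an MoE.
    bits_per_weight:      e.g. ~4.5 effective bits for Q4_K_M, 16 for FP16.
    overhead_gb:          assumed allowance for KV cache and activations.
    """
    weight_gb = total_params_billion * bits_per_weight / 8  # bits -> bytes
    return weight_gb + overhead_gb

# Example: a hypothetical 3B dense model at ~4.5 effective bits.
print(round(estimate_vram_gb(3, 4.5), 2))  # -> 3.19
```

Real GGUF files mix quantization types per tensor, so treat the effective bit counts as approximations.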
The fastest way to run Qwen3.5-35B-A3B locally is via Ollama. Once installed, you can pull the model directly:
```
ollama run qwen3.5:35b
```
By default, Ollama will likely serve a 4-bit quantized version, which provides a balanced experience for users with roughly 8-12 GB of VRAM.
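Once the Ollama server is running (it listens on `localhost:11434` by default), you can also call the model programmatically via its HTTP API. A minimal Python sketch using only the standard library; the prompt is a placeholder, and `num_ctx` is an assumed setting you should size to your available memory:

```python
import json
import urllib.request

# Request payload for Ollama's /api/generate endpoint.
payload = {
    "model": "qwen3.5:35b",            # tag from `ollama run` above
    "prompt": "Explain MoE routing in two sentences.",
    "stream": False,                   # return one JSON object, not a stream
    "options": {"num_ctx": 32768},     # raise toward 262144 only if RAM allows
}

def generate(url="http://localhost:11434/api/generate"):
    """Send one completion request to a locally running Ollama server."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Keeping `num_ctx` modest avoids allocating KV cache for the full 262K window when you do not need it.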
When evaluating Qwen3.5-35B-A3B, it is most frequently compared to Mistral's Mixtral 8x7B and Llama 3.1 70B.
While Mixtral 8x7B was the pioneer of open MoE models, Qwen3.5-35B-A3B represents a significant generational leap, offering a far longer 262K-token context window, native multimodality, and a more recent (2025) training cutoff.
The comparison with Llama 3.1 70B is a question of "efficiency vs. raw power." Llama 3.1 70B is a dense model, meaning it is significantly slower and requires 40GB+ of VRAM for 4-bit quantization. In contrast, the MoE efficiency of Qwen3.5-35B-A3B allows it to match or beat Llama 70B in coding and multilingual tasks while running at nearly triple the tokens per second on the same hardware. However, for pure English-language creative writing or extremely nuanced prose, Llama 70B may still hold a slight edge in "vibe" and stylistic consistency.
Surprisingly, the 35B-A3B variant often outperforms its 235B predecessor. This is due to the refined training data and the Gated DeltaNet architecture introduced in the 3.5 series. For local practitioners, the 35B-A3B is the clear winner, as the 235B model requires enterprise-grade hardware (A100/H100 clusters) to run effectively, whereas the 35B model is perfectly at home on a single high-end consumer workstation.