
Alibaba's flagship natively multimodal MoE model with hybrid Gated DeltaNet + MoE architecture. 397B total / 17B active parameters. Supports 201 languages, 262K native context (1M via YaRN). Competes with GPT-5.2 and Claude Opus 4.6.
Copy and paste this command to start running the model locally:

`ollama run qwen3.5`
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required (Approx.) | Quality | Notes |
|---|---|---|---|
| Q2_K | ~130 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | ~235 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | ~273 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | ~326 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | ~420 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | ~794 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Performance Tier | Throughput |
|---|---|---|
| NVIDIA H100 SXM5 80GB | SS | 58.6 tok/s |
| Google Cloud TPU v5p | SS | 48.4 tok/s |
| NVIDIA A100 SXM4 80GB | SS | 35.7 tok/s |
| NVIDIA H200 SXM 141GB | SS | 84.0 tok/s |
| NVIDIA B200 | SS | 140.0 tok/s |
Qwen3.5-397B-A17B is Alibaba Cloud’s flagship open-weights model, designed to compete directly with frontier closed-source models like GPT-5.2 and Claude 4.6. As a Mixture of Experts (MoE) model, it leverages a massive 397-billion parameter total capacity while only activating 17 billion parameters per token during inference. This architecture allows the model to maintain the reasoning capabilities and knowledge density of a 400B-class model while delivering the inference speed and throughput of a much smaller 17B dense model.
Natively multimodal and licensed under Apache 2.0, Qwen3.5-397B-A17B represents a significant milestone for the open-source community in 2025. It is built on a hybrid Gated DeltaNet and MoE architecture, which optimizes for both long-context stability and computational efficiency. For practitioners, this model is the primary candidate for high-stakes local deployments involving complex reasoning, large-scale code generation, and sophisticated multilingual document processing across 201 supported languages.
The defining characteristic of Qwen3.5-397B-A17B is its MoE efficiency. By utilizing 17B active parameters, the model avoids the massive compute overhead typically associated with dense 400B models. However, local practitioners must distinguish between compute requirements and memory requirements: while it calculates like a 17B model, it must store the weights of a 397B model.
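The compute-versus-memory distinction can be made concrete with back-of-envelope arithmetic. This sketch counts weights only (KV cache and activations add more), and the bits-per-weight averages are approximations:

```python
def weight_footprint_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-storage estimate. Memory scales with TOTAL parameters
    (all experts must be resident), while per-token compute scales only
    with the ACTIVE subset."""
    # 1e9 params * (bits / 8) bytes per param, expressed in decimal GB.
    return total_params_billion * bits_per_weight / 8

# Full precision: every one of the 397B weights stored at 16 bits.
print(round(weight_footprint_gb(397, 16)))   # ~794 GB
# Q4_K_M averages roughly 4.8 bits/weight in practice, not a flat 4.
print(round(weight_footprint_gb(397, 4.8)))  # ~238 GB
```

Per-token compute, by contrast, tracks the 17B active parameters, which is why decode speed resembles a 17B dense model while the memory bill resembles a 400B one.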
Unlike traditional Transformers that rely solely on standard attention mechanisms, Qwen3.5 integrates Gated DeltaNet. This hybrid approach improves the model's ability to handle its massive 262,144-token native context window. For specialized long-form tasks, the model supports up to 1 million tokens via YaRN (Yet another RoPE extensioN), making it capable of "reading" entire codebases or multi-hundred-page technical manuals in a single prompt.
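The 1M figure follows directly from the native window and a YaRN scaling factor. A minimal sketch; the `rope_scaling` keys below mirror the Hugging Face `transformers` convention for Qwen-family models and are illustrative, not an official config:

```python
native_ctx = 262_144   # 2**18, the native window
target_ctx = 1_000_000

factor = target_ctx / native_ctx
print(round(factor, 2))  # 3.81; deployments typically round up to 4.0

# Illustrative YaRN block as it might appear in a model's config.json
# (key names follow the Hugging Face transformers convention; assumption):
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": native_ctx,
}
```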
The 17B active parameter count means that once the model is loaded into memory, the Qwen3.5-397B-A17B tokens per second (TPS) performance is remarkably high, often exceeding that of much smaller dense models like Llama 3 70B, provided the hardware has sufficient memory bandwidth.
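That bandwidth dependence can be sketched as an upper bound: each decoded token must stream every active weight from memory at least once. The bandwidth figure below is the published ~3.35 TB/s for H100 SXM5 HBM3; the bits-per-weight average is an assumption, and measured throughput lands well below this ceiling once KV-cache reads, expert routing, and kernel overheads are counted:

```python
def decode_tps_ceiling(active_params_billion: float, bits_per_weight: float,
                       bandwidth_gb_per_s: float) -> float:
    """Bandwidth-bound upper limit on decode tokens/sec: memory
    bandwidth divided by bytes of active weights read per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / gb_read_per_token

# 17B active at ~4.8 bits/weight on an H100 SXM5 (~3350 GB/s):
print(round(decode_tps_ceiling(17, 4.8, 3350)))  # ~328 tok/s ceiling
```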
Qwen3.5-397B-A17B is a general-purpose powerhouse with specific optimizations for technical and logical workloads. Its reasoning benchmark scores place it at the top of the open-weights leaderboard, particularly in math and symbolic logic.
For developers, Qwen3.5-397B-A17B excels at multi-file architecture planning and debugging. Unlike smaller models that struggle with state management across large files, the 262K context window allows it to maintain a coherent understanding of complex dependencies. It supports modern programming languages and can generate production-ready boilerplate, unit tests, and documentation.
The vision capabilities are natively integrated rather than bolted on, so image inputs share the same context window and reasoning stack as text.
With support for 201 languages, the model is uniquely suited for global enterprise applications. It maintains high instruction-following accuracy even in low-resource languages, making it a preferred choice for local translation and summarization pipelines that require high nuance.
The primary challenge for any practitioner is the Qwen3.5-397B-A17B hardware requirements. Because the model has 397 billion parameters, the VRAM footprint is the most significant bottleneck for local execution.
To run Qwen3.5-397B-A17B locally, you must select a quantization level that fits your available memory. Running in full FP16 is impractical for most, requiring nearly 800GB of VRAM.
| Quantization | VRAM Required (Approx.) | Recommended Hardware |
| :--- | :--- | :--- |
| Q2_K (2-bit) | ~130 GB | Mac Studio M4 Ultra (192GB) |
| Q4_K_M (4-bit) | ~235 GB | 4x A100/H100 80GB (320 GB total) or a 10-12x RTX 3090/4090 (24 GB each) rig |
| Q8_0 (8-bit) | ~420 GB | Enterprise Multi-Node Cluster |
The best quantization for Qwen3.5-397B-A17B is generally Q4_K_M. It preserves most of the FP16 model's quality while bringing the memory requirement down to a range manageable by high-end workstation clusters.
The best GPU for Qwen3.5-397B-A17B depends on your budget and form factor: data-center parts such as the H100, H200, and B200 deliver the highest throughput, while large-unified-memory workstations like the Mac Studio trade speed for single-box capacity.
Ollama is the fastest way to get started, as it handles the MoE logic and memory mapping automatically: `ollama run qwen3.5:397b` (assuming you have the required VRAM). For more granular control over layer placement and GPU offloading, llama.cpp or vLLM are recommended.
When evaluating this model against its peers, the distinction usually comes down to the MoE architecture versus dense scaling.
Llama 3.1 405B is a dense model, meaning every parameter is active for every token, so it pays the full 400B-class compute cost on every forward pass.
DeepSeek-V3 is a fellow MoE model that pioneered many of the efficiencies Qwen3.5 utilizes.
For practitioners looking for a local 397B-parameter model in 2025, Qwen3.5-397B-A17B is currently the most versatile option for those who need a single model to handle vision, coding, and massive context windows without the latency penalties of dense 400B architectures.