
Alibaba's flagship natively multimodal MoE model with hybrid Gated DeltaNet + MoE architecture. 397B total / 17B active parameters. Supports 201 languages, 262K native context (1M via YaRN). Competes with GPT-5.2 and Claude Opus 4.6.
Copy and paste this command to start running the model locally:

`ollama run qwen3.5`
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required (Approx.) | Quality | Notes |
|---|---|---|---|
| Q2_K | ~130 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | ~235 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | ~273 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | ~326 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | ~420 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | ~794 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Performance Tier | Throughput |
|---|---|---|
| NVIDIA H100 SXM5 80GB | SS | 58.6 tok/s |
| Google Cloud TPU v5p | SS | 48.4 tok/s |
| NVIDIA A100 SXM4 80GB | SS | 35.7 tok/s |
| NVIDIA H200 SXM 141GB | SS | 84.0 tok/s |
| NVIDIA B200 | SS | 140.0 tok/s |
Qwen3.5-397B-A17B is Alibaba Cloud’s flagship open-weights model, designed to compete directly with frontier closed-source models like GPT-5.2 and Claude 4.6. As a Mixture of Experts (MoE) model, it leverages a massive 397-billion parameter total capacity while only activating 17 billion parameters per token during inference. This architecture allows the model to maintain the reasoning capabilities and knowledge density of a 400B-class model while delivering the inference speed and throughput of a much smaller 17B dense model.
Natively multimodal and licensed under Apache 2.0, Qwen3.5-397B-A17B represents a significant milestone for the open-source community in 2025. It is built on a hybrid Gated DeltaNet and MoE architecture, which optimizes for both long-context stability and computational efficiency. For practitioners, this model is the primary candidate for high-stakes local deployments involving complex reasoning, large-scale code generation, and sophisticated multilingual document processing across 201 supported languages.
The defining characteristic of Qwen3.5-397B-A17B is its MoE efficiency. By utilizing 17B active parameters, the model avoids the massive compute overhead typically associated with dense 400B models. However, local practitioners must distinguish between compute requirements and memory requirements: while it calculates like a 17B model, it must store the weights of a 397B model.
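The compute-versus-memory distinction can be made concrete with back-of-envelope arithmetic. This sketch counts weights only (KV cache and activations add more), and the bits-per-weight averages are approximations:

```python
def weight_footprint_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-storage estimate. Memory scales with TOTAL parameters
    (all experts must be resident), while per-token compute scales only
    with the ACTIVE subset."""
    # 1e9 params * (bits / 8) bytes per param, expressed in decimal GB.
    return total_params_billion * bits_per_weight / 8

# Full precision: every one of the 397B weights stored at 16 bits.
print(round(weight_footprint_gb(397, 16)))   # ~794 GB
# Q4_K_M averages roughly 4.8 bits/weight in practice, not a flat 4.
print(round(weight_footprint_gb(397, 4.8)))  # ~238 GB
```

Per-token compute, by contrast, tracks the 17B active parameters, which is why decode speed resembles a 17B dense model while the memory bill resembles a 400B one.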
Unlike traditional Transformers that rely solely on standard attention mechanisms, Qwen3.5 integrates Gated DeltaNet. This hybrid approach improves the model's ability to handle its massive 262,144-token native context window. For specialized long-form tasks, the model supports up to 1 million tokens via YaRN (Yet another RoPE extensioN), making it capable of "reading" entire codebases or multi-hundred-page technical manuals in a single prompt.
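The 1M figure follows directly from the native window and a YaRN scaling factor. A minimal sketch; the `rope_scaling` keys below mirror the Hugging Face `transformers` convention for Qwen-family models and are illustrative, not an official config:

```python
native_ctx = 262_144   # 2**18, the native window
target_ctx = 1_000_000

factor = target_ctx / native_ctx
print(round(factor, 2))  # 3.81; deployments typically round up to 4.0

# Illustrative YaRN block as it might appear in a model's config.json
# (key names follow the Hugging Face transformers convention; assumption):
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": native_ctx,
}
```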
The 17B active parameter count means that once the model is loaded into memory, the Qwen3.5-397B-A17B tokens per second (TPS) performance is remarkably high, often exceeding that of much smaller dense models like Llama 3 70B, provided the hardware has sufficient memory bandwidth.
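That bandwidth dependence can be sketched as an upper bound: each decoded token must stream every active weight from memory at least once. The bandwidth figure below is the published ~3.35 TB/s for H100 SXM5 HBM3; the bits-per-weight average is an assumption, and measured throughput lands well below this ceiling once KV-cache reads, expert routing, and kernel overheads are counted:

```python
def decode_tps_ceiling(active_params_billion: float, bits_per_weight: float,
                       bandwidth_gb_per_s: float) -> float:
    """Bandwidth-bound upper limit on decode tokens/sec: memory
    bandwidth divided by bytes of active weights read per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / gb_read_per_token

# 17B active at ~4.8 bits/weight on an H100 SXM5 (~3350 GB/s):
print(round(decode_tps_ceiling(17, 4.8, 3350)))  # ~328 tok/s ceiling
```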
Qwen3.5-397B-A17B is a general-purpose powerhouse with specific optimizations for technical and logical workloads. Its reasoning benchmark scores place it at the top of the open-weights leaderboard, particularly in math and symbolic logic.
For developers, Qwen3.5-397B-A17B excels at multi-file architecture planning and debugging. Unlike smaller models that struggle with state management across large files, the 262K context window allows it to maintain a coherent understanding of complex dependencies. It supports modern programming languages and can generate production-ready boilerplate, unit tests, and documentation.
The vision capabilities are natively integrated rather than bolted on, so image inputs share the same context window and reasoning stack as text.
With support for 201 languages, the model is uniquely suited for global enterprise applications. It maintains high instruction-following accuracy even in low-resource languages, making it a preferred choice for local translation and summarization pipelines that require high nuance.
The primary challenge for any practitioner is the Qwen3.5-397B-A17B hardware requirements. Because the model has 397 billion parameters, the VRAM footprint is the most significant bottleneck for local execution.
To run Qwen3.5-397B-A17B locally, you must select a quantization level that fits your available memory. Running in full FP16 is impractical for most, requiring nearly 800GB of VRAM.
| Quantization | VRAM Required (Approx.) | Recommended Hardware |
| :--- | :--- | :--- |
| Q2_K (2-bit) | ~130 GB | Mac Studio M4 Ultra (192GB) |
| Q4_K_M (4-bit) | ~235 GB | 4x A100/H100 80GB (320 GB total) or a 10-12x RTX 3090/4090 (24 GB each) rig |
| Q8_0 (8-bit) | ~420 GB | Enterprise Multi-Node Cluster |
The best quantization for Qwen3.5-397B-A17B is generally Q4_K_M. It preserves most of the FP16 model's quality while bringing the memory requirement down to a range manageable by high-end workstation clusters.
The best GPU for Qwen3.5-397B-A17B depends on your budget and form factor: data-center parts such as the H100, H200, and B200 deliver the highest throughput, while large-unified-memory workstations like the Mac Studio trade speed for single-box capacity.
Ollama is the fastest way to get started, as it handles the MoE logic and memory mapping automatically: `ollama run qwen3.5:397b` (assuming you have the required VRAM). For more granular control over layer placement and GPU offloading, llama.cpp or vLLM are recommended.
When evaluating this model against its peers, the distinction usually comes down to the MoE architecture versus dense scaling.
Llama 3.1 405B is a dense model, meaning every parameter is active for every token, so it pays the full 400B-class compute cost on every forward pass.
DeepSeek-V3 is a fellow MoE model that pioneered many of the efficiencies Qwen3.5 utilizes.
For practitioners looking for a local 397B-parameter model in 2025, Qwen3.5-397B-A17B is currently the most versatile option for those who need a single model to handle vision, coding, and massive context windows without the latency penalties of dense 400B architectures.