
An omnimodal Hybrid-Attention MoE model capable of processing text, images, and over 10 hours of continuous audio.
Copy and paste this command to start running the model locally:

```bash
ollama run qwen3.5
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 41.6 GB | Low |
| Q4_K_M (Recommended) | 45.2 GB | Good |
| Q5_K_M | 46.9 GB | Very Good |
| Q6_K | 48.9 GB | Excellent |
| Q8_0 | 53.2 GB | Near Perfect |
| FP16 | 69.3 GB | Full |
See which devices can run this model and at what quality level.
| Device | Grade | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H100 SXM5 80GB | S | 59.7 tok/s | 45.2 GB |
| Google Cloud TPU v5p | S | 49.3 tok/s | 45.2 GB |
| NVIDIA A100 SXM4 80GB | S | 36.3 tok/s | 45.2 GB |
| NVIDIA H200 SXM 141GB | S | 85.5 tok/s | 45.2 GB |
| NVIDIA B200 GPU | S | 142.6 tok/s | 45.2 GB |
| Gigabyte W775-V10-L01 | S | 126.5 tok/s | 45.2 GB |
| SuperMicro Super AI Station | S | 126.5 tok/s | 45.2 GB |
Qwen 3.5 Omni is Alibaba Cloud’s flagship multimodal Mixture of Experts (MoE) model, designed to unify text, vision, and audio processing into a single architectural pass. With a massive 397B total parameters—but only 17B active during any single inference step—it represents a significant push toward high-capacity reasoning that remains computationally feasible for structured local deployments.
Unlike previous generations that relied on "stitching" separate models for speech-to-text (ASR) and text-to-speech (TTS), Qwen 3.5 Omni handles these modalities natively. This "omnimodal" approach reduces latency and preserves the nuanced prosody of audio and the spatial context of visual data. For developers, this model serves as an open-weights alternative to closed-source giants like GPT-4o and Gemini 1.5 Pro, offering a 256,000-token context window that can ingest over 10 hours of continuous audio or extensive video files.
The defining characteristic of Qwen 3.5 Omni is its Hybrid-Attention MoE architecture. While the model houses 397B parameters, its 17B active parameter count means that during inference, it behaves similarly to a much smaller model in terms of compute requirements, though it still demands the VRAM capacity of a massive model to store the weights.
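To make that compute/memory split concrete, here is a minimal, illustrative sketch of top-k expert routing in PyTorch. The expert count, top-k value, and dimensions are toy assumptions chosen for clarity, not Qwen's actual configuration:

```python
import torch
import torch.nn.functional as F

# Toy MoE routing sketch (illustrative only, not Qwen's implementation).
n_experts = 64   # total experts whose weights must sit in VRAM (assumption)
top_k = 4        # experts actually executed per token (assumption)
d_model = 1024

gate = torch.nn.Linear(d_model, n_experts, bias=False)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    # The router scores every expert, but only the top-k run per token,
    # so compute scales with k / n_experts while weight storage does not.
    scores = F.softmax(gate(x), dim=-1)           # (batch, n_experts)
    topv, topi = scores.topk(top_k, dim=-1)       # pick k experts per token
    topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize gate weights
    out = torch.zeros_like(x)
    for j in range(top_k):
        for b in range(x.shape[0]):
            e = topi[b, j].item()
            out[b] += topv[b, j] * experts[e](x[b])
    return out

print(moe_forward(torch.randn(2, d_model)).shape)  # torch.Size([2, 1024])
```

Because only `top_k` of `n_experts` experts execute per token, per-token FLOPs scale with the active fraction while every expert's weights must remain resident, which is why the 397B total count, not the 17B active count, drives VRAM requirements.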
Qwen 3.5 Omni is not a generalist chatbot; it is a high-reasoning engine capable of sophisticated multimodal logic.
The model supports over 113 languages for audio recognition and 36 for speech generation, making it a premier choice for building localized voice assistants or automated translation services that require high-fidelity cultural context.
With its ability to process 720p video natively, Qwen 3.5 Omni also excels at long-form video understanding tasks.
The 256K context window allows for the ingestion of multiple long-form documents simultaneously. Practitioners use it to synthesize technical documentation, legal contracts, or medical records where cross-referencing between distant parts of the text is required.
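As a rough sketch of that workflow, the snippet below sends two long documents to a locally served instance through Ollama's OpenAI-compatible endpoint. The file names and prompt are hypothetical, and the model tag assumes the 397B variant discussed in the deployment section below:

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API at /v1 once the model is pulled.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

contract = open("contract.txt").read()  # hypothetical long-form documents
spec = open("spec.txt").read()

response = client.chat.completions.create(
    model="qwen3.5:397b",
    messages=[
        {"role": "system", "content": "You cross-reference documents precisely."},
        {"role": "user", "content": (
            "Compare the warranty terms in these two documents and list "
            f"any contradictions.\n\n--- DOC 1 ---\n{contract}"
            f"\n\n--- DOC 2 ---\n{spec}"
        )},
    ],
)
print(response.choices[0].message.content)
```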
Running a 397B-parameter model locally is a significant hardware undertaking. Even with its MoE efficiency, the primary bottleneck is VRAM: you must budget for the total parameter count, not just the active ones.
For most practitioners, 4-bit quantization (Q4_K_M) is the "sweet spot" for maintaining intelligence while reducing the memory footprint.
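Before pulling the weights, it is worth verifying that your GPUs actually have headroom for that footprint. The sketch below is a simple pre-flight check using PyTorch's CUDA memory query; the 45.2 GB threshold comes from the Q4_K_M row in the table above, and summing across GPUs assumes your runtime can split the weights across devices:

```python
import torch

REQUIRED_GB = 45.2  # Q4_K_M figure from the quantization table above

# Sum free VRAM across all visible CUDA devices.
free_total = 0.0
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free, total) in bytes
    free_total += free / 1e9
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")

print("OK to load" if free_total >= REQUIRED_GB
      else "Need more VRAM or a smaller quant")
```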
For serving, llama.cpp or vLLM is the standard choice for independent researchers. The fastest way to deploy is via Ollama, which handles the orchestration of MoE weights across your available hardware:
```bash
ollama run qwen3.5:397b
```
For those with limited VRAM, the qwen3.5:cloud variant offloads inference to a hosted API; for true local execution, the 122B and smaller variants in the Qwen 3.5 family are better suited to single-GPU (24 GB) setups.
When evaluating Qwen 3.5 Omni against Llama 3.1 405B, the choice depends on your modality needs: Llama 3.1 405B is a text-only model, while Qwen 3.5 Omni handles text, vision, and audio in a single pass.
For developers seeking a "Swiss Army Knife" for local AI—capable of seeing, hearing, and reasoning across a quarter-million tokens—Qwen 3.5 Omni is the most versatile open-weights model currently available.