
An omnimodal Hybrid-Attention MoE model capable of processing text, images, and over 10 hours of continuous audio.
Copy and paste this command to start running the model locally:

```bash
ollama run qwen3.5
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 41.6 GB | Low |
| Q4_K_M (Recommended) | 45.2 GB | Good |
| Q5_K_M | 46.9 GB | Very Good |
| Q6_K | 48.9 GB | Excellent |
| Q8_0 | 53.2 GB | Near Perfect |
| FP16 | 69.3 GB | Full |
See which devices can run this model and at what quality level.
| Device | Grade | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H100 SXM5 80GB | S | 59.7 tok/s | 45.2 GB |
| Google Cloud TPU v5p | S | 49.3 tok/s | 45.2 GB |
| NVIDIA A100 SXM4 80GB | S | 36.3 tok/s | 45.2 GB |
| NVIDIA H200 SXM 141GB | S | 85.5 tok/s | 45.2 GB |
| NVIDIA B200 GPU | S | 142.6 tok/s | 45.2 GB |
| Gigabyte W775-V10-L01 | S | 126.5 tok/s | 45.2 GB |
| SuperMicro Super AI Station | S | 126.5 tok/s | 45.2 GB |
Qwen 3.5 Omni is Alibaba Cloud’s flagship multimodal Mixture of Experts (MoE) model, designed to unify text, vision, and audio processing into a single architectural pass. With a massive 397B total parameters—but only 17B active during any single inference step—it represents a significant push toward high-capacity reasoning that remains computationally feasible for structured local deployments.
Unlike previous generations that relied on "stitching" separate models for speech-to-text (ASR) and text-to-speech (TTS), Qwen 3.5 Omni handles these modalities natively. This "omnimodal" approach reduces latency and preserves the nuanced prosody of audio and the spatial context of visual data. For developers, this model serves as an open-weights alternative to closed-source giants like GPT-4o and Gemini 1.5 Pro, offering a 256,000-token context window that can ingest over 10 hours of continuous audio or extensive video files.
The defining characteristic of Qwen 3.5 Omni is its Hybrid-Attention MoE architecture. While the model houses 397B parameters, its 17B active parameter count means that during inference, it behaves similarly to a much smaller model in terms of compute requirements, though it still demands the VRAM capacity of a massive model to store the weights.
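To make that compute/memory split concrete, here is a minimal, illustrative sketch of top-k expert routing in PyTorch. The expert count, top-k value, and dimensions are toy assumptions chosen for clarity, not Qwen's actual configuration:

```python
import torch
import torch.nn.functional as F

# Toy MoE routing sketch (illustrative only, not Qwen's implementation).
n_experts = 64   # total experts whose weights must sit in VRAM (assumption)
top_k = 4        # experts actually executed per token (assumption)
d_model = 1024

gate = torch.nn.Linear(d_model, n_experts, bias=False)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    # The router scores every expert, but only the top-k run per token,
    # so compute scales with k / n_experts while weight storage does not.
    scores = F.softmax(gate(x), dim=-1)           # (batch, n_experts)
    topv, topi = scores.topk(top_k, dim=-1)       # pick k experts per token
    topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize gate weights
    out = torch.zeros_like(x)
    for j in range(top_k):
        for b in range(x.shape[0]):
            e = topi[b, j].item()
            out[b] += topv[b, j] * experts[e](x[b])
    return out

print(moe_forward(torch.randn(2, d_model)).shape)  # torch.Size([2, 1024])
```

Because only `top_k` of `n_experts` experts execute per token, per-token FLOPs scale with the active fraction while every expert's weights must remain resident, which is why the 397B total count, not the 17B active count, drives VRAM requirements.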
Qwen 3.5 Omni is not a generalist chatbot; it is a high-reasoning engine capable of sophisticated multimodal logic.
The model supports over 113 languages for audio recognition and 36 for speech generation, making it a premier choice for building localized voice assistants or automated translation services that require high-fidelity cultural context.
With its ability to process 720p video natively, Qwen 3.5 Omni also excels at long-form video understanding tasks.
The 256K context window allows for the ingestion of multiple long-form documents simultaneously. Practitioners use it to synthesize technical documentation, legal contracts, or medical records where cross-referencing between distant parts of the text is required.
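As a rough sketch of that workflow, the snippet below sends two long documents to a locally served instance through Ollama's OpenAI-compatible endpoint. The file names and prompt are hypothetical, and the model tag assumes the 397B variant discussed in the deployment section below:

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API at /v1 once the model is pulled.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

contract = open("contract.txt").read()  # hypothetical long-form documents
spec = open("spec.txt").read()

response = client.chat.completions.create(
    model="qwen3.5:397b",
    messages=[
        {"role": "system", "content": "You cross-reference documents precisely."},
        {"role": "user", "content": (
            "Compare the warranty terms in these two documents and list "
            f"any contradictions.\n\n--- DOC 1 ---\n{contract}"
            f"\n\n--- DOC 2 ---\n{spec}"
        )},
    ],
)
print(response.choices[0].message.content)
```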
Running a 397B-parameter model locally is a significant hardware undertaking. Even with its MoE efficiency, the primary bottleneck is VRAM: you must budget for the total parameter count, not just the active ones.
For most practitioners, 4-bit quantization (Q4_K_M) is the "sweet spot" for maintaining intelligence while reducing the memory footprint.
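Before pulling the weights, it is worth verifying that your GPUs actually have headroom for that footprint. The sketch below is a simple pre-flight check using PyTorch's CUDA memory query; the 45.2 GB threshold comes from the Q4_K_M row in the table above, and summing across GPUs assumes your runtime can split the weights across devices:

```python
import torch

REQUIRED_GB = 45.2  # Q4_K_M figure from the quantization table above

# Sum free VRAM across all visible CUDA devices.
free_total = 0.0
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free, total) in bytes
    free_total += free / 1e9
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")

print("OK to load" if free_total >= REQUIRED_GB
      else "Need more VRAM or a smaller quant")
```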
For serving, llama.cpp or vLLM is the standard choice for independent researchers. The fastest way to deploy is via Ollama, which handles the orchestration of MoE weights across your available hardware:
```bash
ollama run qwen3.5:397b
```
For those with limited VRAM, the qwen3.5:cloud variant offloads inference to a hosted API; for true local execution, the 122B and smaller variants in the Qwen 3.5 family are better suited to single-GPU (24 GB) setups.
When evaluating Qwen 3.5 Omni against Llama 3.1 405B, the choice depends on your modality needs: Llama 3.1 405B is a text-only model, while Qwen 3.5 Omni handles text, vision, and audio in a single pass.
For developers seeking a "Swiss Army Knife" for local AI—capable of seeing, hearing, and reasoning across a quarter-million tokens—Qwen 3.5 Omni is the most versatile open-weights model currently available.