
Compact multimodal model from the Qwen3.5 small series. 262K context, 201 languages. Thinking and non-thinking modes. Strong performance for its size class.
A strong 9B-parameter dense language model from Alibaba. It pulls ahead on competition math, scoring 93/100 on AIME 2026, so reach for it when that's the dimension that matters.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Copy and paste this command to start running the model locally.
ollama run qwen3.5:9b
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 22.7 GB | Low |
| Q4_K_M (Recommended) | 24.6 GB | Good |
| Q5_K_M | 25.5 GB | Very Good |
| Q6_K | 26.6 GB | Excellent |
| Q8_0 | 28.8 GB | Near Perfect |
| FP16 | 37.4 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| Origin PC M-CLASS v2 | Origin PC | SS | 58.7 tok/s | 24.6 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 66.7 tok/s | 24.6 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 109.6 tok/s | 24.6 GB |
| Google Cloud TPU v5p | Google | SS | 90.5 tok/s | 24.6 GB |
| Origin PC L-CLASS v2 | Origin PC | SS | 31.4 tok/s | 24.6 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | AA | 157.1 tok/s | 24.6 GB |
| Google TPU v7 (Ironwood) | Google | AA | 241.6 tok/s | 24.6 GB |
| NVIDIA B200 GPU | NVIDIA | AA | 261.8 tok/s | 24.6 GB |
| NVIDIA L40S | NVIDIA | AA | 28.3 tok/s | 24.6 GB |
| Gigabyte W775-V10-L01 | Gigabyte | AA | 232.4 tok/s | 24.6 GB |
| SuperMicro Super AI Station | SuperMicro | AA | 232.4 tok/s | 24.6 GB |
Energy cost on AMD Radeon RX 7900 XTX (~31 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only): Qwen3.5-9B on AMD Radeon RX 7900 XTX · ~31 tok/s · 355W | $0.377 |
| GPT-5.5 (OpenAI · in $5.00 · out $30.00) | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic · in $5.00 · out $25.00) | $11.00 |
| Gemini 3.1 Flash Lite Preview (Google · in $0.250 · out $1.50) | $0.625 |
| Grok 4.3 beta (xAI · in $3.00 · out $15.00) | $6.60 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
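The blended figures reduce to simple arithmetic, reproduced in the sketch below. The electricity rate is an assumption on our part; the table's $0.377 local figure is consistent with roughly $0.12/kWh, so substitute your own rate.

```python
# Reproduces the blended-cost table above. The electricity rate is an
# assumption; the table's $0.377 local figure implies roughly $0.12/kWh.
def blended_api_cost(in_price, out_price, in_share=0.70):
    """Cost per 1M tokens at a 70% input / 30% output blend."""
    return in_share * in_price + (1 - in_share) * out_price

def local_energy_cost(watts=355, tok_per_s=31, usd_per_kwh=0.1185):
    """Energy-only cost per 1M tokens generated locally."""
    kwh = watts / 1000 * (1_000_000 / tok_per_s) / 3600
    return kwh * usd_per_kwh

print(f"GPT-5.5:        ${blended_api_cost(5.00, 30.00):.2f}")  # $12.50
print(f"Local (energy): ${local_energy_cost():.3f}")            # ~$0.377
```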
Qwen3.5-9B is a dense, multimodal large language model developed by Alibaba Cloud. Positioned as the high-performance entry point of the Qwen3.5 small series, this 9-billion parameter model is engineered to bridge the gap between lightweight edge models and massive data-center-grade LLMs. With a 2025 training cutoff and an Apache 2.0 license, it provides a permissive and up-to-date foundation for local deployment.
The model occupies a competitive niche, directly challenging Meta’s Llama 3.1 8B and Mistral’s 7B series. Unlike many models in this size class that prioritize either text or vision, Qwen3.5-9B is natively multimodal, handling text, code, and visual inputs within a unified architecture. It is particularly notable for its dual-mode execution—offering both standard "non-thinking" inference and a dedicated "thinking" mode for complex reasoning tasks, a feature increasingly sought after for autonomous agent workflows.
Qwen3.5-9B utilizes a dense Transformer architecture. Unlike Mixture-of-Experts (MoE) models that activate only a fraction of their parameters during inference, this model is fully dense. This means all 9 billion parameters are utilized for every token generated, providing a high level of "knowledge density" per gigabyte of VRAM.
One of the most significant technical advantages of Qwen3.5-9B is its massive context length of 262,144 tokens. For a 9B parameter model, this is an industry-leading specification. It allows practitioners to ingest entire codebases, long technical manuals, or multiple high-resolution images into the prompt without hitting context limits.
The model is trained on a vast corpus supporting 201 languages, making it one of the most capable multilingual models in the sub-10B category. The tokenizer is highly efficient for non-English scripts, which reduces the total token count for multilingual tasks and results in higher effective throughput compared to models with English-centric tokenizers.
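If you want to sanity-check the tokenizer-efficiency claim on your own text, a quick token count is enough. A minimal sketch follows; the Hugging Face repo id is a placeholder assumption, not a confirmed path.

```python
# Sketch: comparing token counts across scripts. The repo id below is a
# placeholder assumption, not a confirmed Hugging Face path.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")
samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Chinese": "敏捷的棕色狐狸跳过了懒狗。",
    "Arabic": "الثعلب البني السريع يقفز فوق الكلب الكسول.",
}
for lang, text in samples.items():
    print(f"{lang}: {len(tok.encode(text))} tokens")
```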
The vision capabilities are integrated directly into the model, rather than being a "bolted-on" adapter. This allows for sophisticated reasoning across modalities—for example, explaining a complex architectural diagram or debugging code from a screenshot of a terminal error.
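If you serve the model through Ollama (covered below), its REST API accepts base64-encoded images for vision-capable models, which makes the screenshot-debugging workflow scriptable. A minimal sketch, assuming the qwen3.5:9b tag serves the vision-enabled weights:

```python
# Sketch: multimodal prompting through Ollama's REST API, which accepts
# base64-encoded images. Assumes qwen3.5:9b serves vision-enabled weights.
import base64
import json
import urllib.request

img_b64 = base64.b64encode(open("diagram.png", "rb").read()).decode()
payload = {
    "model": "qwen3.5:9b",
    "prompt": "Explain what this architecture diagram shows.",
    "images": [img_b64],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```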
Qwen3.5-9B is a generalist model that excels in environments where hardware is constrained but performance cannot be sacrificed.
The "thinking" mode allows the model to perform internal chain-of-thought processing before delivering a final answer. In Qwen3.5-9B reasoning benchmarks, the model shows a marked improvement over standard 8B-7B models in logical deduction and multi-step math problems. This makes it an ideal candidate for local agents that need to plan actions before executing them.
For developers, this model functions as a highly capable local coding assistant. It supports a wide array of programming languages and benefits significantly from the 262K context window, which enables "repo-aware" chat where the model can reference multiple files simultaneously. It handles boilerplate generation, refactoring, and complex debugging tasks with a level of precision usually reserved for 30B+ parameter models.
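A minimal sketch of repo-aware prompt packing, with a crude characters-per-token estimate standing in for real tokenization (the paths and budgets are illustrative, not prescribed by the model):

```python
# Sketch: pack source files into one long prompt while staying under the
# 262K-token window. The ~4 chars/token estimate and paths are illustrative.
from pathlib import Path

MAX_TOKENS = 262_144
budget = MAX_TOKENS - 4_096        # reserve room for the model's answer
chunks, used = [], 0
for path in sorted(Path("src").rglob("*.py")):
    text = path.read_text(errors="ignore")
    est = len(text) // 4           # crude ~4 characters/token estimate
    if used + est > budget:
        break
    chunks.append(f"### {path}\n{text}")
    used += est
prompt = "\n\n".join(chunks) + "\n\nQuestion: where is the config loaded?"
```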
Running Qwen3.5-9B locally is highly accessible on modern consumer hardware. Because it is a 9B parameter model, it fits comfortably within the VRAM limits of mid-range and high-end GPUs.
VRAM consumption depends heavily on your choice of quantization; the quantization table above lists the per-format requirements.
When selecting the best GPU for Qwen3.5-9B, consider your context needs. While the model itself fits in 8GB of VRAM at 4-bit quantization, the 262K context window requires additional memory as the KV cache grows.
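As a rough sizing aid, you can estimate total VRAM as quantized weights plus KV cache. The layer and head counts in the sketch below are illustrative guesses, not the model's published configuration; read the real values from its config.json:

```python
# Back-of-envelope VRAM sizing: quantized weights plus KV cache. The
# layer/head counts are illustrative guesses, not Qwen3.5-9B's published
# config; substitute the values from the model's config.json.
PARAMS = 9e9

def weights_gb(bits_per_weight=4.5):  # ~4.5 bits/weight for Q4_K_M
    return PARAMS * bits_per_weight / 8 / 1024**3

def kv_cache_gb(ctx, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for keys and values; bytes_per=2 assumes an FP16/BF16 cache
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per / 1024**3

for ctx in (8_192, 32_768, 131_072, 262_144):
    total = weights_gb() + kv_cache_gb(ctx)
    print(f"{ctx:>7} tokens -> {total:5.1f} GB total")
```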
The fastest way to run Qwen3.5-9B locally is via Ollama. Once installed, you can pull the model with a single command:
ollama run qwen3.5:9b
For vision tasks or specific quantization levels, you can find GGUF or EXL2 weights on Hugging Face and load them into LM Studio or KoboldCPP.
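Once the Ollama server is running, you can also drive the model programmatically. A minimal sketch with the official ollama Python client (pip install ollama), reusing the qwen3.5:9b tag pulled above:

```python
# Sketch using the official ollama Python client against a local server,
# assuming the qwen3.5:9b tag pulled above.
import ollama

resp = ollama.chat(
    model="qwen3.5:9b",
    messages=[{"role": "user", "content": "Summarize GQA in one paragraph."}],
    options={"num_ctx": 32_768},  # raise toward 262144 if you have the VRAM
)
print(resp["message"]["content"])
```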
To understand Qwen3.5-9B performance, it is helpful to compare it against its primary rivals: Llama 3.1 8B and Mistral 7B v0.3.
The primary tradeoff when choosing Qwen3.5-9B over a smaller model (like a 3B parameter model) is its hardware requirements. While a 3B model can run on a phone or an integrated GPU, the 9B model requires a dedicated GPU or high-bandwidth unified memory to maintain acceptable tokens per second. However, for practitioners who want the strongest local model in the 9B class in 2025, Qwen3.5-9B represents the current state of the art.