
Compact multimodal model from the Qwen3.5 small series. 262K context, 201 languages. Thinking and non-thinking modes. Strong performance for its size class.
Copy and paste this command to start running the model locally:

```shell
ollama run qwen3.5:9b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 22.7 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 24.6 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 25.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 26.6 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 28.8 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 37.4 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Speed Tier | Throughput | VRAM Used |
|---|---|---|---|
| NVIDIA A100 SXM4 80GB | SS | 66.7 tok/s | 24.6 GB |
| NVIDIA H100 SXM5 80GB | SS | 109.6 tok/s | 24.6 GB |
| Google Cloud TPU v5p | SS | 90.5 tok/s | 24.6 GB |
| NVIDIA H200 SXM 141GB | AA | 157.1 tok/s | 24.6 GB |
| NVIDIA B200 | AA | 261.8 tok/s | 24.6 GB |
| NVIDIA L40S | AA | 28.3 tok/s | 24.6 GB |
Qwen3.5-9B is a dense, multimodal large language model developed by Alibaba Cloud. Positioned as the high-performance entry point of the Qwen3.5 small series, this 9-billion parameter model is engineered to bridge the gap between lightweight edge models and massive data-center-grade LLMs. With a 2025 training cutoff and an Apache 2.0 license, it provides a permissive and up-to-date foundation for local deployment.
The model occupies a competitive niche, directly challenging Meta’s Llama 3.1/3.2 8B and Mistral’s 7B series. Unlike many models in this size class that prioritize either text or vision, Qwen3.5-9B is natively multimodal, handling text, code, and visual inputs within a unified architecture. It is particularly notable for its dual-mode execution—offering both standard "non-thinking" inference and a dedicated "thinking" mode for complex reasoning tasks, a feature increasingly sought after for autonomous agent workflows.
Qwen3.5-9B utilizes a dense Transformer architecture. Unlike Mixture-of-Experts (MoE) models that activate only a fraction of their parameters during inference, this model is fully dense. This means all 9 billion parameters are utilized for every token generated, providing a high level of "knowledge density" per gigabyte of VRAM.
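Because every generated token must read all 9 billion weights, single-stream decode speed for a dense model is roughly memory-bandwidth-bound. A back-of-the-envelope sketch of that bound (the bandwidth and bytes-per-parameter figures below are illustrative assumptions, not measured values for any specific GPU):

```python
def dense_decode_tok_per_s(params_b: float, bytes_per_param: float,
                           bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode speed for a dense model:
    every generated token reads all weights from memory once."""
    weight_bytes_gb = params_b * bytes_per_param  # GB read per token
    return bandwidth_gb_s / weight_bytes_gb

# Hypothetical example: 9B params at ~4.5 bits/param (Q4_K_M is roughly
# 0.56 bytes/param) on a GPU with 1000 GB/s of memory bandwidth.
print(round(dense_decode_tok_per_s(9.0, 0.56, 1000.0), 1))  # ~198 tok/s
```

Real throughput lands below this ceiling once attention, KV-cache reads, and kernel overhead are accounted for, but the estimate explains why quantization level and memory bandwidth dominate the benchmark table above.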
One of the most significant technical advantages of Qwen3.5-9B is its massive context length of 262,144 tokens. For a 9B parameter model, this is an industry-leading specification. It allows practitioners to ingest entire codebases, long technical manuals, or multiple high-resolution images into the prompt without hitting context limits.
The model is trained on a vast corpus supporting 201 languages, making it one of the most capable multilingual models in the sub-10B category. The tokenizer is highly efficient for non-English scripts, which reduces the total token count for multilingual tasks and results in higher effective throughput compared to models with English-centric tokenizers.
The vision capabilities are integrated directly into the model, rather than being a "bolted-on" adapter. This allows for sophisticated reasoning across modalities—for example, explaining a complex architectural diagram or debugging code from a screenshot of a terminal error.
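With Ollama, multimodal models accept images as base64 strings in the `images` field of a `/api/generate` request. A minimal sketch of building such a request body, assuming `qwen3.5:9b` is served through that endpoint:

```python
import base64
import json

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build the JSON body for Ollama's /api/generate endpoint.
    Images are passed as base64-encoded strings in the `images` list."""
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(payload)

# Hypothetical usage: image_bytes would normally come from open(path, "rb").read()
body = build_vision_request("qwen3.5:9b",
                            "Explain this architecture diagram.",
                            b"\x89PNG-placeholder-bytes")
```

The resulting string is what you would POST to `http://localhost:11434/api/generate`.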
Qwen3.5-9B is a generalist model that excels in environments where hardware is constrained but performance cannot be sacrificed.
The "thinking" mode allows the model to perform internal chain-of-thought processing before delivering a final answer. In Qwen3.5-9B reasoning benchmarks, the model shows a marked improvement over standard 7B and 8B models in logical deduction and multi-step math problems. This makes it an ideal candidate for local agents that need to plan actions before executing them.
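Qwen's recent thinking-mode models conventionally wrap the internal reasoning in `<think>...</think>` tags ahead of the final answer. Assuming Qwen3.5-9B follows the same convention, a small helper to separate the two for agent pipelines:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate chain-of-thought (inside a leading <think>...</think> block)
    from the final answer. Returns (thought, answer); thought is empty when
    the model ran in non-thinking mode."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", text.strip()

thought, answer = split_thinking("<think>2 + 2 is 4.</think>The answer is 4.")
```

Logging `thought` separately keeps agent transcripts readable while still surfacing only `answer` to the user.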
For developers, this model functions as a highly capable local coding assistant. It supports a wide array of programming languages and benefits significantly from the 262K context window, which enables "repo-aware" chat where the model can reference multiple files simultaneously. It handles boilerplate generation, refactoring, and complex debugging tasks with a level of precision usually reserved for 30B+ parameter models.
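"Repo-aware" chat in practice just means packing several tagged files into one prompt and letting the long context do the rest. A minimal sketch, using a crude character budget as a stand-in for a real tokenizer (262K tokens is very roughly 1M characters of code):

```python
def build_repo_prompt(files: dict[str, str], question: str,
                      max_chars: int = 900_000) -> str:
    """Concatenate source files into one prompt, tagging each with its path
    so the model can reference files by name in its answer."""
    parts = [f"### File: {path}\n```\n{source}\n```"
             for path, source in files.items()]
    prompt = "\n\n".join(parts) + f"\n\nQuestion: {question}"
    if len(prompt) > max_chars:
        raise ValueError("prompt exceeds context budget; drop or summarize files")
    return prompt

prompt = build_repo_prompt({"app.py": "print('hi')"},
                           "Where is the entry point?")
```

For production use, swap the character budget for the model's actual tokenizer count.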
Running Qwen3.5-9B locally is highly accessible on modern consumer hardware. Because it is a 9B parameter model, it fits comfortably within the VRAM limits of mid-range and high-end GPUs.
VRAM consumption depends heavily on your choice of quantization. To run the model effectively, follow the guidelines in the quantization table above.
When selecting the best GPU for Qwen3.5-9B, consider your context needs. While the model itself fits in 8GB of VRAM at 4-bit quantization, the 262K context window requires additional memory as the KV cache grows.
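The KV-cache growth can be estimated from the standard formula: two tensors (K and V) per layer, each `tokens × kv_heads × head_dim` elements. The architecture figures below are illustrative assumptions for a ~9B GQA model, not published Qwen3.5-9B specs:

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB: K and V tensors per layer, each holding
    tokens x kv_heads x head_dim elements at the given precision."""
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
    return total_bytes / 2**30

# Assumed (hypothetical) architecture: 36 layers, 8 KV heads,
# head_dim 128, FP16 cache, at the full 262,144-token context.
print(round(kv_cache_gib(262_144, 36, 8, 128), 1))  # → 36.0 GiB
```

Under these assumptions the full 262K context adds tens of gigabytes on top of the weights, which is why long-context sessions favor cards well above the 8GB minimum, or a quantized KV cache.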
The fastest way to run Qwen3.5-9B locally is via Ollama. Once installed, you can pull the model with a single command:

```shell
ollama run qwen3.5:9b
```
For vision tasks or specific quantization levels, you can find GGUF or EXL2 weights on Hugging Face and load them into LM Studio or KoboldCPP.
To understand Qwen3.5-9B performance, it is helpful to compare it against its primary rivals: Llama 3.2 8B and Mistral 7B v0.3.
The primary tradeoff when choosing Qwen3.5-9B over a smaller model (such as a 3B parameter model) is its hardware requirements. While a 3B model can run on a phone or an integrated GPU, the 9B model requires a dedicated GPU or high-bandwidth unified memory to maintain acceptable tokens per second. However, for practitioners who need a capable 9B-parameter local model in 2025, Qwen3.5-9B represents the current state of the art for its size class.