
Compact multimodal model from the Qwen3.5 small series. 262K context, 201 languages. Thinking and non-thinking modes. Strong performance for its size class.
Copy and paste this command to start running the model locally:

```shell
ollama run qwen3.5:9b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 22.7 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 24.6 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 25.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 26.6 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 28.8 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 37.4 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Speed Tier | Throughput | VRAM Used |
|---|---|---|---|
| NVIDIA A100 SXM4 80GB | SS | 66.7 tok/s | 24.6 GB |
| NVIDIA H100 SXM5 80GB | SS | 109.6 tok/s | 24.6 GB |
| Google Cloud TPU v5p | SS | 90.5 tok/s | 24.6 GB |
| NVIDIA H200 SXM 141GB | AA | 157.1 tok/s | 24.6 GB |
| NVIDIA B200 | AA | 261.8 tok/s | 24.6 GB |
| NVIDIA L40S | AA | 28.3 tok/s | 24.6 GB |
Qwen3.5-9B is a dense, multimodal large language model developed by Alibaba Cloud. Positioned as the high-performance entry point of the Qwen3.5 small series, this 9-billion parameter model is engineered to bridge the gap between lightweight edge models and massive data-center-grade LLMs. With a 2025 training cutoff and an Apache 2.0 license, it provides a permissive and up-to-date foundation for local deployment.
The model occupies a competitive niche, directly challenging Meta’s Llama 3.1/3.2 8B and Mistral’s 7B series. Unlike many models in this size class that prioritize either text or vision, Qwen3.5-9B is natively multimodal, handling text, code, and visual inputs within a unified architecture. It is particularly notable for its dual-mode execution—offering both standard "non-thinking" inference and a dedicated "thinking" mode for complex reasoning tasks, a feature increasingly sought after for autonomous agent workflows.
Qwen3.5-9B utilizes a dense Transformer architecture. Unlike Mixture-of-Experts (MoE) models that activate only a fraction of their parameters during inference, this model is fully dense. This means all 9 billion parameters are utilized for every token generated, providing a high level of "knowledge density" per gigabyte of VRAM.
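Because every generated token must read all 9 billion weights, single-stream decode speed for a dense model is roughly memory-bandwidth-bound. A back-of-the-envelope sketch of that bound (the bandwidth and bytes-per-parameter figures below are illustrative assumptions, not measured values for any specific GPU):

```python
def dense_decode_tok_per_s(params_b: float, bytes_per_param: float,
                           bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode speed for a dense model:
    every generated token reads all weights from memory once."""
    weight_bytes_gb = params_b * bytes_per_param  # GB read per token
    return bandwidth_gb_s / weight_bytes_gb

# Hypothetical example: 9B params at ~4.5 bits/param (Q4_K_M is roughly
# 0.56 bytes/param) on a GPU with 1000 GB/s of memory bandwidth.
print(round(dense_decode_tok_per_s(9.0, 0.56, 1000.0), 1))  # ~198 tok/s
```

Real throughput lands below this ceiling once attention, KV-cache reads, and kernel overhead are accounted for, but the estimate explains why quantization level and memory bandwidth dominate the benchmark table above.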
One of the most significant technical advantages of Qwen3.5-9B is its massive context length of 262,144 tokens. For a 9B parameter model, this is an industry-leading specification. It allows practitioners to ingest entire codebases, long technical manuals, or multiple high-resolution images into the prompt without hitting context limits.
The model is trained on a vast corpus supporting 201 languages, making it one of the most capable multilingual models in the sub-10B category. The tokenizer is highly efficient for non-English scripts, which reduces the total token count for multilingual tasks and results in higher effective throughput compared to models with English-centric tokenizers.
The vision capabilities are integrated directly into the model, rather than being a "bolted-on" adapter. This allows for sophisticated reasoning across modalities—for example, explaining a complex architectural diagram or debugging code from a screenshot of a terminal error.
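With Ollama, multimodal models accept images as base64 strings in the `images` field of a `/api/generate` request. A minimal sketch of building such a request body, assuming `qwen3.5:9b` is served through that endpoint:

```python
import base64
import json

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build the JSON body for Ollama's /api/generate endpoint.
    Images are passed as base64-encoded strings in the `images` list."""
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(payload)

# Hypothetical usage: image_bytes would normally come from open(path, "rb").read()
body = build_vision_request("qwen3.5:9b",
                            "Explain this architecture diagram.",
                            b"\x89PNG-placeholder-bytes")
```

The resulting string is what you would POST to `http://localhost:11434/api/generate`.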
Qwen3.5-9B is a generalist model that excels in environments where hardware is constrained but performance cannot be sacrificed.
The "thinking" mode allows the model to perform internal chain-of-thought processing before delivering a final answer. In Qwen3.5-9B reasoning benchmarks, the model shows a marked improvement over standard 7B and 8B models in logical deduction and multi-step math problems. This makes it an ideal candidate for local agents that need to plan actions before executing them.
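Qwen's recent thinking-mode models conventionally wrap the internal reasoning in `<think>...</think>` tags ahead of the final answer. Assuming Qwen3.5-9B follows the same convention, a small helper to separate the two for agent pipelines:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate chain-of-thought (inside a leading <think>...</think> block)
    from the final answer. Returns (thought, answer); thought is empty when
    the model ran in non-thinking mode."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", text.strip()

thought, answer = split_thinking("<think>2 + 2 is 4.</think>The answer is 4.")
```

Logging `thought` separately keeps agent transcripts readable while still surfacing only `answer` to the user.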
For developers, this model functions as a highly capable local coding assistant. It supports a wide array of programming languages and benefits significantly from the 262K context window, which enables "repo-aware" chat where the model can reference multiple files simultaneously. It handles boilerplate generation, refactoring, and complex debugging tasks with a level of precision usually reserved for 30B+ parameter models.
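"Repo-aware" chat in practice just means packing several tagged files into one prompt and letting the long context do the rest. A minimal sketch, using a crude character budget as a stand-in for a real tokenizer (262K tokens is very roughly 1M characters of code):

```python
def build_repo_prompt(files: dict[str, str], question: str,
                      max_chars: int = 900_000) -> str:
    """Concatenate source files into one prompt, tagging each with its path
    so the model can reference files by name in its answer."""
    parts = [f"### File: {path}\n```\n{source}\n```"
             for path, source in files.items()]
    prompt = "\n\n".join(parts) + f"\n\nQuestion: {question}"
    if len(prompt) > max_chars:
        raise ValueError("prompt exceeds context budget; drop or summarize files")
    return prompt

prompt = build_repo_prompt({"app.py": "print('hi')"},
                           "Where is the entry point?")
```

For production use, swap the character budget for the model's actual tokenizer count.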
Running Qwen3.5-9B locally is highly accessible on modern consumer hardware. Because it is a 9B parameter model, it fits comfortably within the VRAM limits of mid-range and high-end GPUs.
VRAM consumption depends heavily on your choice of quantization. To run the model effectively, follow the guidelines in the quantization table above.
When selecting the best GPU for Qwen3.5-9B, consider your context needs. While the model itself fits in 8GB of VRAM at 4-bit quantization, the 262K context window requires additional memory as the KV cache grows.
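The KV-cache growth can be estimated from the standard formula: two tensors (K and V) per layer, each `tokens × kv_heads × head_dim` elements. The architecture figures below are illustrative assumptions for a ~9B GQA model, not published Qwen3.5-9B specs:

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB: K and V tensors per layer, each holding
    tokens x kv_heads x head_dim elements at the given precision."""
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
    return total_bytes / 2**30

# Assumed (hypothetical) architecture: 36 layers, 8 KV heads,
# head_dim 128, FP16 cache, at the full 262,144-token context.
print(round(kv_cache_gib(262_144, 36, 8, 128), 1))  # → 36.0 GiB
```

Under these assumptions the full 262K context adds tens of gigabytes on top of the weights, which is why long-context sessions favor cards well above the 8GB minimum, or a quantized KV cache.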
The fastest way to run Qwen3.5-9B locally is via Ollama. Once installed, you can pull the model with a single command:

```shell
ollama run qwen3.5:9b
```
For vision tasks or specific quantization levels, you can find GGUF or EXL2 weights on Hugging Face and load them into LM Studio or KoboldCPP.
To understand Qwen3.5-9B performance, it is helpful to compare it against its primary rivals: Llama 3.2 8B and Mistral 7B v0.3.
The primary tradeoff when choosing Qwen3.5-9B over a smaller model (such as a 3B parameter model) is its hardware requirements. While a 3B model can run on a phone or an integrated GPU, the 9B model requires a dedicated GPU or high-bandwidth unified memory to maintain acceptable tokens per second. However, for practitioners who need a capable 9B-parameter local model in 2025, Qwen3.5-9B represents the current state of the art for its size class.