
Dense 32.8B model with hybrid thinking modes. Surpasses o1 on LiveCodeBench. 131K context, 119 languages.
Copy and paste this command to start running the model locally:

```shell
ollama run qwen3:32b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 47.0 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 53.9 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 57.2 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 61.1 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 69.3 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 100.5 GB | Full | Full 16-bit floating point — maximum quality, largest size |
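The Q4_K_M figure above is well beyond the roughly 20 GB the quantized weights alone occupy, which suggests the table budgets for a full-context KV cache on top of the weights. A minimal sketch of that arithmetic, assuming Q4_K_M averages about 4.85 bits per weight and a 64-layer, 8-KV-head, 128-dim-head architecture (assumed figures for illustration, not official specifications), lands near the table's ~54 GB:

```python
# Rough VRAM arithmetic for a dense model: quantized weights plus an fp16
# KV cache at full context. The architecture numbers (64 layers, 8 KV heads,
# 128-dim heads) and ~4.85 bits/weight for Q4_K_M are assumptions for
# illustration, not official specifications.

def weights_gb(params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in decimal GB."""
    return params * bits_per_weight / 8 / 1e9

def kv_cache_gb(ctx: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """K and V tensors for every layer, every position, at fp16."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

weights = weights_gb(32.8e9, 4.85)
kv = kv_cache_gb(ctx=131_072, layers=64, kv_heads=8, head_dim=128)
print(f"weights ~{weights:.1f} GB + KV cache ~{kv:.1f} GB = ~{weights + kv:.1f} GB")
```

At shorter context lengths the KV cache term shrinks proportionally, which is why the same quant can fit far smaller GPUs in everyday use.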
See which devices can run this model and at what quality level.
| Device | Speed | VRAM Used |
|---|---|---|
| Google Cloud TPU v5p | 41.3 tok/s | 53.9 GB |
| NVIDIA H100 SXM5 80GB | 50.0 tok/s | 53.9 GB |
| NVIDIA H200 SXM 141GB | 71.7 tok/s | 53.9 GB |
| NVIDIA B200 | 119.4 tok/s | 53.9 GB |
| NVIDIA A100 SXM4 80GB | 30.4 tok/s | 53.9 GB |
Qwen3-32B is a dense, 32.8-billion parameter large language model developed by Alibaba Cloud. Positioned as a mid-sized powerhouse, it is designed to deliver high-tier reasoning and coding performance typically reserved for models twice its size. As the successor to the widely adopted Qwen2.5 series, Qwen3-32B introduces a hybrid thinking mode that allows it to toggle between standard fast inference and deeper "chain-of-thought" reasoning.
For practitioners, this model occupies a "Goldilocks" zone in the 2025 local AI landscape. It is small enough to run on high-end consumer hardware like the NVIDIA RTX 4090 with appropriate quantization, yet powerful enough to handle complex instruction-following and agentic workflows. In specific benchmarks like LiveCodeBench, Qwen3-32B has demonstrated performance that surpasses OpenAI’s o1-preview, making it a top-tier choice for developers who need a local AI model with 32.8B parameters that doesn't compromise on logic or math.
The architecture of Qwen3-32B is a standard dense transformer. Unlike Mixture-of-Experts (MoE) models that only activate a fraction of their parameters during inference, every one of the 32.8 billion parameters in Qwen3-32B is utilized for every token generated. This results in higher "reasoning density" and more consistent performance across diverse tasks, though it requires more VRAM than an MoE of the same active parameter count.
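The dense-versus-MoE distinction can be made concrete with a toy calculation. The MoE configuration below is entirely hypothetical, chosen only to illustrate why a sparse model of the same total size touches far fewer parameters per token:

```python
# Toy comparison of parameters touched per generated token. The MoE split
# (8 experts, top-2 routing, 25% always-active shared weights) is a
# hypothetical illustration, not a description of any real model.

def dense_active(total: float) -> float:
    # A dense transformer runs every weight for every token.
    return total

def moe_active(total: float, num_experts: int, top_k: int,
               shared_frac: float) -> float:
    # Attention and embeddings (shared_frac) always run; only top_k of the
    # expert FFNs are selected for each token.
    shared = shared_frac * total
    per_expert = (total - shared) / num_experts
    return shared + top_k * per_expert

total = 32.8e9
print(f"dense: {dense_active(total) / 1e9:.1f}B params active per token")
print(f"MoE  : {moe_active(total, 8, 2, 0.25) / 1e9:.1f}B params active per token")
```

The dense model's full activation is what the text calls "reasoning density": more compute per token, at the cost of needing the entire weight set in fast memory.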
Key technical specifications include:

- **Architecture:** dense transformer; all 32.8 billion parameters are active for every token
- **Context length:** 131,072 tokens
- **Thinking modes:** hybrid, toggling between fast standard inference and chain-of-thought reasoning
- **Languages:** 119 supported
The 131,072-token context length is a critical feature for local practitioners. It allows for the ingestion of entire codebases, long technical documents, or extensive chat histories without the immediate need for complex RAG (Retrieval-Augmented Generation) pipelines. The model utilizes RoPE (Rotary Positional Embedding) scaling to maintain coherence across this massive window, ensuring that Qwen3-32B performance remains stable even as the KV cache fills up.
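A simplified position-interpolation view of RoPE scaling can be sketched as follows. The RoPE base and head dimension used here are assumptions for illustration, and Qwen3's actual long-context method may differ in detail:

```python
# Simplified position-interpolation view of RoPE scaling: stretch the
# positions a model was trained on over a longer window by dividing the
# position index. Base and head_dim values are assumed for illustration.

def rope_angles(pos: int, head_dim: int = 128, base: float = 1_000_000.0,
                scale: float = 1.0) -> list[float]:
    """Rotation angle for each frequency pair at a given position."""
    return [(pos / scale) * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]

# With scale=4, position 131072 is rotated exactly like position 32768,
# keeping every angle inside the range the model already saw in training.
long_ctx = rope_angles(131_072, scale=4.0)
short_ctx = rope_angles(32_768, scale=1.0)
print(long_ctx == short_ctx)
```

The practical upshot is that attention keeps behaving as it did in training even deep into the extended window, which is what keeps quality stable as the KV cache fills.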
Qwen3-32B is a general-purpose model with a heavy lean toward technical and multilingual proficiency. It is trained on a diverse dataset spanning 119 languages, making it one of the most capable multilingual models available for local deployment.
The model's standout feature is its "thinking" capability. It excels at multi-step logic puzzles and complex mathematical proofs. In local environments, this makes it ideal for autonomous agents that need to plan, verify their own steps, and correct errors before delivering a final output.
For developers, this model is a significant upgrade over previous iterations. It supports dozens of programming languages and is particularly adept at code generation, debugging, and multi-step refactoring.
With support for 119 languages, Qwen3-32B is a primary candidate for translation tasks and localized content generation. It handles non-Latin scripts and low-resource languages with significantly lower perplexity than many Western-centric models of similar size.
To run Qwen3-32B locally, your primary constraint will be VRAM. Because this is a dense 32.8B model, the memory footprint is substantial compared to 7B or 14B models.
For most practitioners, the best quantization for Qwen3-32B is Q4_K_M (GGUF) or 4-bit EXL2. At 4-bit quantization the weights occupy roughly 20 GB, retaining the vast majority of the model's quality while fitting within the 24 GB VRAM of a consumer flagship GPU at moderate context lengths; the larger figures in the table above budget for the KV cache at the full 131K context. If you are using an Apple Silicon Mac with unified memory, Q6_K or Q8_0 is recommended for maximum precision, provided you have at least 48GB of total RAM.
The best GPU for Qwen3-32B in a single-card setup is the NVIDIA RTX 4090. It provides the necessary 24GB of VRAM and the high memory bandwidth required to maintain usable inference speeds.
On Apple hardware, an M2/M3/M4 Max with 64GB or more of Unified Memory is the ideal platform. This allows you to run higher-precision versions of the model while leaving enough overhead for the OS and the 131K context window's KV cache.
Inference speed in tokens per second will vary with your hardware and quantization; the device table above gives representative figures.
The quickest way to get started is using Ollama. Simply run `ollama run qwen3:32b` to download the default 4-bit quant and begin interacting with the model via the CLI or local web UIs like Open WebUI.
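Beyond the CLI, Ollama also serves a local HTTP API (default port 11434) that tools can call directly. The sketch below only builds and prints the request body; the `/no_think` soft switch for disabling Qwen3's thinking mode is a Qwen3 prompt convention and should be treated as an assumption for your setup:

```python
import json

# Build a request body for Ollama's local HTTP API (default port 11434).
# The "/no_think" tag is a Qwen3-specific soft switch for disabling the
# thinking mode; treat it as an assumption if your versions differ.
payload = {
    "model": "qwen3:32b",
    "prompt": "Summarize RoPE scaling in two sentences. /no_think",
    "stream": False,
}
body = json.dumps(payload)

# POST `body` to http://localhost:11434/api/generate, for example:
#   curl http://localhost:11434/api/generate -d @- <<< "$body"
print(body)
```

The same endpoint streams token-by-token responses when `"stream"` is `True`, which is what local web UIs typically use.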
When evaluating Qwen3-32B, it is most frequently compared to Mistral Small (22B/24B) and Llama 3.1 70B.
Mistral Small is more efficient in terms of VRAM, fitting easily into 16GB cards at 4-bit. However, Qwen3-32B significantly outperforms Mistral Small in coding and complex reasoning benchmarks. If your hardware can handle the extra 6-8GB of VRAM, Qwen3-32B is the superior choice for technical workloads.
Llama 3.1 70B is the industry standard for open-weight models, but it is much harder to run locally. A 70B model requires dual 3090/4090s even at 4-bit. Qwen3-32B manages to match or exceed Llama 3.1 70B in several reasoning and coding benchmarks while being half the size. For users with a single 24GB GPU, Qwen3-32B provides "70B-class" intelligence without the need for a multi-GPU rig.
The leap from Qwen2.5 to Qwen3 is primarily in the "thinking" architecture. While Qwen2.5 was an excellent generalist, Qwen3-32B is far more capable of handling long-form reasoning and "chain-of-thought" prompts. It is less prone to hallucinations in math and logic-heavy tasks, making it a more reliable partner for local development and agentic automation.