
Dense 32.8B model with hybrid thinking modes. Surpasses o1 on LiveCodeBench. 131K context, 119 languages.
Copy and paste this command to start running the model locally:

```shell
ollama run qwen3:32b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 47.0 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 53.9 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 57.2 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 61.1 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 69.3 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 100.5 GB | Full | Full 16-bit floating point — maximum quality, largest size |
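The Q4_K_M figure above is well beyond the roughly 20 GB the quantized weights alone occupy, which suggests the table budgets for a full-context KV cache on top of the weights. A minimal sketch of that arithmetic, assuming Q4_K_M averages about 4.85 bits per weight and a 64-layer, 8-KV-head, 128-dim-head architecture (assumed figures for illustration, not official specifications), lands near the table's ~54 GB:

```python
# Rough VRAM arithmetic for a dense model: quantized weights plus an fp16
# KV cache at full context. The architecture numbers (64 layers, 8 KV heads,
# 128-dim heads) and ~4.85 bits/weight for Q4_K_M are assumptions for
# illustration, not official specifications.

def weights_gb(params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in decimal GB."""
    return params * bits_per_weight / 8 / 1e9

def kv_cache_gb(ctx: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """K and V tensors for every layer, every position, at fp16."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

weights = weights_gb(32.8e9, 4.85)
kv = kv_cache_gb(ctx=131_072, layers=64, kv_heads=8, head_dim=128)
print(f"weights ~{weights:.1f} GB + KV cache ~{kv:.1f} GB = ~{weights + kv:.1f} GB")
```

At shorter context lengths the KV cache term shrinks proportionally, which is why the same quant can fit far smaller GPUs in everyday use.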
See which devices can run this model and at what quality level.
| Device | Speed | VRAM Used |
|---|---|---|
| Google Cloud TPU v5p | 41.3 tok/s | 53.9 GB |
| NVIDIA H100 SXM5 80GB | 50.0 tok/s | 53.9 GB |
| NVIDIA H200 SXM 141GB | 71.7 tok/s | 53.9 GB |
| NVIDIA B200 | 119.4 tok/s | 53.9 GB |
| NVIDIA A100 SXM4 80GB | 30.4 tok/s | 53.9 GB |
Qwen3-32B is a dense, 32.8-billion parameter large language model developed by Alibaba Cloud. Positioned as a mid-sized powerhouse, it is designed to deliver high-tier reasoning and coding performance typically reserved for models twice its size. As the successor to the widely adopted Qwen2.5 series, Qwen3-32B introduces a hybrid thinking mode that allows it to toggle between standard fast inference and deeper "chain-of-thought" reasoning.
For practitioners, this model occupies a "Goldilocks" zone in the 2025 local AI landscape. It is small enough to run on high-end consumer hardware like the NVIDIA RTX 4090 with appropriate quantization, yet powerful enough to handle complex instruction-following and agentic workflows. In specific benchmarks like LiveCodeBench, Qwen3-32B has demonstrated performance that surpasses OpenAI’s o1-preview, making it a top-tier choice for developers who need a local AI model with 32.8B parameters that doesn't compromise on logic or math.
The architecture of Qwen3-32B is a standard dense transformer. Unlike Mixture-of-Experts (MoE) models that only activate a fraction of their parameters during inference, every one of the 32.8 billion parameters in Qwen3-32B is utilized for every token generated. This results in higher "reasoning density" and more consistent performance across diverse tasks, though it requires more VRAM than an MoE of the same active parameter count.
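The dense-versus-MoE distinction can be made concrete with a toy calculation. The MoE configuration below is entirely hypothetical, chosen only to illustrate why a sparse model of the same total size touches far fewer parameters per token:

```python
# Toy comparison of parameters touched per generated token. The MoE split
# (8 experts, top-2 routing, 25% always-active shared weights) is a
# hypothetical illustration, not a description of any real model.

def dense_active(total: float) -> float:
    # A dense transformer runs every weight for every token.
    return total

def moe_active(total: float, num_experts: int, top_k: int,
               shared_frac: float) -> float:
    # Attention and embeddings (shared_frac) always run; only top_k of the
    # expert FFNs are selected for each token.
    shared = shared_frac * total
    per_expert = (total - shared) / num_experts
    return shared + top_k * per_expert

total = 32.8e9
print(f"dense: {dense_active(total) / 1e9:.1f}B params active per token")
print(f"MoE  : {moe_active(total, 8, 2, 0.25) / 1e9:.1f}B params active per token")
```

The dense model's full activation is what the text calls "reasoning density": more compute per token, at the cost of needing the entire weight set in fast memory.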
Key technical specifications include:

- **Architecture:** dense transformer; all 32.8 billion parameters are active for every token
- **Context length:** 131,072 tokens
- **Thinking modes:** hybrid, toggling between fast standard inference and chain-of-thought reasoning
- **Languages:** 119 supported
The 131,072-token context length is a critical feature for local practitioners. It allows for the ingestion of entire codebases, long technical documents, or extensive chat histories without the immediate need for complex RAG (Retrieval-Augmented Generation) pipelines. The model utilizes RoPE (Rotary Positional Embedding) scaling to maintain coherence across this massive window, ensuring that Qwen3-32B performance remains stable even as the KV cache fills up.
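A simplified position-interpolation view of RoPE scaling can be sketched as follows. The RoPE base and head dimension used here are assumptions for illustration, and Qwen3's actual long-context method may differ in detail:

```python
# Simplified position-interpolation view of RoPE scaling: stretch the
# positions a model was trained on over a longer window by dividing the
# position index. Base and head_dim values are assumed for illustration.

def rope_angles(pos: int, head_dim: int = 128, base: float = 1_000_000.0,
                scale: float = 1.0) -> list[float]:
    """Rotation angle for each frequency pair at a given position."""
    return [(pos / scale) * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]

# With scale=4, position 131072 is rotated exactly like position 32768,
# keeping every angle inside the range the model already saw in training.
long_ctx = rope_angles(131_072, scale=4.0)
short_ctx = rope_angles(32_768, scale=1.0)
print(long_ctx == short_ctx)
```

The practical upshot is that attention keeps behaving as it did in training even deep into the extended window, which is what keeps quality stable as the KV cache fills.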
Qwen3-32B is a general-purpose model with a heavy lean toward technical and multilingual proficiency. It is trained on a diverse dataset spanning 119 languages, making it one of the most capable multilingual models available for local deployment.
The model's standout feature is its "thinking" capability. It excels at multi-step logic puzzles and complex mathematical proofs. In local environments, this makes it ideal for autonomous agents that need to plan, verify their own steps, and correct errors before delivering a final output.
For developers, this model is a significant upgrade over previous iterations. It supports dozens of programming languages and is particularly adept at code generation, debugging, and multi-step refactoring.
With support for 119 languages, Qwen3-32B is a primary candidate for translation tasks and localized content generation. It handles non-Latin scripts and low-resource languages with significantly lower perplexity than many Western-centric models of similar size.
To run Qwen3-32B locally, your primary constraint will be VRAM. Because this is a dense 32.8B model, the memory footprint is substantial compared to 7B or 14B models.
For most practitioners, the best quantization for Qwen3-32B is Q4_K_M (GGUF) or 4-bit EXL2. At 4-bit quantization the weights occupy roughly 20 GB, retaining the vast majority of the model's quality while fitting within the 24 GB VRAM of a consumer flagship GPU at moderate context lengths; the larger figures in the table above budget for the KV cache at the full 131K context. If you are using an Apple Silicon Mac with unified memory, Q6_K or Q8_0 is recommended for maximum precision, provided you have at least 48GB of total RAM.
The best GPU for Qwen3-32B in a single-card setup is the NVIDIA RTX 4090. It provides the necessary 24GB of VRAM and the high memory bandwidth required to maintain usable inference speeds.
On Apple hardware, an M2/M3/M4 Max with 64GB or more of Unified Memory is the ideal platform. This allows you to run higher-precision versions of the model while leaving enough overhead for the OS and the 131K context window's KV cache.
Inference speed in tokens per second will vary with your hardware and quantization; the device table above gives representative figures.
The quickest way to get started is using Ollama. Simply run `ollama run qwen3:32b` to download the default 4-bit quant and begin interacting with the model via the CLI or local web UIs like Open WebUI.
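Beyond the CLI, Ollama also serves a local HTTP API (default port 11434) that tools can call directly. The sketch below only builds and prints the request body; the `/no_think` soft switch for disabling Qwen3's thinking mode is a Qwen3 prompt convention and should be treated as an assumption for your setup:

```python
import json

# Build a request body for Ollama's local HTTP API (default port 11434).
# The "/no_think" tag is a Qwen3-specific soft switch for disabling the
# thinking mode; treat it as an assumption if your versions differ.
payload = {
    "model": "qwen3:32b",
    "prompt": "Summarize RoPE scaling in two sentences. /no_think",
    "stream": False,
}
body = json.dumps(payload)

# POST `body` to http://localhost:11434/api/generate, for example:
#   curl http://localhost:11434/api/generate -d @- <<< "$body"
print(body)
```

The same endpoint streams token-by-token responses when `"stream"` is `True`, which is what local web UIs typically use.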
When evaluating Qwen3-32B, it is most frequently compared to Mistral Small (22B/24B) and Llama 3.1 70B.
Mistral Small is more efficient in terms of VRAM, fitting easily into 16GB cards at 4-bit. However, Qwen3-32B significantly outperforms Mistral Small in coding and complex reasoning benchmarks. If your hardware can handle the extra 6-8GB of VRAM, Qwen3-32B is the superior choice for technical workloads.
Llama 3.1 70B is the industry standard for open-weight models, but it is much harder to run locally. A 70B model requires dual 3090/4090s even at 4-bit. Qwen3-32B manages to match or exceed Llama 3.1 70B in several reasoning and coding benchmarks while being half the size. For users with a single 24GB GPU, Qwen3-32B provides "70B-class" intelligence without the need for a multi-GPU rig.
The leap from Qwen2.5 to Qwen3 is primarily in the "thinking" architecture. While Qwen2.5 was an excellent generalist, Qwen3-32B is far more capable of handling long-form reasoning and "chain-of-thought" prompts. It is less prone to hallucinations in math and logic-heavy tasks, making it a more reliable partner for local development and agentic automation.