
Alibaba's flagship Qwen3 MoE model with 235B total / 22B active parameters. Hybrid thinking/non-thinking modes. Trained on 36T tokens across 119 languages. Competes with DeepSeek-R1 and o1.
Copy and paste this command to start running the model locally:

```
ollama run qwen3:235b-a22b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 31.7 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 36.3 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 38.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 41.2 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 46.7 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 67.6 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Tier | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA A100 SXM4 80GB | SS | 45.2 tok/s | 36.3 GB |
| NVIDIA H100 SXM5 80GB | SS | 74.2 tok/s | 36.3 GB |
| Google Cloud TPU v5p | SS | 61.3 tok/s | 36.3 GB |
| NVIDIA H200 SXM 141GB | SS | 106.4 tok/s | 36.3 GB |
| NVIDIA B200 | SS | 177.3 tok/s | 36.3 GB |
| NVIDIA L40S | AA | 19.1 tok/s | 36.3 GB |
Qwen3-235B-A22B is Alibaba Cloud’s flagship Mixture of Experts (MoE) model, designed to compete directly with frontier-class reasoning models like DeepSeek-R1 and OpenAI’s o1. Released as a significant upgrade to the Qwen series, this model is built on a massive 36-trillion token dataset covering 119 languages. With 235 billion total parameters and 22 billion active parameters per token, it represents a strategic balance between high-capacity knowledge storage and computational efficiency during inference.
For local practitioners, Qwen3-235B-A22B is a heavyweight contender that requires substantial hardware but offers state-of-the-art performance in reasoning, mathematics, and multilingual instruction following. Unlike smaller dense models, it uses its MoE architecture to deliver the capability of a 200B+ parameter model while maintaining inference latency more typical of a 20B–30B parameter model, provided the weights can fit into accessible VRAM.
The core of the Qwen3-235B-A22B is its Mixture of Experts (MoE) architecture. In a standard dense model, every parameter is activated for every token generated. In this MoE implementation, only 22 billion of the 235 billion parameters are "active" for any given computation.
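To make the routing idea concrete, here is a toy top-k expert router: a gate scores every expert for a token, and only the top-k highest-scoring experts actually run. The sizes below are purely illustrative and are not Qwen3's real expert count or routing configuration.

```python
import random

random.seed(0)

N_EXPERTS, TOP_K = 8, 2  # illustrative sizes, not Qwen3's real config

def route(token_scores, k=TOP_K):
    """Return the indices of the k experts activated for this token."""
    ranked = sorted(range(len(token_scores)), key=token_scores.__getitem__)
    return ranked[-k:]  # only these experts run for this token

# Pretend gate scores for one token:
scores = [random.random() for _ in range(N_EXPERTS)]
active = route(scores)
# Only TOP_K of N_EXPERTS experts do any compute for this token,
# while all N_EXPERTS experts' weights must still be resident in memory.
```

The comment at the end is the key point for local inference: routing saves compute per token, not storage.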
When you run Qwen3-235B-A22B locally, it is vital to understand the distinction between storage and compute. While the MoE architecture allows for faster token generation (only 22B parameters are being crunched by the GPU for any given token), the entire 235B parameter set must still reside in VRAM. This means the hardware barrier to entry is determined by the total parameter count, while tokens per second (TPS) is governed by the active parameter count.
Qwen3-235B-A22B is not a general-purpose chatbot; it is a reasoning engine. It features a hybrid thinking/non-thinking mode, allowing users to toggle between standard responses and "Chain of Thought" (CoT) processing for complex tasks.
The model excels on reasoning benchmarks, particularly multi-step logic problems and high-level mathematics. For engineers, this translates to a tool capable of architectural planning and debugging complex system interactions that smaller models (like Llama 3.1 70B) often fail to grasp.
The model also delivers high-tier performance in software development.
Beyond technical utility, the model is one of the most capable multilingual models available for local use. It handles nuanced translation and creative writing in non-English languages with significantly higher fidelity than Western-centric models of similar size.
Running a model of this scale is a hardware-intensive endeavor. You cannot run this on a single consumer GPU without aggressive quantization that may degrade performance.
To determine the best GPU for Qwen3-235B-A22B, you must first select your quantization level. VRAM requirements can be roughly estimated as (parameters × bits per weight) / 8 bytes, multiplied by about 1.2 to account for context and overhead.
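That rule of thumb is easy to sketch in a few lines. The effective bits-per-weight values below are approximations (actual quantized file sizes vary by tool and format):

```python
def vram_gb(total_params: float, bits_per_weight: float,
            overhead: float = 1.2) -> float:
    """Rough VRAM estimate: (parameters * bits) / 8 bytes,
    plus ~20% overhead for context and runtime buffers."""
    return total_params * bits_per_weight / 8 * overhead / 1e9

# Approximate effective bit-widths (assumptions, not exact GGUF sizes):
for fmt, bits in [("Q4_K_M", 4.85), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"{fmt}: ~{vram_gb(235e9, bits):.0f} GB")
```

Running this for the full 235B parameter count shows why multi-GPU rigs or high-memory Macs are the realistic target for this model.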
Because only 22B parameters are active, tokens-per-second throughput is surprisingly high for a model of this size. On a quad-4090 setup using llama.cpp, you can expect 5–12 tokens per second depending on quantization and context usage. On Apple Silicon (M2 Ultra), speeds generally hover around 3–6 tokens per second.
The fastest way to run a 235B-class model on consumer or prosumer hardware is via Ollama. If you have the required VRAM, you can pull the model directly:
```
ollama run qwen3:235b
```
Ollama will automatically handle the MoE offloading, but ensure your system has enough swap space if you are pushing the limits of your physical VRAM.
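If you are close to the limits of your VRAM, one practical lever is capping the context window via a Modelfile. A minimal sketch (the `num_ctx` value and the `qwen3-8k` name are illustrative choices, not defaults):

```
FROM qwen3:235b
PARAMETER num_ctx 8192
```

Build and run the variant with `ollama create qwen3-8k -f Modelfile` followed by `ollama run qwen3-8k`. A smaller context window shrinks the KV cache, which can be the difference between fitting in VRAM and spilling to swap.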
When evaluating Qwen3-235B-A22B, it is most often compared to DeepSeek-R1 and Llama 3.1 405B.
DeepSeek-R1 (671B total / 37B active) is a larger model that generally outperforms Qwen3 in pure "thinking" tasks and PhD-level reasoning. However, Qwen3-235B-A22B is significantly easier to fit on prosumer hardware. While DeepSeek-R1 requires nearly 400GB of VRAM for a reasonable 4-bit quantization, Qwen3 fits into ~150GB, making it much more viable for local workstations.
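Those footprints can be sanity-checked with quick arithmetic, assuming roughly 4.85 effective bits per weight for a Q4_K_M-style quantization (an approximation; exact file sizes differ):

```python
def quant_footprint_gb(total_params: float, bits: float = 4.85) -> float:
    """Raw weight footprint in GB at a given effective bit-width
    (weights only, no context/KV-cache overhead)."""
    return total_params * bits / 8 / 1e9

print(f"DeepSeek-R1 (671B): ~{quant_footprint_gb(671e9):.0f} GB")
print(f"Qwen3-235B-A22B:    ~{quant_footprint_gb(235e9):.0f} GB")
```

The results line up with the figures above: roughly 400 GB for DeepSeek-R1's weights versus roughly 150 GB for Qwen3's, which is the gap that makes Qwen3 viable on a local workstation.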
Llama 3.1 405B is a dense model, meaning every parameter is used for every token. This makes Llama 405B significantly slower for local inference than Qwen3-235B-A22B. While Llama may have a slight edge in general English-language instruction following, Qwen3 wins on inference speed and multilingual capabilities.
You should choose this model if you need frontier-level reasoning and have a multi-GPU or high-RAM Mac setup, but cannot justify the massive 400GB+ VRAM requirement of models like DeepSeek-R1 or Llama 405B. It is the "middle ground" of the ultra-large model category—offering the power of a 200B+ model with the operational footprint of a much smaller one.