
Compact MoE: 30B total / 3B active. Comparable to the larger Qwen3-32B dense model with 10% of the active parameters.
Copy and paste this command to start running the model locally:

ollama run qwen3:30b-a3b
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 4.8 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 5.4 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 5.7 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 6.0 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 6.8 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 9.6 GB | Full | Full 16-bit floating point: maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Speed | VRAM Used |
|---|---|---|
| NVIDIA GeForce RTX 4060 | 40.7 tok/s | 5.4 GB |
| Intel Arc B580 | 68.2 tok/s | 5.4 GB |
| NVIDIA GeForce RTX 4070 | 75.3 tok/s | 5.4 GB |
| Intel Arc A770 16GB | 83.7 tok/s | 5.4 GB |
| NVIDIA GeForce RTX 5070 | 100.4 tok/s | 5.4 GB |
| Google Cloud TPU v5e | 122.4 tok/s | 5.4 GB |
Qwen3-30B-A3B is a Mixture of Experts (MoE) large language model released by Alibaba Cloud’s Qwen team. It represents a significant shift toward inference efficiency for local practitioners, offering a 30B parameter knowledge base with the computational overhead of a much smaller model. By activating only 3 billion parameters per token, the model achieves a high throughput-to-intelligence ratio, making it a primary candidate for developers who need more reasoning capability than an 8B model provides but lack the multi-GPU clusters required for dense 70B+ models.
Released in 2025, Qwen3-30B-A3B is designed to compete directly with dense models like Qwen3-32B and Mistral Small. While its dense counterparts require full parameter computation for every generation, its MoE design allows it to maintain comparable performance on logic and coding benchmarks while significantly reducing latency per token. This makes it particularly attractive for real-time applications such as local agents, interactive coding assistants, and complex function-calling workflows.
The core of Qwen3-30B-A3B is its Mixture of Experts architecture. In a traditional dense model, every parameter is utilized for every token generated. In this MoE configuration, the model consists of 30 billion total parameters, but a gating mechanism routes each token to only a specific subset of "experts," resulting in only 3 billion active parameters per inference step.
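The gating step described above can be sketched in a few lines. This is a toy illustration of generic top-k expert routing, not Qwen's actual router; the dimensions, expert count, and function names are invented for the example.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token embedding to its top-k experts and mix their outputs.

    x:       (d,) token embedding
    gate_w:  (n_experts, d) gating weights
    experts: list of callables, one per expert
    """
    logits = gate_w @ x                       # score every expert for this token
    top = np.argsort(logits)[-top_k:]         # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the selected experts only
    # Only the top_k experts execute; the remaining experts contribute no compute,
    # which is the source of the active-vs-total parameter gap.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(n_experts, d))
# Each "expert" here is just a distinct linear map, captured via a default arg
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n_experts)]
y = moe_forward(rng.normal(size=d), gate_w, experts)
print(y.shape)  # (8,)
```

With `top_k=2` of 4 experts, only half the expert weights are touched per token, even though all of them must exist in memory.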
For the local developer, this architecture creates a unique hardware profile. VRAM requirements are dictated by the total 30B parameter count (the entire model must reside in memory), but generation speed is closer to that of a 3B parameter model. This decoupling of memory capacity from compute intensity is the model's primary technical advantage.
Key technical specifications include:

- 30 billion total parameters, with roughly 3 billion active per token
- Mixture of Experts routing via a learned gating mechanism
- 131k-token context window
The 131k context window is a critical feature for local RAG (Retrieval-Augmented Generation) applications. It allows practitioners to ingest entire codebases or long technical documents without the aggressive "lost in the middle" phenomena seen in smaller context models.
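Before stuffing a codebase into the prompt, it is worth checking whether it fits the window at all. The sketch below uses a crude characters-per-token heuristic (roughly 4 for English and code); real token counts depend on the tokenizer, so treat the ratio as an assumption.

```python
def fits_context(docs, context_tokens=131_000, chars_per_token=4):
    """Crude pre-flight check: will these documents fit in the context window?

    chars_per_token ~= 4 is a rough heuristic, not a tokenizer measurement.
    Returns (estimated_tokens, budget_tokens, fits).
    """
    est_tokens = sum(len(d) for d in docs) // chars_per_token
    return est_tokens, context_tokens, est_tokens <= context_tokens

# Example: two synthetic "files" standing in for a small codebase
files = ["def main(): ...\n" * 200, "# README\n" * 500]
est, budget, ok = fits_context(files)
print(ok)  # True
```

For inputs that exceed the budget, the usual fallback is chunked retrieval rather than truncation.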
Qwen3-30B-A3B is a general-purpose model with a heavy emphasis on technical reasoning. Unlike many smaller models that struggle with multi-step logic, this model performs exceptionally well in the following areas:
Coding is one of the most common use cases for this model. It supports over 30 programming languages and excels at boilerplate generation, refactoring, and identifying logical errors in complex scripts. Because it was trained on a 2024 dataset, it has a better grasp of modern frameworks and API versions than older 30B-class models.
The model is fine-tuned for high-reliability tool use. In local agent setups, it can accurately format JSON and decide when to call external functions. Its ability to follow system prompts strictly makes it a reliable "brain" for local automation tasks where a 7B or 8B model might hallucinate the tool syntax.
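A minimal sketch of the plumbing involved: the schema below uses the OpenAI-style function-calling format that clients such as the Ollama Python library accept, and the `response` object is a hand-written stand-in for a model reply, not live output.

```python
# Tool schema in the OpenAI-style function-calling format
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Hand-written stand-in for an assistant message containing a tool call
response = {
    "message": {
        "role": "assistant",
        "tool_calls": [
            {"function": {"name": "get_weather",
                          "arguments": {"city": "Berlin"}}},
        ],
    }
}

# Dispatch: read out the tool name and arguments the model produced
dispatched = [(c["function"]["name"], c["function"]["arguments"])
              for c in response["message"].get("tool_calls", [])]
print(dispatched)  # [('get_weather', {'city': 'Berlin'})]
```

The reliability claim above is about exactly this step: a model that formats `tool_calls` consistently lets the dispatch loop stay this simple.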
Alibaba's Qwen series consistently leads in multilingual benchmarks. Qwen3-30B-A3B handles non-English languages—particularly CJK (Chinese, Japanese, Korean) and European languages—with a level of nuance usually reserved for much larger models. On reasoning benchmarks, it shows strong performance in mathematical problem solving, making it suitable for local data analysis and STEM-related tutoring applications.
To run Qwen3-30B-A3B locally, your primary bottleneck will be VRAM. Because MoE models require all experts to be loaded into memory to prevent massive latency hits from disk swapping, you must have enough VRAM to hold the 30B parameters.
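A back-of-the-envelope estimate makes this concrete. The sketch below assumes a flat overhead for KV cache and runtime buffers, which in practice grows with context length, so treat the numbers as rough.

```python
def vram_estimate_gb(total_params_billions, bits_per_weight, overhead_gb=1.0):
    """Rough VRAM needed to hold the weights, plus a flat overhead
    for KV cache and runtime buffers (a crude assumption)."""
    weights_gb = total_params_billions * bits_per_weight / 8  # billions * bytes/param
    return weights_gb + overhead_gb

# All 30B parameters must be resident, even though only 3B are active per token.
# Assuming ~4.5 effective bits/weight for a 4-bit K-quant:
print(round(vram_estimate_gb(30, 4.5), 1))  # 17.9
```

That figure is why the prose below points at 24GB cards: the weights alone occupy most of the card, leaving the remainder for context.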
The best GPU for Qwen3-30B-A3B is an NVIDIA RTX 3090 or 4090 with 24GB of VRAM. These cards allow you to run the model at 4-bit or 5-bit quantization with enough headroom for a decent context buffer. For Mac users, an M2/M3/M4 Max or Ultra with at least 32GB of Unified Memory provides a seamless experience.
The best quantization for Qwen3-30B-A3B for most practitioners is Q4_K_M (GGUF) or 4-bit (EXL2/AWQ).
On an RTX 4090 using the Q4_K_M quantization via Ollama or llama.cpp, you can expect throughput in the range of 40-60 tokens per second. This is significantly faster than a dense 30B model, which would typically hover around 15-25 t/s on the same hardware.
If you are using a 16GB VRAM card (like an RTX 4080 or 4070 Ti Super), you will need to use a lower quantization like IQ3_M or Q3_K_L. While this will fit, you may notice a slight degradation in complex reasoning. For these cards, it is often better to use a smaller dense model or accept the speed penalty of partial CPU offloading.
The quickest way to get started is via Ollama. Simply run:
ollama run qwen3:30b-a3b
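For programmatic use, the same model can be queried over Ollama's local REST API, which listens on port 11434 by default. The snippet below is a sketch: the actual network call is commented out because it requires a running `ollama serve` with the model pulled.

```python
import json
from urllib import request

# Request body for Ollama's /api/generate endpoint
payload = {
    "model": "qwen3:30b-a3b",
    "prompt": "Explain mixture-of-experts routing in two sentences.",
    "stream": False,  # return one JSON object instead of a token stream
}

req = request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Requires a running `ollama serve`; the generated text arrives
# under the "response" key of the returned JSON:
# with request.urlopen(req) as r:
#     print(json.loads(r.read())["response"])
```

The same endpoint accepts generation options (temperature, context size, etc.) via an `"options"` field if you need to tune output behavior.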
When deciding whether to deploy this model, it is helpful to look at Qwen3-30B-A3B vs Mistral Small and its dense sibling, Qwen3-32B.
Mistral Small is a highly optimized dense model. While Mistral Small often exhibits slightly better English-language creative writing, Qwen3-30B-A3B typically wins in coding and math benchmarks. Furthermore, the MoE architecture of the Qwen model allows for higher throughput on mid-range hardware compared to the dense Mistral Small.
The dense 32B model is essentially the "full weight" version of this intelligence class.
For local practitioners, the choice usually comes down to the "latency tax." If you are running an interactive chat or an agent that needs to make many quick decisions, Qwen3-30B-A3B is the superior choice. If you are performing offline batch processing where speed doesn't matter, the dense Qwen3-32B may offer slightly more stability in its outputs.
In the current landscape of local AI, Qwen3-30B-A3B stands out as a highly pragmatic middle ground. It provides the "large model" feel of high-parameter reasoning without the "large model" wait times, provided you have the 24GB of VRAM required to house its expert library.