
Compact MoE: 30B total / 3B active. Comparable to the larger Qwen3-32B dense model with 10% of the active parameters.
Copy and paste this command to start running the model locally:

ollama run qwen3:30b-a3b
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 4.8 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 5.4 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 5.7 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 6.0 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 6.8 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 9.6 GB | Full | Full 16-bit floating point: maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Speed | VRAM Used |
|---|---|---|
| NVIDIA GeForce RTX 4060 | 40.7 tok/s | 5.4 GB |
| Intel Arc B580 | 68.2 tok/s | 5.4 GB |
| NVIDIA GeForce RTX 4070 | 75.3 tok/s | 5.4 GB |
| Intel Arc A770 16GB | 83.7 tok/s | 5.4 GB |
| NVIDIA GeForce RTX 5070 | 100.4 tok/s | 5.4 GB |
| Google Cloud TPU v5e | 122.4 tok/s | 5.4 GB |
Qwen3-30B-A3B is a Mixture of Experts (MoE) large language model released by Alibaba Cloud’s Qwen team. It represents a significant shift toward inference efficiency for local practitioners, offering a 30B parameter knowledge base with the computational overhead of a much smaller model. By activating only 3 billion parameters per token, the model achieves a high throughput-to-intelligence ratio, making it a primary candidate for developers who need more reasoning capability than an 8B model provides but lack the multi-GPU clusters required for dense 70B+ models.
Released in 2025, Qwen3-30B-A3B is designed to compete directly with dense models like Qwen3-32B and Mistral Small. While its dense counterparts require full parameter computation for every generation, its MoE design allows it to maintain comparable performance on logic and coding benchmarks while significantly reducing latency per token. This makes it particularly attractive for real-time applications such as local agents, interactive coding assistants, and complex function-calling workflows.
The core of Qwen3-30B-A3B is its Mixture of Experts architecture. In a traditional dense model, every parameter is utilized for every token generated. In this MoE configuration, the model consists of 30 billion total parameters, but a gating mechanism routes each token to only a specific subset of "experts," resulting in only 3 billion active parameters per inference step.
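The gating step described above can be sketched in a few lines. This is a toy illustration of generic top-k expert routing, not Qwen's actual router; the dimensions, expert count, and function names are invented for the example.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token embedding to its top-k experts and mix their outputs.

    x:       (d,) token embedding
    gate_w:  (n_experts, d) gating weights
    experts: list of callables, one per expert
    """
    logits = gate_w @ x                       # score every expert for this token
    top = np.argsort(logits)[-top_k:]         # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the selected experts only
    # Only the top_k experts execute; the remaining experts contribute no compute,
    # which is the source of the active-vs-total parameter gap.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(n_experts, d))
# Each "expert" here is just a distinct linear map, captured via a default arg
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n_experts)]
y = moe_forward(rng.normal(size=d), gate_w, experts)
print(y.shape)  # (8,)
```

With `top_k=2` of 4 experts, only half the expert weights are touched per token, even though all of them must exist in memory.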
For the local developer, this architecture creates a unique hardware profile. VRAM requirements are dictated by the total 30B parameter count (the entire model must reside in memory), but generation speed is closer to that of a 3B parameter model. This decoupling of memory capacity from compute intensity is the model's primary technical advantage.
Key technical specifications include:

- 30 billion total parameters, with roughly 3 billion active per token
- Mixture of Experts routing via a learned gating mechanism
- 131k-token context window
The 131k context window is a critical feature for local RAG (Retrieval-Augmented Generation) applications. It allows practitioners to ingest entire codebases or long technical documents without the aggressive "lost in the middle" phenomena seen in smaller context models.
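Before stuffing a codebase into the prompt, it is worth checking whether it fits the window at all. The sketch below uses a crude characters-per-token heuristic (roughly 4 for English and code); real token counts depend on the tokenizer, so treat the ratio as an assumption.

```python
def fits_context(docs, context_tokens=131_000, chars_per_token=4):
    """Crude pre-flight check: will these documents fit in the context window?

    chars_per_token ~= 4 is a rough heuristic, not a tokenizer measurement.
    Returns (estimated_tokens, budget_tokens, fits).
    """
    est_tokens = sum(len(d) for d in docs) // chars_per_token
    return est_tokens, context_tokens, est_tokens <= context_tokens

# Example: two synthetic "files" standing in for a small codebase
files = ["def main(): ...\n" * 200, "# README\n" * 500]
est, budget, ok = fits_context(files)
print(ok)  # True
```

For inputs that exceed the budget, the usual fallback is chunked retrieval rather than truncation.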
Qwen3-30B-A3B is a general-purpose model with a heavy emphasis on technical reasoning. Unlike many smaller models that struggle with multi-step logic, this model performs exceptionally well in the following areas:
Coding is one of the most common use cases for this model. It supports over 30 programming languages and excels at boilerplate generation, refactoring, and identifying logical errors in complex scripts. Because it was trained on a 2024 dataset, it has a better grasp of modern frameworks and API versions than older 30B-class models.
The model is fine-tuned for high-reliability tool use. In local agent setups, it can accurately format JSON and decide when to call external functions. Its ability to follow system prompts strictly makes it a reliable "brain" for local automation tasks where a 7B or 8B model might hallucinate the tool syntax.
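A minimal sketch of the plumbing involved: the schema below uses the OpenAI-style function-calling format that clients such as the Ollama Python library accept, and the `response` object is a hand-written stand-in for a model reply, not live output.

```python
# Tool schema in the OpenAI-style function-calling format
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Hand-written stand-in for an assistant message containing a tool call
response = {
    "message": {
        "role": "assistant",
        "tool_calls": [
            {"function": {"name": "get_weather",
                          "arguments": {"city": "Berlin"}}},
        ],
    }
}

# Dispatch: read out the tool name and arguments the model produced
dispatched = [(c["function"]["name"], c["function"]["arguments"])
              for c in response["message"].get("tool_calls", [])]
print(dispatched)  # [('get_weather', {'city': 'Berlin'})]
```

The reliability claim above is about exactly this step: a model that formats `tool_calls` consistently lets the dispatch loop stay this simple.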
Alibaba's Qwen series consistently leads in multilingual benchmarks. Qwen3-30B-A3B handles non-English languages—particularly CJK (Chinese, Japanese, Korean) and European languages—with a level of nuance usually reserved for much larger models. On reasoning benchmarks, it shows strong performance in mathematical problem solving, making it suitable for local data analysis and STEM-related tutoring applications.
To run Qwen3-30B-A3B locally, your primary bottleneck will be VRAM. Because MoE models require all experts to be loaded into memory to prevent massive latency hits from disk swapping, you must have enough VRAM to hold the 30B parameters.
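A back-of-the-envelope estimate makes this concrete. The sketch below assumes a flat overhead for KV cache and runtime buffers, which in practice grows with context length, so treat the numbers as rough.

```python
def vram_estimate_gb(total_params_billions, bits_per_weight, overhead_gb=1.0):
    """Rough VRAM needed to hold the weights, plus a flat overhead
    for KV cache and runtime buffers (a crude assumption)."""
    weights_gb = total_params_billions * bits_per_weight / 8  # billions * bytes/param
    return weights_gb + overhead_gb

# All 30B parameters must be resident, even though only 3B are active per token.
# Assuming ~4.5 effective bits/weight for a 4-bit K-quant:
print(round(vram_estimate_gb(30, 4.5), 1))  # 17.9
```

That figure is why the prose below points at 24GB cards: the weights alone occupy most of the card, leaving the remainder for context.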
The best GPU for Qwen3-30B-A3B is an NVIDIA RTX 3090 or 4090 with 24GB of VRAM. These cards allow you to run the model at 4-bit or 5-bit quantization with enough headroom for a decent context buffer. For Mac users, an M2/M3/M4 Max or Ultra with at least 32GB of Unified Memory provides a seamless experience.
The best quantization for Qwen3-30B-A3B for most practitioners is Q4_K_M (GGUF) or 4-bit (EXL2/AWQ).
On an RTX 4090 using the Q4_K_M quantization via Ollama or llama.cpp, you can expect throughput in the range of 40-60 tokens per second. This is significantly faster than a dense 30B model, which would typically hover around 15-25 t/s on the same hardware.
If you are using a 16GB VRAM card (like an RTX 4080 or 4070 Ti Super), you will need to use a lower quantization like IQ3_M or Q3_K_L. While this will fit, you may notice a slight degradation in complex reasoning. For these cards, it is often better to use a smaller dense model or accept the speed penalty of partial CPU offloading.
The quickest way to get started is via Ollama. Simply run:
ollama run qwen3:30b-a3b
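For programmatic use, the same model can be queried over Ollama's local REST API, which listens on port 11434 by default. The snippet below is a sketch: the actual network call is commented out because it requires a running `ollama serve` with the model pulled.

```python
import json
from urllib import request

# Request body for Ollama's /api/generate endpoint
payload = {
    "model": "qwen3:30b-a3b",
    "prompt": "Explain mixture-of-experts routing in two sentences.",
    "stream": False,  # return one JSON object instead of a token stream
}

req = request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Requires a running `ollama serve`; the generated text arrives
# under the "response" key of the returned JSON:
# with request.urlopen(req) as r:
#     print(json.loads(r.read())["response"])
```

The same endpoint accepts generation options (temperature, context size, etc.) via an `"options"` field if you need to tune output behavior.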
When deciding whether to deploy this model, it is helpful to look at Qwen3-30B-A3B vs Mistral Small and its dense sibling, Qwen3-32B.
Mistral Small is a highly optimized dense model. While Mistral Small often exhibits slightly better English-language creative writing, Qwen3-30B-A3B typically wins in coding and math benchmarks. Furthermore, the MoE architecture of the Qwen model allows for higher throughput on mid-range hardware compared to the dense Mistral Small.
The dense 32B model is essentially the "full weight" version of this intelligence class.
For local practitioners, the choice usually comes down to the "latency tax." If you are running an interactive chat or an agent that needs to make many quick decisions, Qwen3-30B-A3B is the superior choice. If you are performing offline batch processing where speed doesn't matter, the dense Qwen3-32B may offer slightly more stability in its outputs.
In the current landscape of local AI, Qwen3-30B-A3B stands out as a highly pragmatic middle ground. It provides the "large model" feel of high-parameter reasoning without the "large model" wait times, provided you have the 24GB of VRAM required to house its expert library.