
Alibaba's large-medium MoE model with 122B total / 10B active parameters. Leads on agentic benchmarks including BFCL-V4 and BrowseComp. Natively multimodal with 262K context.
Copy and paste this command to start running the model locally:

ollama run qwen3.5:122b
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | ~45 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | ~78 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | ~83 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | ~100 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | ~130 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | ~245 GB | Full | Full 16-bit floating point: maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Grade | Speed |
| :--- | :--- | :--- |
| NVIDIA A100 SXM4 80GB | SS | 60.2 tok/s |
| NVIDIA H100 SXM5 80GB | SS | 98.9 tok/s |
| Google Cloud TPU v5p | SS | 81.6 tok/s |
| NVIDIA H200 SXM 141GB | SS | 141.7 tok/s |
| NVIDIA B200 | SS | 236.1 tok/s |
| NVIDIA L40S | AA | 25.5 tok/s |
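The spread in these speeds tracks memory bandwidth more than raw compute. A rough roofline sketch gives a theoretical per-token ceiling; the bandwidth figures are the vendors' published specs, while the 4.84 bits/weight is a typical Q4_K_M average (an assumption), and real deployments land well below the ceiling:

```python
# Back-of-envelope decode ceiling: each generated token must read every
# active parameter from memory once, so speed is bounded by bandwidth.
# 4.84 bits/weight is a typical Q4_K_M average (assumption).

ACTIVE_PARAMS = 10e9                                  # "A10B": 10B active
BITS_PER_WEIGHT = 4.84
active_bytes = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8    # ~6.1 GB read per token

# Published HBM bandwidths in TB/s
for gpu, tb_s in {"A100": 2.04, "H100": 3.35, "H200": 4.8, "B200": 8.0}.items():
    ceiling = tb_s * 1e12 / active_bytes
    print(f"{gpu}: <= {ceiling:,.0f} tok/s theoretical ceiling")
```

The gap between this ceiling and measured throughput comes from attention, KV-cache reads, kernel overhead, and imperfect bandwidth utilization.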
Qwen3.5-122B-A10B is Alibaba Cloud’s flagship Mixture of Experts (MoE) model designed to bridge the gap between mid-sized dense models and massive frontier models. With 122 billion total parameters and 10 billion active parameters per token, it strikes a specific balance: the reasoning depth of a 100B+ parameter model with the inference speed of a much smaller 10B parameter architecture.
Released in early 2025, this model is specifically optimized for agentic workflows. It currently leads on key industry benchmarks including BFCL-V4 (Berkeley Function Calling Leaderboard) and BrowseComp, making it a primary choice for developers building autonomous agents that need to navigate the web or interact with complex APIs. Unlike many of its predecessors, Qwen3.5-122B-A10B is natively multimodal and features a massive 262,144-token context window, allowing for the ingestion of entire codebases or lengthy technical documents in a single prompt.
For practitioners looking to run Qwen3.5-122B-A10B locally, the primary hurdle is VRAM capacity rather than raw compute power. Because it is an MoE model, it requires enough memory to house all 122B parameters, but it executes with the efficiency of a much smaller model, resulting in high tokens-per-second throughput even on prosumer hardware.
The "A10B" in the model name signifies that only 10 billion parameters are activated during the forward pass for any given token. This MoE efficiency is what makes the model viable for local deployment. While a dense 122B model would be agonizingly slow on anything but a multi-H100 cluster, the MoE architecture allows the model to "route" queries to the most relevant specialized sub-networks (experts).
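The routing step can be sketched in a few lines. This is a generic top-k softmax router in the style common to MoE transformers, not Qwen's actual implementation; the expert count, gating details, and layer layout here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    logits = gate_w @ x                          # router score per expert
    top = np.argsort(logits)[-top_k:]            # indices of best-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # softmax over the chosen experts
    # Only top_k expert networks execute, so per-token compute tracks
    # the active parameter count, not the total.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# 8 toy "experts" (plain linear maps here; a real model uses gated FFNs)
experts = [lambda v, W=rng.standard_normal((4, 4)): W @ v for _ in range(8)]
gate_w = rng.standard_normal((8, 4))
out = moe_forward(rng.standard_normal(4), gate_w, experts)
print(out.shape)  # (4,)
```

Because the gate weights sum to 1, the output is a convex combination of the selected experts' outputs, which keeps activations well-scaled regardless of which experts fire.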
When evaluating Qwen3.5-122B-A10B hardware requirements, you must distinguish between memory capacity and memory bandwidth: capacity determines whether all 122B parameters fit in memory at all, while bandwidth governs generation speed, since only the ~10B active parameters are read for each token.
The 262k context length is supported by an advanced RoPE (Rotary Positional Embedding) scaling implementation, ensuring that the model maintains coherence even at the tail end of a massive prompt. Being natively multimodal, the model does not rely on a separate vision encoder "bolted on" to the LLM; the vision and text capabilities are interleaved, which improves performance in tasks like OCR, chart reasoning, and visual document understanding.
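A minimal sketch of rotary embeddings and interpolation-style position scaling follows. The exact long-context scheme Qwen uses is not specified here, so the scale factor and dimensions are purely illustrative:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate channel pairs of x (seq, dim) by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    ang = positions[:, None] * inv_freq[None, :]   # (seq, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(1).standard_normal((5, 8))
pos = np.arange(5, dtype=float)
scale = 8.0                        # hypothetical context-extension factor
q_scaled = rope(q, pos / scale)    # position interpolation: compress positions
```

Since RoPE is a pure rotation, it preserves vector norms; scaling schemes only remap the position axis so that trained rotation frequencies cover a longer sequence.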
Qwen3.5-122B-A10B is not a general-purpose "chatbot" in the traditional sense; it is a high-reasoning engine built for production-grade tasks.
On the Qwen3.5-122B-A10B reasoning benchmark suites, the model excels at multi-step problem solving. This makes it ideal for:

- Long-context document analysis and summarization
- Code generation and debugging across large codebases
- Planning loops in autonomous agent frameworks
This is the model's standout feature. Because it leads on BFCL-V4, it is highly reliable at:

- Selecting the correct tool from a large catalog of available functions
- Emitting well-formed, schema-conformant JSON arguments
- Chaining parallel and multi-step function calls without losing state
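As a concrete example, Ollama's `/api/chat` endpoint accepts OpenAI-style tool schemas. The `get_weather` tool below is hypothetical, and the response handling assumes a local Ollama server is running:

```python
import json
import urllib.request
import urllib.error

# OpenAI-style tool schema as accepted by Ollama's /api/chat endpoint.
# The "get_weather" tool is a hypothetical example.
payload = {
    "model": "qwen3.5:122b",
    "messages": [{"role": "user", "content": "What's the weather in Hangzhou?"}],
    "stream": False,
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=120) as resp:
        message = json.load(resp)["message"]
    # A tool-capable model returns structured calls instead of free text
    for call in message.get("tool_calls", []):
        print(call["function"]["name"], call["function"]["arguments"])
except urllib.error.URLError:
    print("Ollama is not reachable on localhost:11434")
```

Your application executes the requested function, appends its result as a `tool` role message, and calls the endpoint again so the model can compose the final answer.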
Alibaba has trained the Qwen series on a massive corpus of non-English data. It remains one of the best models for Chinese, Japanese, Korean, and various European languages, maintaining high nuance and cultural context that Western-centric models often miss.
To run Qwen3.5-122B-A10B locally, your primary concern is the 122B parameter count. Even though it's efficient to run, it is "heavy" to store.
The Qwen3.5-122B-A10B VRAM requirements vary wildly based on your choice of quantization.
| Precision | VRAM Required | Recommended Hardware |
| :--- | :--- | :--- |
| FP16 | ~245 GB | 4x A100 (80GB) or 8x RTX 6000 Ada |
| Q8_0 | ~130 GB | 2x Mac Studio M2/M3 Ultra (192GB) or 6x RTX 3090/4090 |
| Q4_K_M | ~78 GB | 1x Mac Studio (128GB+) or 4x RTX 3090/4090 (24GB) |
| IQ3_M | ~55 GB | 3x RTX 3090/4090 or Mac Studio (64GB+) |
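These figures can be sanity-checked with simple bits-per-weight arithmetic. The per-format averages below are typical llama.cpp values (assumptions), and real GGUF files carry a few percent of extra overhead for metadata and embeddings:

```python
# Weight size ~= total_params * bits_per_weight / 8.
# Bits-per-weight averages are typical llama.cpp figures (assumptions).

BITS_PER_WEIGHT = {"IQ3_M": 3.66, "Q4_K_M": 4.84, "Q8_0": 8.5, "FP16": 16.0}

def weight_gb(total_params: float, quant: str) -> float:
    """Approximate weight footprint in GB for a given quantization."""
    return total_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{weight_gb(122e9, quant):6.1f} GB")
```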
The best quantization for Qwen3.5-122B-A10B for most practitioners is Q4_K_M. It incurs a nearly indistinguishable perplexity loss relative to FP16 while fitting into the memory footprint of common multi-GPU or high-memory Mac setups.
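Beyond the weights themselves, long-context work adds KV-cache memory, which scales linearly with sequence length. A sketch with hypothetical layer and head counts (the real Qwen3.5-122B-A10B config is not given here, so these are placeholders):

```python
# KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_elt.
# The layer/head counts below are hypothetical placeholders, not the
# actual Qwen3.5-122B-A10B configuration.

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """KV-cache size in GB; bytes_per_elt=2 assumes an FP16 cache."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elt / 1e9

# e.g. 60 layers, 8 KV heads (grouped-query attention), head_dim 128,
# the full 262,144-token context:
print(f"{kv_cache_gb(60, 8, 128, 262144):.1f} GB")
```

This is why grouped-query attention and quantized (e.g. 8-bit) KV caches matter so much at 262K context: without them, the cache can rival the quantized weights in size.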
For GPU rigs, llama.cpp or vLLM is the gold standard. A 4x RTX 3090/4090 setup provides 96 GB of VRAM, leaving enough room for the 78 GB Q4_K_M weights plus a sizeable KV cache for long-context tasks.

The fastest way to test this model is via Ollama. Once you have the necessary RAM/VRAM, you can run:
ollama run qwen3.5:122b
This will default to a quantized version suitable for your hardware.
When choosing a local model in the ~122B-parameter class in 2025, you are likely weighing a few specific competitors.
Llama 3.3 70B is a dense model. While it has fewer total parameters, every token activates all 70B of them, which demands significantly more compute per token than Qwen's 10B active parameters.
DeepSeek-V3 is a larger MoE (671B total / 37B active). It offers a higher capability ceiling, but its total parameter count pushes even aggressive quantizations beyond typical workstation memory.
Qwen3.5-122B-A10B represents the current ceiling for what a high-end consumer workstation can reasonably handle while maintaining frontier-level performance in coding and tool use. If your workflow involves long-context document analysis or autonomous agents, and you have the VRAM to support it, this is currently the most capable model in the 100B-150B MoE class.