Z.ai's open-weight flagship, a 744B-parameter Mixture-of-Experts model with 40B active per token and a native 1M-token context. Activates 8 of 256 experts per token and uses DeepSeek Sparse Attention with an IndexShare scheme that cuts per-token FLOPs by 2.9x at 1M context. Built for long-horizon coding and agentic work, with selectable reasoning effort (high/max, or thinking off). Scores 62.1 on SWE-bench Pro, 81.0 on Terminal-Bench 2.1, 74.4 on FrontierSWE, 91.2 on GPQA-Diamond, and 99.2 on AIME 2026, beating GPT-5.5 on several long-horizon coding benchmarks at a fraction of the cost. Released under the MIT license.
A workable 744B-parameter MoE language model from Z.ai. Pulls ahead on graduate-level reasoning (GPQA, 90/100), so reach for it when that's the dimension that matters. Newly released, so production-readiness is still being shaken out.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 335.3 GB | Low |
| Q4_K_M (Recommended) | 343.7 GB | Good |
| Q5_K_M | 347.7 GB | Very Good |
| Q6_K | 352.5 GB | Excellent |
| Q8_0 | 362.5 GB | Near Perfect |
| FP16 | 400.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Rating | Speed | VRAM Required |
|---|---|---|---|
| | AA | 16.6 tok/s | 343.7 GB |
| | AA | 16.6 tok/s | 343.7 GB |
| | AA | 16.6 tok/s | 343.7 GB |
| | AA | 16.6 tok/s | 343.7 GB |
| SuperMicro Super AI Station (SuperMicro) | AA | 16.6 tok/s | 343.7 GB |
| Gigabyte W775-V10-L01 (Gigabyte) | AA | 16.6 tok/s | 343.7 GB |
| | BB | 1.9 tok/s | 343.7 GB |
| | BB | 1.9 tok/s | 343.7 GB |
| | FF | 1.2 tok/s | 343.7 GB |
| | FF | 0.6 tok/s | 343.7 GB |
| | FF | 12.4 tok/s | 343.7 GB |
| | FF | 14.1 tok/s | 343.7 GB |
| | FF | 18.7 tok/s | 343.7 GB |
| | FF | 0.7 tok/s | 343.7 GB |
| | FF | 1.0 tok/s | 343.7 GB |
| | FF | 1.5 tok/s | 343.7 GB |
| | FF | 1.9 tok/s | 343.7 GB |
| | FF | 2.2 tok/s | 343.7 GB |
| | FF | 1.5 tok/s | 343.7 GB |
| | FF | 1.5 tok/s | 343.7 GB |
| Apple M4 (Apple) | FF | 0.3 tok/s | 343.7 GB |
| | FF | 1.3 tok/s | 343.7 GB |
| | FF | 0.6 tok/s | 343.7 GB |
| Apple M5 (Apple) | FF | 0.4 tok/s | 343.7 GB |
| | FF | 1.4 tok/s | 343.7 GB |
Energy cost on Apple M3 Ultra (32-core CPU, 80-core GPU) (~1.9 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only): GLM-5.2 on Apple M3 Ultra (32-core CPU, 80-core GPU), ~1.9 tok/s, 160 W | $2.78 |
| GPT-5.5 (OpenAI; in $5.00, out $30.00) | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic; in $5.00, out $25.00) | $11.00 |
| Gemini 3.5 Flash (Google; in $1.50, out $9.00) | $3.75 |
| Grok 4.3 (xAI; in $1.25, out $2.50) | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
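The figures in the cost table can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration in Python, not the site's calculator: the function names are hypothetical, and because the page does not state an electricity rate, the code only back-solves the rate implied by the $2.78 local figure.

```python
def blended_price(input_usd, output_usd, input_share=0.70):
    """Blend per-1M-token input and output prices (70% input / 30% output)."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


def energy_kwh_per_million(watts, tok_per_s):
    """Energy in kWh needed to generate 1M tokens at a constant power draw."""
    seconds = 1_000_000 / tok_per_s
    return watts * seconds / 3_600_000  # watt-seconds (joules) to kWh


# API rows from the table: (input price, output price) per 1M tokens.
api_prices = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7 Thinking": (5.00, 25.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Grok 4.3": (1.25, 2.50),
}
blended = {name: blended_price(i, o) for name, (i, o) in api_prices.items()}

# Local row: 160 W at ~1.9 tok/s on the Apple M3 Ultra.
local_kwh = energy_kwh_per_million(watts=160, tok_per_s=1.9)

# Assumption: the table's $2.78 is energy cost only, so the implied
# electricity rate is simply cost divided by energy.
implied_usd_per_kwh = 2.78 / local_kwh

for name, price in blended.items():
    print(f"{name}: ${price:.3f} per 1M tokens")
print(f"Local energy: {local_kwh:.2f} kWh per 1M tokens")
print(f"Implied electricity rate: ${implied_usd_per_kwh:.4f} per kWh")
```

At 1.9 tok/s, one million tokens take roughly 146 hours of continuous generation, which is worth keeping in mind when comparing the energy-only figure with API prices.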
Cheapest current cloud rentals with at least 344 GB VRAM, refreshed hourly.
No current rental listing covers this model’s VRAM requirement on the providers we track.
GLM-5.2 is Z.ai's open-weight flagship, a 744B-parameter Mixture-of-Experts model that activates only 40B parameters per token. It is built specifically for long-horizon tasks — sustained coding sessions, multi-step agentic workflows, and project-scale reasoning — where maintaining coherence across hundreds of thousands of tokens is the difference between a useful tool and a toy. Released under the MIT license, it competes directly with frontier closed-source models like Claude Opus 4.8 and GPT-5.5 on long-context coding benchmarks while running at a fraction of their cost.
The model uses 8 of 256 experts per token, so its per-token compute is closer to that of a 40B dense model than a 744B one. This is the key architectural decision that makes GLM-5.2 viable for local deployment despite its total parameter count. Z.ai has also introduced IndexShare, a scheme built on DeepSeek Sparse Attention that reuses one indexer across each group of four attention layers, reducing per-token FLOPs by 2.9x at 1M context length. In practice this means faster generation and lower memory pressure when working with long inputs.
GLM-5.2 is a text-only MoE transformer with a native 1,000,000-token context window. The 744B total parameters are distributed across 256 experts, with 8 experts activated per token. This 8/256 activation ratio means the model's compute cost per token is roughly equivalent to a 40B dense model, while retaining the knowledge capacity of the full parameter set.
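To make the 8-of-256 activation concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is an illustration of the general technique only: the `route` function, the random router logits, and the softmax-over-selected-experts weighting are assumptions, not Z.ai's actual router.

```python
import math
import random

NUM_EXPERTS = 256  # experts per MoE layer, from the description above
TOP_K = 8          # experts activated per token


def route(logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for a single token.
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route(router_logits)
expert_fraction = TOP_K / NUM_EXPERTS  # share of experts used per token
active_param_fraction = 40 / 744       # share of parameters used per token

print(f"experts used per token: {len(selected)} of {NUM_EXPERTS} ({expert_fraction:.1%})")
print(f"active parameters: {active_param_fraction:.1%} of the total")
```

The expert fraction (about 3.1%) is smaller than the active-parameter fraction (about 5.4%). A plausible reading is that components outside the routed experts, such as attention and embeddings, run for every token, but the source does not give that breakdown.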
The attention mechanism uses DeepSeek Sparse Attention with Z.ai's IndexShare scheme. Standard sparse attention requires computing a separate index for each attention layer, which becomes expensive at long contexts. IndexShare reuses a single indexer across groups of four layers, cutting the FLOP cost of attention computation by 2.9x at 1M tokens. For practitioners running long-context workloads, this means lower latency and the ability to fit larger batches into available VRAM.
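The sharing pattern described above can be sketched as a loop in which only the first layer of each four-layer group runs the indexer and the other three reuse its token selection. This is a toy illustration under stated assumptions: the layer count, the `toy_scores` scoring function, and the top-k selection rule are all hypothetical, and the sketch does not reproduce the 2.9x FLOP figure.

```python
GROUP_SIZE = 4  # layers that share one indexer, per the description above


def select_tokens(scores, k):
    """Return the positions of the k highest-scoring context tokens, in order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def run_layers(num_layers, score_fn, k):
    """Compute the token selection once per group and reuse it within the group."""
    indexer_calls = 0
    shared_selection = None
    per_layer = []
    for layer in range(num_layers):
        if layer % GROUP_SIZE == 0:
            # First layer of a group: run the indexer and cache its result.
            shared_selection = select_tokens(score_fn(layer), k)
            indexer_calls += 1
        # Every layer in the group attends over the same selected tokens.
        per_layer.append(shared_selection)
    return per_layer, indexer_calls


def toy_scores(layer):
    """Hypothetical relevance scores for a 16-token context."""
    return [(7 * pos + 3 * layer) % 11 for pos in range(16)]


per_layer, indexer_calls = run_layers(num_layers=12, score_fn=toy_scores, k=4)
print(f"indexer calls: {indexer_calls} for {len(per_layer)} layers")
```

With one indexer per layer the same 12 layers would need 12 indexer calls, so sharing cuts the indexer work by the group size. The overall attention saving depends on how much of the total cost the indexer represents.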
Z.ai also improved the MTP (Multi-Token Prediction) layer used for speculative decoding. The acceptance length increases by up to 20% compared to GLM-5.1, which improves tokens-per-second during inference — particularly relevant when running on consumer hardware where every efficiency gain matters.
The model supports selectable reasoning effort levels: "high" and "max" for deep reasoning, or "thinking off" for latency-sensitive applications like chat. This gives you control over how much compute the model spends on each generation, which is useful when balancing quality against throughput on a single GPU.
GLM-5.2 is a general-purpose text model with strong performance across coding, reasoning, math, and instruction-following. Its capabilities include chat, code generation, function-calling, multilingual support (English and Chinese), and complex mathematical reasoning.
The model's standout use case is long-horizon coding. On SWE-bench Pro it scores 62.1, on Terminal-Bench 2.1 it scores 81.0, and on FrontierSWE it scores 74.4 — beating GPT-5.5 on all three benchmarks. These tests measure the model's ability to work through multi-step engineering tasks: fixing bugs across a codebase, implementing features with multiple files, and navigating terminal environments. This is not a model that generates a single function and stops — it is designed to sustain coherent reasoning over hours-long agentic sessions.
For reasoning-heavy workloads, GLM-5.2 scores 99.2 on AIME 2026 and 91.2 on GPQA-Diamond. The CritPt benchmark, which tests critical point identification in long technical documents, shows a dramatic improvement over GLM-5.1 (20.9 vs 4.6), indicating that the long-context training paid off in practical document understanding.
The model supports function-calling and tool use, making it suitable for agent frameworks that need to invoke APIs, query databases, or interact with file systems. The 1M context window means you can feed it an entire codebase or a full project specification without chunking.
This is where GLM-5.2's MoE architecture becomes critical. With 40B active parameters, the model does not require the VRAM you would expect for a 744B dense model. However, the full parameter set must still be loaded into memory.
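VRAM requirements by quantization (from the quantization table above): Q2_K 335.3 GB, Q4_K_M 343.7 GB, Q5_K_M 347.7 GB, Q6_K 352.5 GB, Q8_0 362.5 GB, FP16 400.5 GB.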
Consumer hardware reality: No single consumer GPU can run GLM-5.2. A single RTX 4090 (24 GB) or even an M4 Max (128 GB unified memory) is insufficient, so running the model locally means a multi-GPU setup. Measured against the quantization table above, typical multi-GPU workstations still do not hold the whole model in VRAM: 4x RTX 4090 gives 96 GB and 8x RTX 4090 gives 192 GB, both well below the 335.3 GB listed for Q2_K, while 2x A100 80GB (160 GB) and 4x A6000 (192 GB) fall short of the 343.7 GB listed for Q4_K_M. On these configurations the remaining weights have to be offloaded to system memory, which slows generation.
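A quick way to sanity-check a multi-GPU configuration is to compare its total VRAM with the figures in the quantization table. The sketch below does only that comparison; the helper names are hypothetical, the per-card capacities (24 GB, 80 GB, 48 GB) are the cards' standard memory sizes, and the check ignores KV cache, activations, and any offloading to system memory.

```python
import math

# VRAM requirements in GB, copied from the quantization table above.
REQUIRED_GB = {
    "Q2_K": 335.3,
    "Q4_K_M": 343.7,
    "Q5_K_M": 347.7,
    "Q6_K": 352.5,
    "Q8_0": 362.5,
    "FP16": 400.5,
}

# Memory per card in GB.
CARD_GB = {"RTX 4090": 24, "A100 80GB": 80, "RTX A6000": 48}


def total_vram(card, count):
    """Total VRAM in GB for `count` identical cards."""
    return CARD_GB[card] * count


def fitting_formats(total_gb):
    """Formats whose listed requirement fits entirely in `total_gb` of VRAM."""
    return [name for name, need in REQUIRED_GB.items() if need <= total_gb]


def cards_needed(card, fmt):
    """Minimum number of identical cards whose combined VRAM covers `fmt`."""
    return math.ceil(REQUIRED_GB[fmt] / CARD_GB[card])


for card, count in [("RTX 4090", 4), ("RTX 4090", 8), ("A100 80GB", 2), ("RTX A6000", 4)]:
    total = total_vram(card, count)
    fits = fitting_formats(total) or "none"
    print(f"{count}x {card}: {total} GB, fits fully in VRAM: {fits}")

print("RTX 4090 cards for Q4_K_M:", cards_needed("RTX 4090", "Q4_K_M"))
print("A100 80GB cards for Q4_K_M:", cards_needed("A100 80GB", "Q4_K_M"))
```

Because the check uses the table's figures as-is, treat the card counts as lower bounds: long contexts add KV-cache memory on top of the weights.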
Expected performance: With 4x RTX 4090s using Q2_K quantization, expect 2-5 tokens per second on long contexts. With 8x RTX 4090s at Q4_K_M, expect 5-10 tokens per second. These numbers vary significantly based on context length and the reasoning effort setting.
Ollama is the quickest way to get started. The model is available on HuggingFace as zai-org/GLM-5.2 and can be pulled into Ollama with a custom Modelfile. For production deployments, use llama.cpp or vLLM with tensor parallelism across multiple GPUs.
Best quantization for most users: Q4_K_M if you have the hardware. It offers the best balance of quality and efficiency for coding and reasoning tasks. Drop to Q3_K_M only if you are VRAM-constrained and need the model to fit on fewer cards.
GLM-5.2 vs DeepSeek-V4-Pro: Both are MoE models with similar active parameter counts. DeepSeek-V4-Pro is strong on general reasoning but GLM-5.2 leads on long-horizon coding benchmarks: 62.1 vs 55.4 on SWE-bench Pro, and 81.0 vs 64 on Terminal-Bench 2.1. If your workload is sustained coding sessions or agentic tasks, GLM-5.2 is the better choice. DeepSeek-V4-Pro may edge ahead on some reasoning benchmarks, but the gap is narrow.
GLM-5.2 vs Qwen3.7-Max: Qwen3.7-Max is a dense model with different scaling characteristics. GLM-5.2 beats it on SWE-bench Pro (62.1 vs 60.6), AIME 2026 (99.2 vs 97), and Terminal-Bench 2.1 (81.0 vs 75). Qwen3.7-Max has stronger performance on HMMT Feb. 2026 (97.1 vs 92.5), so for pure math competition problems, it may be preferable. For coding and agentic work, GLM-5.2 is the stronger model. Qwen3.7-Max also requires significantly more VRAM for its dense architecture, making GLM-5.2 more efficient per parameter.
Choose GLM-5.2 when you need sustained long-context performance for coding agents, multi-file refactoring, or project-scale reasoning. Choose alternatives if you cannot provide the roughly 335 GB that even the Q2_K build requires per the quantization table above, or if your workload is primarily short-form chat where the 1M context is unnecessary.