Z.ai

GLM-5.2

Z.ai's open-weight flagship, a 744B-parameter Mixture-of-Experts model with 40B active per token and a native 1M-token context. Activates 8 of 256 experts per token and uses DeepSeek Sparse Attention with an IndexShare scheme that cuts per-token FLOPs by 2.9x at 1M context. Built for long-horizon coding and agentic work, with selectable reasoning effort (high/max, or thinking off). Scores 62.1 on SWE-bench Pro, 81.0 on Terminal-Bench 2.1, 74.4 on FrontierSWE, 91.2 on GPQA-Diamond, and 99.2 on AIME 2026, beating GPT-5.5 on several long-horizon coding benchmarks at a fraction of the cost. Released under the MIT license.

744B params · MoE · 1000K ctx


Our Take

Best for: Strongest at graduate-level reasoning (GPQA) in its size class

A workable 744B-parameter MoE language model from Z.ai. Pulls ahead on graduate-level reasoning (GPQA) (90/100), so reach for it when that's the dimension that matters. Newly released, so production-readiness is still being shaken out.

Run this on: Apple M3 Ultra (32-core CPU, 80-core GPU). Cheapest card in our directory with comfortable headroom (512 GB) for this model at Q4 (~343.7 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters: 744B

Active Params: 40B

Architecture: MoE

Context Length: 1M tokens

Modality: Text Only

Provider: Z.ai

Download Size: 1.5 TB

Community

Monthly Downloads: 67.1K

Likes: 2.4K

Last Updated: 3 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

MIT

Performance & Scoring

Benchmarks

GPQA

89.5

HLE

40.1

SWE-Pro

62.1

AA Intelligence Index

51.1

50.5

50.8

73.3

71.3

MBA Open Score: 50.7 (grade C)

Benchmark (40%): 61.1

Popularity (25%): 50.6

Efficiency (20%): 9.9

Versatility (15%): 77.5
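The overall score appears to be a plain weighted sum of the four components. That is an inference from the listed weights, not something the page states, but it does reproduce the published figure. A minimal sketch:

```python
# Component scores and weights as listed on this page.
COMPONENTS = {
    "Benchmark": (61.1, 0.40),
    "Popularity": (50.6, 0.25),
    "Efficiency": (9.9, 0.20),
    "Versatility": (77.5, 0.15),
}


def weighted_score(components):
    """Weighted sum of (score, weight) pairs; weights must sum to 1."""
    total_weight = sum(weight for _, weight in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(score * weight for score, weight in components.values())


print(round(weighted_score(COMPONENTS), 1))  # 50.7
```

The low Efficiency component (9.9) pulls the total down by only a couple of points because it carries a 20% weight.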

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality	Notes
Q2_K	335.3 GB	Low	Aggressive quantization; smallest size, noticeable quality loss
Q4_K_M (recommended)	343.7 GB	Good	Best balance of size and quality for most use cases
Q5_K_M	347.7 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	352.5 GB	Excellent	Near-lossless quality with manageable size
Q8_0	362.5 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	400.5 GB	Full	Full 16-bit floating point; maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Speed	VRAM Required
ASUS ExpertCenter Pro ET900N G3	ASUS	A	16.6 tok/s	343.7 GB
Dell Pro Max with GB300	Dell	A	16.6 tok/s	343.7 GB
HP ZGX Fury AI Station	HP	A	16.6 tok/s	343.7 GB
MSI XpertStation WS300	MSI	A	16.6 tok/s	343.7 GB
SuperMicro Super AI Station	SuperMicro	A	16.6 tok/s	343.7 GB
Gigabyte W775-V10-L01	Gigabyte	A	16.6 tok/s	343.7 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	B	1.9 tok/s	343.7 GB
Apple Mac Studio (M3 Ultra, 2025)	Apple	B	1.9 tok/s	343.7 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	F	1.2 tok/s	343.7 GB
Acer Veriton GN100 AI Mini	Acer	F	0.6 tok/s	343.7 GB
AMD Instinct MI300X	AMD	F	12.4 tok/s	343.7 GB
AMD Instinct MI325X	AMD	F	14.1 tok/s	343.7 GB
AMD Instinct MI355X	AMD	F	18.7 tok/s	343.7 GB
AMD Radeon RX 7600 8GB	AMD	F	0.7 tok/s	343.7 GB
AMD Radeon RX 7700 XT	AMD	F	1.0 tok/s	343.7 GB
AMD Radeon RX 7800 XT	AMD	F	1.5 tok/s	343.7 GB
AMD Radeon RX 7900 XT	AMD	F	1.9 tok/s	343.7 GB
AMD Radeon RX 7900 XTX	AMD	F	2.2 tok/s	343.7 GB
AMD Radeon RX 9070	AMD	F	1.5 tok/s	343.7 GB
AMD Radeon RX 9070 XT	AMD	F	1.5 tok/s	343.7 GB
Apple M4	Apple	F	0.3 tok/s	343.7 GB
Apple M4 Max (40-core GPU)	Apple	F	1.3 tok/s	343.7 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	F	0.6 tok/s	343.7 GB
Apple M5	Apple	F	0.4 tok/s	343.7 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	F	1.4 tok/s	343.7 GB


Run Locally vs API

Energy cost on Apple M3 Ultra (32-core CPU, 80-core GPU) (~1.9 tok/s, Q4_K_M) vs flagship API pricing.

Source	Details	Cost per 1M tokens
Local (energy only)	GLM-5.2 on Apple M3 Ultra (32-core CPU, 80-core GPU) · ~1.9 tok/s · 160 W	$2.78
GPT-5.5	OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 Thinking	Anthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 Flash	Google · in $1.50 · out $9.00	$3.75
Grok 4.3	xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.
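The arithmetic behind both columns can be sketched as follows. The 70/30 blend comes from the note above; the electricity rate is not stated on this page, so the `usd_per_kwh=0.12` value is purely an assumption, chosen because it lands within a few cents of the listed local figure.

```python
def blended_api_price(input_price, output_price, input_share=0.70):
    """Blend per-1M-token input and output prices by traffic share."""
    return input_price * input_share + output_price * (1.0 - input_share)


def local_energy_cost(tokens_per_second, watts, usd_per_kwh, tokens=1_000_000):
    """Electricity cost of generating `tokens` at a steady rate and power draw."""
    hours = tokens / tokens_per_second / 3600.0
    return hours * (watts / 1000.0) * usd_per_kwh


# (input, output) prices in USD per 1M tokens, as listed in the table.
API_PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7 Thinking": (5.00, 25.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Grok 4.3": (1.25, 2.50),
}

for name, (price_in, price_out) in API_PRICES.items():
    print(f"{name}: ${blended_api_price(price_in, price_out):.3f}")

# Assumed rate of $0.12/kWh; the page does not say which rate it uses.
print(f"Local: ${local_energy_cost(1.9, 160, usd_per_kwh=0.12):.2f}")
```

At 1.9 tok/s, one million tokens take roughly 146 hours, so the local figure covers energy for about six days of continuous generation on a single machine.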

Hardware amortisation not included. Run the full ROI calculator for payback math.


Rent in the Cloud

Cheapest current cloud rentals with at least 344 GB VRAM, refreshed hourly.

No current rental listing covers this model’s VRAM requirement on the providers we track.

About This Model

Overview

GLM-5.2 is Z.ai's open-weight flagship, a 744B-parameter Mixture-of-Experts model that activates only 40B parameters per token. It is built specifically for long-horizon tasks — sustained coding sessions, multi-step agentic workflows, and project-scale reasoning — where maintaining coherence across hundreds of thousands of tokens is the difference between a useful tool and a toy. Released under the MIT license, it competes directly with frontier closed-source models like Claude Opus 4.8 and GPT-5.5 on long-context coding benchmarks while running at a fraction of their cost.

The model activates 8 of 256 experts per token, so per-token compute is closer to that of a 40B dense model than a 744B one. This is the key architectural decision that keeps generation speed practical for local deployment despite the total parameter count, although all 744B parameters still have to be held in memory. Z.ai has also introduced IndexShare, a sparse attention scheme that reuses the same indexer across every four attention layers, reducing per-token FLOPs by 2.9x at 1M context length. In practice that means faster generation and lower memory pressure when working with long inputs.

Architecture & Technical Details

GLM-5.2 is a text-only MoE transformer with a native 1,000,000-token context window. The 744B total parameters are distributed across 256 experts, with 8 experts activated per token. This 8/256 activation ratio means the model's compute cost per token is roughly equivalent to a 40B dense model, while retaining the knowledge capacity of the full parameter set.
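The two ratios in this paragraph are worth separating. The sketch below uses only the figures quoted above; the explanation for the gap between them (attention, embeddings, and any always-on layers running for every token) is an assumption, since the page gives no per-component parameter breakdown.

```python
TOTAL_PARAMS = 744e9
ACTIVE_PARAMS = 40e9
TOTAL_EXPERTS = 256
ACTIVE_EXPERTS = 8


def expert_fraction(active=ACTIVE_EXPERTS, total=TOTAL_EXPERTS):
    """Share of routed experts that fire for one token."""
    return active / total


def param_fraction(active=ACTIVE_PARAMS, total=TOTAL_PARAMS):
    """Share of all parameters that take part in one token's forward pass."""
    return active / total


print(f"experts active per token: {expert_fraction():.2%}")
print(f"parameters active per token: {param_fraction():.2%}")
```

Roughly 3% of the experts fire per token while roughly 5% of the parameters are active, which is consistent with some parameters sitting outside the routed experts.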

The attention mechanism uses DeepSeek Sparse Attention with Z.ai's IndexShare scheme. Standard sparse attention requires computing a separate index for each attention layer, which becomes expensive at long contexts. IndexShare reuses a single indexer across groups of four layers, cutting the FLOP cost of attention computation by 2.9x at 1M tokens. For practitioners running long-context workloads, this means lower latency and the ability to fit larger batches into available VRAM.
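The sharing pattern can be illustrated with a toy mapping from each layer to the layer whose indexer output it reuses. This is only a sketch of the grouping idea: the layer count is hypothetical, the function names are made up for illustration, and it counts indexer passes only rather than reproducing the 2.9x FLOP figure.

```python
def indexer_owner(layer, group_size=4):
    """Return the first layer of the group, whose indexer output `layer` reuses."""
    return (layer // group_size) * group_size


def indexer_runs(num_layers, group_size=4):
    """Count how many layers actually run their own indexer."""
    return len({indexer_owner(layer, group_size) for layer in range(num_layers)})


NUM_LAYERS = 16  # hypothetical depth, chosen only for illustration

print([indexer_owner(layer) for layer in range(8)])  # [0, 0, 0, 0, 4, 4, 4, 4]
print(indexer_runs(NUM_LAYERS))                      # 4
print(indexer_runs(NUM_LAYERS, group_size=1))        # 16
```

Indexer passes drop by 4x under this grouping, while the overall saving quoted above is smaller, presumably because the attention computation itself still runs in every layer.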

Z.ai also improved the MTP (Multi-Token Prediction) layer used for speculative decoding. The acceptance length increases by up to 20% compared to GLM-5.1, which improves tokens-per-second during inference — particularly relevant when running on consumer hardware where every efficiency gain matters.
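As a rough model of why acceptance length matters: each verification step emits about one acceptance length's worth of tokens, so throughput scales with it if the step time stays fixed. The baseline acceptance length and step time below are hypothetical placeholders, not measured values.

```python
def decode_throughput(acceptance_length, step_seconds):
    """Tokens per second when each verify step emits `acceptance_length` tokens."""
    return acceptance_length / step_seconds


BASELINE_ACCEPTANCE = 2.5  # hypothetical tokens accepted per step
STEP_SECONDS = 0.05        # hypothetical time per verify step

baseline = decode_throughput(BASELINE_ACCEPTANCE, STEP_SECONDS)
improved = decode_throughput(BASELINE_ACCEPTANCE * 1.20, STEP_SECONDS)

print(f"speedup: {improved / baseline:.2f}x")
```

Under these assumptions the 20% upper bound on acceptance length is also the upper bound on the decoding speedup; any extra cost in the draft step would reduce it.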

The model supports selectable reasoning effort levels: "high" and "max" for deep reasoning, or "thinking off" for latency-sensitive applications like chat. This gives you control over how much compute the model spends on each generation, which is useful when balancing quality against throughput on fixed hardware.

Capabilities & Use Cases

GLM-5.2 is a general-purpose text model with strong performance across coding, reasoning, math, and instruction-following. Its capabilities include chat, code generation, function-calling, multilingual support (English and Chinese), and complex mathematical reasoning.

The model's standout use case is long-horizon coding. On SWE-bench Pro it scores 62.1, on Terminal-Bench 2.1 it scores 81.0, and on FrontierSWE it scores 74.4 — beating GPT-5.5 on all three benchmarks. These tests measure the model's ability to work through multi-step engineering tasks: fixing bugs across a codebase, implementing features with multiple files, and navigating terminal environments. This is not a model that generates a single function and stops — it is designed to sustain coherent reasoning over hours-long agentic sessions.

For reasoning-heavy workloads, GLM-5.2 scores 99.2 on AIME 2026 and 91.2 on GPQA-Diamond. The CritPt benchmark, which tests critical point identification in long technical documents, shows a dramatic improvement over GLM-5.1 (20.9 vs 4.6), indicating that the long-context training paid off in practical document understanding.

The model supports function-calling and tool use, making it suitable for agent frameworks that need to invoke APIs, query databases, or interact with file systems. The 1M context window means you can feed it an entire codebase or a full project specification without chunking.
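On the agent side, tool use comes down to describing each tool in a schema and dispatching the calls the model emits. The sketch below is framework-agnostic and runs without a model: the schema follows the JSON-schema style common to OpenAI-compatible chat APIs, which is an assumption about how GLM-5.2 would be served, and `read_file_stub` and the sample call are hypothetical.

```python
import json


def read_file_stub(path):
    """Stand-in tool; a real agent would read the file from disk here."""
    return f"<contents of {path}>"


# Registry mapping tool names to Python callables.
TOOLS = {"read_file": read_file_stub}

# Tool description that would be sent to the model alongside the chat messages.
TOOL_SCHEMA = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the text of a file in the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]


def dispatch(tool_call):
    """Run one tool call of the form {"name": ..., "arguments": "<JSON string>"}."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(tool_call["arguments"])
    return TOOLS[name](**kwargs)


# Hypothetical call, shaped like what a model might emit.
call = {"name": "read_file", "arguments": json.dumps({"path": "src/main.py"})}
print(dispatch(call))  # <contents of src/main.py>
```

Keeping the registry separate from the schema makes it easy to reject calls to tools the model was never offered.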

Running GLM-5.2 Locally

This is where GLM-5.2's MoE architecture becomes critical. With 40B active parameters, per-token compute is far lower than it would be for a 744B dense model. Memory is a different matter: the full parameter set must still be loaded, so the weight footprint is that of a 744B model.
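A common rule of thumb for the weight footprint is parameters times bits per weight, divided by eight. It ignores the KV cache, activations, and runtime overhead, so it is a lower bound rather than a full memory estimate. Applied to the 16-bit weights it lines up with the 1.5 TB download size listed in the specifications.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate bytes needed to hold the weights alone."""
    return num_params * bits_per_weight / 8


def to_terabytes(num_bytes):
    """Convert bytes to decimal terabytes."""
    return num_bytes / 1e12


full_precision = weight_bytes(744e9, 16)
print(round(to_terabytes(full_precision), 2))  # 1.49
```

The same function can be called with a lower `bits_per_weight` to approximate a quantized build; effective bits per weight vary by quantization scheme, so any value passed there is an assumption.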

VRAM requirements by quantization:

Q4_K_M: ~210 GB VRAM (minimum for local inference)
Q3_K_M: ~160 GB VRAM (usable with aggressive quantization)
Q2_K: ~105 GB VRAM (significant quality loss, not recommended for coding)

Consumer hardware reality: No single consumer GPU can run GLM-5.2. A single RTX 4090 (24 GB) or even an M4 Max (128 GB unified memory) is insufficient, so local inference needs a multi-GPU setup. Four RTX 4090s in one system give 96 GB total, which is just under the ~105 GB Q2_K estimate above, so that configuration only works with part of the model offloaded to system RAM. For Q4_K_M at ~210 GB, 8x RTX 4090 (192 GB), 2x A100 80GB (160 GB), and 4x A6000 (192 GB) all fall short as well; they cover Q3_K_M, while Q4_K_M needs more pooled memory or offloading.
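Checking a configuration against the estimates listed in this section is simple arithmetic. The sketch below pools VRAM per configuration and compares it with each quantization estimate. The 48 GB figure for the A6000 is not from this page, and the check ignores KV cache and runtime overhead, so a "fits" result is optimistic.

```python
# Estimates from the list in this section, in GB.
ESTIMATED_VRAM_GB = {"Q4_K_M": 210, "Q3_K_M": 160, "Q2_K": 105}

# (VRAM per card in GB, card count); the A6000's 48 GB is not stated on this page.
CONFIGS = {
    "4x RTX 4090": (24, 4),
    "8x RTX 4090": (24, 8),
    "2x A100 80GB": (80, 2),
    "4x A6000": (48, 4),
}


def pooled_vram_gb(card_gb, count):
    """Total VRAM across identical cards, ignoring interconnect overhead."""
    return card_gb * count


def fits(config_name, quant):
    """True if pooled VRAM meets or exceeds the estimate for `quant`."""
    card_gb, count = CONFIGS[config_name]
    return pooled_vram_gb(card_gb, count) >= ESTIMATED_VRAM_GB[quant]


for name in CONFIGS:
    usable = [quant for quant in ESTIMATED_VRAM_GB if fits(name, quant)]
    print(f"{name}: {pooled_vram_gb(*CONFIGS[name])} GB, fits {usable or 'none'}")
```

By these numbers, none of the four configurations reaches the Q4_K_M estimate without offloading part of the model.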

Expected performance: With 4x RTX 4090s using Q2_K quantization, expect 2-5 tokens per second on long contexts. With 8x RTX 4090s at Q4_K_M, expect 5-10 tokens per second. These numbers vary significantly based on context length and the reasoning effort setting.

Ollama is the quickest way to get started. The model is available on Hugging Face as zai-org/GLM-5.2 and can be pulled into Ollama with a custom Modelfile. For production deployments, use llama.cpp or vLLM with tensor parallelism across multiple GPUs.

Best quantization for most users: Q4_K_M if you have the hardware. It offers the best balance of quality and efficiency for coding and reasoning tasks. Drop to Q3_K_M only if you are VRAM-constrained and need the model to fit on fewer cards.

How It Compares

GLM-5.2 vs DeepSeek-V4-Pro: Both are MoE models with similar active parameter counts. DeepSeek-V4-Pro is strong on general reasoning but GLM-5.2 leads on long-horizon coding benchmarks: 62.1 vs 55.4 on SWE-bench Pro, and 81.0 vs 64 on Terminal-Bench 2.1. If your workload is sustained coding sessions or agentic tasks, GLM-5.2 is the better choice. DeepSeek-V4-Pro may edge ahead on some reasoning benchmarks, but the gap is narrow.

GLM-5.2 vs Qwen3.7-Max: Qwen3.7-Max is a dense model with different scaling characteristics. GLM-5.2 beats it on SWE-bench Pro (62.1 vs 60.6), AIME 2026 (99.2 vs 97), and Terminal-Bench 2.1 (81.0 vs 75). Qwen3.7-Max has stronger performance on HMMT Feb. 2026 (97.1 vs 92.5), so for pure math competition problems, it may be preferable. For coding and agentic work, GLM-5.2 is the stronger model. Qwen3.7-Max also requires significantly more VRAM for its dense architecture, making GLM-5.2 more efficient per parameter.

Choose GLM-5.2 when you need sustained long-context performance for coding agents, multi-file refactoring, or project-scale reasoning. Choose alternatives if you are VRAM-constrained below 160 GB or if your workload is primarily short-form chat where the 1M context is unnecessary.

