
A massive 744B parameter open-weights MoE model integrating DeepSeek Sparse Attention, built for long-term planning and resource management.
Copy and paste this command to start running the model locally.
```
ollama run glm-5
```
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 79.3 GB | Low |
| Q4_K_M (Recommended) | 87.7 GB | Good |
| Q5_K_M | 91.7 GB | Very Good |
| Q6_K | 96.5 GB | Excellent |
| Q8_0 | 106.5 GB | Near Perfect |
| FP16 | 144.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Rating | Speed | VRAM Used |
|---|---|---|---|---|
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 44.1 tok/s | 87.7 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 73.4 tok/s | 87.7 GB |
| SuperMicro Super AI Station | SuperMicro | SS | 65.2 tok/s | 87.7 GB |
| Gigabyte W775-V10-L01 | Gigabyte | SS | 65.2 tok/s | 87.7 GB |
| Google Cloud TPU v5p | Google | BB | 25.4 tok/s | 87.7 GB |
GLM-5 is Z.ai’s flagship 744B parameter Mixture-of-Experts (MoE) model, engineered specifically for long-horizon agentic tasks and complex systems engineering. Released under the MIT license, it represents a significant push into the ultra-large-scale open-weights category, competing directly with frontier models like DeepSeek-V3 and Llama 3.1 405B. By leveraging a sparse architecture, GLM-5 attempts to bridge the gap between massive knowledge capacity and practical inference efficiency.
The model is a successor to the GLM-4 series, scaling the training data to 28.5T tokens. Its primary value proposition lies in its ability to handle "agentic engineering"—tasks that require the model to not just write snippets of code, but to reason through entire repositories, manage resource allocation, and execute multi-step planning. For developers looking to run GLM-5 locally, the model offers a high-intelligence alternative to proprietary APIs, provided they have the specialized hardware required to host a 744B parameter weights file.
GLM-5 utilizes a Mixture-of-Experts (MoE) architecture that totals 744 billion parameters, yet only activates 40 billion parameters per token during inference. This disparity is critical for local practitioners: while the VRAM requirements are dictated by the full 744B parameter count, the inference speed (tokens per second) is more comparable to a 40B-50B dense model.
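To make the speed side of that trade-off concrete, the sketch below computes a bandwidth-bound ceiling on decode speed from the active parameter count alone. The bits-per-weight and memory-bandwidth figures are illustrative assumptions, not measurements, and real throughput will be noticeably lower once KV-cache reads and kernel overheads are included.

```python
# Rough, bandwidth-bound estimate of decode speed for an MoE model.
# Illustrative only: the bandwidth figure and bits-per-weight are assumptions,
# and real throughput is lower due to KV-cache reads, kernel overhead, etc.

def decode_tokens_per_second(active_params: float,
                             bits_per_weight: float,
                             mem_bandwidth_gb_per_s: float) -> float:
    # Each generated token must stream roughly the active weights once.
    bytes_per_token = active_params * bits_per_weight / 8
    return mem_bandwidth_gb_per_s * 1e9 / bytes_per_token

ACTIVE = 40e9  # GLM-5 activates ~40B parameters per token

# Ceiling at ~4.5 bits/weight on a hypothetical accelerator with ~4.8 TB/s of HBM bandwidth.
print(f"~{decode_tokens_per_second(ACTIVE, 4.5, 4800):.0f} tok/s theoretical ceiling")
```

This is why the table above shows double-digit token rates on data-center GPUs: the hardware only has to stream the 40B active parameters per token, not the full 744B.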
A standout technical feature is the integration of DeepSeek Sparse Attention (DSA). This mechanism is designed to reduce the computational overhead of the model’s 200,000-token context window. By optimizing how the model attends to distant tokens, DSA allows for more efficient processing of massive codebases or long technical documents without the quadratic attention cost and ballooning KV cache typically seen in standard dense transformers.
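To see why attention over a 200K-token window is expensive in the first place, here is a rough estimate of the KV cache a standard dense transformer would hold at that context length. The layer and head dimensions are hypothetical placeholders (GLM-5’s exact configuration is not given here); the point is only the order of magnitude that DSA and related techniques have to contend with.

```python
# Illustrative KV-cache estimate for a 200K-token context under standard dense
# attention. The layer/head dimensions below are placeholders, NOT GLM-5's
# actual configuration, which is not specified here.

def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    # 2x for the key and value tensors stored at every layer.
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_elem / 1e9

# Hypothetical configuration for illustration only.
print(f"{kv_cache_gb(200_000, n_layers=80, n_kv_heads=8, head_dim=128):.1f} GB")
```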
Z.ai utilized a custom asynchronous reinforcement learning infrastructure called "slime" to improve training throughput. This resulted in a model with high "intelligence efficiency," meaning it achieves better reasoning and coding performance per FLOP compared to its predecessors. The 200k context length is fully functional for retrieval and reasoning, making it a viable candidate for local RAG (Retrieval-Augmented Generation) over large private datasets.
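If you do point GLM-5 at a local RAG pipeline, the main practical task is budgeting retrieved chunks against the 200K-token window. The sketch below is a minimal illustration under stated assumptions: it uses a crude 4-characters-per-token estimate rather than the model’s real tokenizer, and the function and parameter names are hypothetical.

```python
# Minimal sketch of packing retrieved chunks into a 200K-token context for a
# local RAG setup. The 4-chars-per-token heuristic is a rough assumption; use
# the model's actual tokenizer for precise budgeting.

def pack_context(chunks: list[str], budget_tokens: int = 200_000,
                 reserve_for_answer: int = 8_000) -> str:
    packed, used = [], 0
    for chunk in chunks:                  # chunks assumed pre-sorted by relevance
        est_tokens = len(chunk) // 4 + 1  # crude token estimate
        if used + est_tokens > budget_tokens - reserve_for_answer:
            break
        packed.append(chunk)
        used += est_tokens
    return "\n\n---\n\n".join(packed)

context = pack_context(["first retrieved chunk ...", "second retrieved chunk ..."])
```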
GLM-5 is tuned for high-logic, low-fluff outputs. It excels in environments where the model must act as an autonomous agent rather than a simple chatbot.
Running a 744B parameter model locally is a significant engineering challenge that moves beyond standard consumer hardware. To run GLM-5 locally, you must account for the massive footprint of the weights.
The primary bottleneck is VRAM. Even with aggressive quantization, a 744B model is too large for a single or even dual RTX 4090 setup.
For most practitioners, Q4_K_M (4-bit) is the target. Using anything lower than 3-bit quantization on an MoE model often leads to "expert collapse," where the routing logic degrades, and the model loses its reasoning edge. If you are constrained by VRAM, it is often better to run a smaller dense model (like Llama 3.1 70B) at high precision than GLM-5 at 2-bit.
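As a rule of thumb, you can pick the quantization level mechanically from the VRAM you actually have. The helper below is a small sketch that walks the footprint table shown earlier, from highest to lowest quality; per the caveat above, treat a Q2_K result as a signal to switch to a smaller model rather than a recommendation.

```python
# Small helper that picks the highest-quality quantization that fits in the
# available VRAM, using the footprint figures from the table above.

QUANT_TABLE = [            # (format, VRAM required in GB), from the table above
    ("FP16",   144.5),
    ("Q8_0",   106.5),
    ("Q6_K",    96.5),
    ("Q5_K_M",  91.7),
    ("Q4_K_M",  87.7),
    ("Q2_K",    79.3),     # below 3-bit risks "expert collapse" on MoE models
]

def pick_quant(available_vram_gb: float) -> str | None:
    for fmt, required in QUANT_TABLE:      # ordered best quality first
        if required <= available_vram_gb:
            return fmt
    return None                            # model will not fit at all

print(pick_quant(96.0))   # -> "Q5_K_M"
```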
The simplest entry point is Ollama: run `ollama run glm-5`. Note that Ollama will automatically attempt to offload layers to your GPU and spill the remainder into system RAM, but performance will drop below 1 token/second if the model spills into system DDR4/DDR5 memory.

GLM-5 sits in the "Ultra-Large" category. Its most direct competitors are DeepSeek-V3 and Llama 3.1 405B.
For developers building local autonomous agents or private "Vibe Coding" environments, GLM-5 is currently the most capable MIT-licensed model available in the 700B+ parameter class.
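Once the model is pulled, agents and coding tools can talk to it programmatically. Below is a minimal sketch of hitting the local Ollama HTTP endpoint, assuming the default localhost:11434 address and that the model is tagged `glm-5` as in the command above; the prompt and context-window option are illustrative.

```python
# Minimal sketch of querying a locally running GLM-5 through Ollama's HTTP API.
# Assumes Ollama is serving on its default port and the model tag is "glm-5".
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "glm-5",
        "prompt": "Outline a migration plan for splitting this monolith into services.",
        "stream": False,
        # Raise the context window for long-document / repository-scale prompts.
        "options": {"num_ctx": 32768},
    },
    timeout=600,
)
print(response.json()["response"])
```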