
Zhipu AI's flagship 744B MoE model designed for continuous 8-hour autonomous engineering tasks and sustained multi-turn optimization.
Copy and paste this command to start running the model locally:

```bash
ollama run glm-5.1
```

Access model weights, configuration files, and documentation.
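If you prefer to drive the model from a script rather than the interactive CLI, the sketch below sends a single prompt to the local Ollama server's HTTP endpoint. It assumes the default port (11434) and that the model tag matches the pull command above; the prompt text is only a placeholder.

```python
# Minimal sketch: query a locally running Ollama instance over its HTTP API.
# Assumes Ollama is serving on the default port 11434 and that the tag
# "glm-5.1" matches what `ollama run glm-5.1` pulled.
import json
import urllib.request

payload = json.dumps({
    "model": "glm-5.1",
    "prompt": "Summarize the purpose of a Mixture of Experts router.",
    "stream": False,  # return one JSON object instead of a token stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```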
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 79.3 GB | Low |
| Q4_K_M (Recommended) | 87.7 GB | Good |
| Q5_K_M | 91.7 GB | Very Good |
| Q6_K | 96.5 GB | Excellent |
| Q8_0 | 106.5 GB | Near Perfect |
| FP16 | 144.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H200 SXM 141GB | NVIDIA | 44.1 tok/s | 87.7 GB |
| NVIDIA B200 GPU | NVIDIA | 73.4 tok/s | 87.7 GB |
| SuperMicro Super AI Station | SuperMicro | 65.2 tok/s | 87.7 GB |
| Gigabyte W775-V10-L01 | Gigabyte | 65.2 tok/s | 87.7 GB |
| Google Cloud TPU v5p | Google | 25.4 tok/s | 87.7 GB |
GLM-5.1 is the latest flagship model from Z.ai (formerly Zhipu AI), engineered specifically for "agentic engineering" and long-horizon autonomous tasks. With a massive 744B total parameters utilizing a Mixture of Experts (MoE) architecture, it is designed to maintain high-level reasoning and productivity over sustained 8-hour sessions. Unlike previous generations that often plateau after initial attempts, GLM-5.1 is optimized for iterative refinement, making it a direct competitor to closed-source heavyweights like Claude 4.6 Opus and GPT-5.4.
Released under the MIT license, GLM-5.1 represents a significant milestone for local AI practitioners. It currently leads the SWE-Bench Pro leaderboard, signaling its intent to serve as the primary engine for autonomous coding agents. The model excels at breaking down ambiguous, multi-step problems, executing experiments, and self-correcting based on terminal feedback—capabilities that are critical for developers who want to run GLM-5.1 locally for secure, private development workflows.
The efficiency of GLM-5.1 stems from its Mixture of Experts (MoE) architecture. While the model contains 744B total parameters, it only activates 40B parameters per token during inference. This design provides the reasoning depth of a trillion-parameter class model while maintaining the inference speed and throughput of a much smaller dense model.
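To make the "active parameters" idea concrete, here is a toy top-k routing sketch. It is purely illustrative (tiny dimensions, random weights, NumPy) and is not GLM-5.1's actual router, but it shows why only the selected experts' weights participate in each token's forward pass.

```python
# Conceptual sketch of top-k Mixture of Experts routing (illustrative only;
# not GLM-5.1's real architecture). Only the chosen experts do any work.
import numpy as np

HIDDEN, N_EXPERTS, TOP_K = 64, 8, 2   # toy sizes; the real model is far larger

rng = np.random.default_rng(0)
router_w = rng.normal(size=(HIDDEN, N_EXPERTS))               # router projection
experts = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(N_EXPERTS)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router_w                                     # one score per expert
    top = np.argsort(scores)[-TOP_K:]                         # keep the k best experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum() # softmax over the winners
    # The remaining experts stay idle for this token, which is where the
    # "40B active out of 744B total" efficiency comes from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=HIDDEN)
print(moe_forward(token).shape)   # (64,) -- produced by just 2 of the 8 experts
```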
The 200k context window is a critical feature for engineers. It allows the model to ingest entire repositories or extensive documentation, facilitating the "long-horizon" tasks Z.ai emphasizes. Because only 40B parameters are active during the forward pass, the model achieves a higher tokens per second rate than dense models of similar total size, provided the hardware can accommodate the massive VRAM footprint required to store the full 744B weights.
GLM-5.1 is not a general-purpose chatbot; it is a specialized tool for complex reasoning and autonomous execution. Its training focuses on the "full loop" of engineering: planning, execution, result analysis, and iterative optimization.
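As a hedged sketch of what that loop can look like when driven locally, the snippet below wires the same Ollama endpoint into a bounded plan, execute, analyze-feedback, retry cycle. The helper name, task string, and iteration budget are all hypothetical; the point is only to illustrate the self-correction pattern, not a production agent.

```python
# Hypothetical plan-execute-analyze loop of the kind described above.
# ask_model() wraps the local Ollama API; the task and retry budget are placeholders.
import json
import subprocess
import urllib.request

def ask_model(prompt: str) -> str:
    body = json.dumps({"model": "glm-5.1", "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

task = "Write a Python one-liner that prints the first 10 square numbers."
attempt = ask_model(task)

for _ in range(3):                                   # bounded self-correction loop
    result = subprocess.run(["python", "-c", attempt],
                            capture_output=True, text=True, timeout=30)
    if result.returncode == 0:
        print(result.stdout)
        break
    # Feed the terminal error back so the model can revise its own output.
    attempt = ask_model(f"This code failed:\n{attempt}\nError:\n{result.stderr}\n"
                        "Return only a corrected one-liner.")
```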
Running a 744B parameter model locally is a significant hardware challenge. Even with MoE efficiency, the primary bottleneck is VRAM capacity, as the entire 744B model must be loaded into memory to avoid extreme latencies.
To run this model, you must choose a quantization level based on your available VRAM. Running the full model in FP16 would require roughly 1.5 TB of memory (744B parameters at 2 bytes each), which is impractical for most local setups.
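As a rough guide, a small helper like the one below can pick the highest-quality format from the table above that fits your card while leaving headroom for the KV cache. The sizes are copied from the table; the headroom figure is an assumption, not a measured value.

```python
# Pick the largest quantization from the table above that fits a VRAM budget.
# Sizes come from the table; adjust them if the published GGUF files differ.
QUANT_SIZES_GB = {
    "Q2_K": 79.3, "Q4_K_M": 87.7, "Q5_K_M": 91.7,
    "Q6_K": 96.5, "Q8_0": 106.5, "FP16": 144.5,
}

def best_fit(vram_gb: float, headroom_gb: float = 4.0) -> str | None:
    """Return the highest-quality format that still leaves room for the KV cache."""
    usable = vram_gb - headroom_gb
    fitting = [(size, fmt) for fmt, size in QUANT_SIZES_GB.items() if size <= usable]
    return max(fitting)[1] if fitting else None

print(best_fit(96))   # e.g. 96 GB available -> 'Q5_K_M'
```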
Once you have chosen a quantization, start the model with `ollama run glm-5.1`. Note that Ollama will automatically offload as many layers to the GPU as your available VRAM allows, keeping the rest in system RAM.

GLM-5.1 occupies the ultra-large-scale open-weights category. Its primary competitors are DeepSeek-V3 and Llama 3.1 405B.
For developers building local autonomous agents, GLM-5.1 is currently the highest-performing open-weights option available, provided you have the enterprise-grade hardware required to host it.