
Zhipu AI's flagship 744B MoE model designed for continuous 8-hour autonomous engineering tasks and sustained multi-turn optimization.
Copy and paste this command to start running the model locally:

```bash
ollama run glm-5.1
```

Access model weights, configuration files, and documentation.
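If you prefer to drive the model from a script rather than the interactive CLI, the sketch below sends a single prompt to the local Ollama server's HTTP endpoint. It assumes the default port (11434) and that the model tag matches the pull command above; the prompt text is only a placeholder.

```python
# Minimal sketch: query a locally running Ollama instance over its HTTP API.
# Assumes Ollama is serving on the default port 11434 and that the tag
# "glm-5.1" matches what `ollama run glm-5.1` pulled.
import json
import urllib.request

payload = json.dumps({
    "model": "glm-5.1",
    "prompt": "Summarize the purpose of a Mixture of Experts router.",
    "stream": False,  # return one JSON object instead of a token stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```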
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 79.3 GB | Low |
| Q4_K_M (Recommended) | 87.7 GB | Good |
| Q5_K_M | 91.7 GB | Very Good |
| Q6_K | 96.5 GB | Excellent |
| Q8_0 | 106.5 GB | Near Perfect |
| FP16 | 144.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H200 SXM 141GB | NVIDIA | 44.1 tok/s | 87.7 GB |
| NVIDIA B200 GPU | NVIDIA | 73.4 tok/s | 87.7 GB |
| SuperMicro Super AI Station | SuperMicro | 65.2 tok/s | 87.7 GB |
| Gigabyte W775-V10-L01 | Gigabyte | 65.2 tok/s | 87.7 GB |
| Google Cloud TPU v5p | Google | 25.4 tok/s | 87.7 GB |
GLM-5.1 is the latest flagship model from Z.ai (formerly Zhipu AI), engineered specifically for "agentic engineering" and long-horizon autonomous tasks. With a massive 744B total parameters utilizing a Mixture of Experts (MoE) architecture, it is designed to maintain high-level reasoning and productivity over sustained 8-hour sessions. Unlike previous generations that often plateau after initial attempts, GLM-5.1 is optimized for iterative refinement, making it a direct competitor to closed-source heavyweights like Claude 4.6 Opus and GPT-5.4.
Released under the MIT license, GLM-5.1 represents a significant milestone for local AI practitioners. It currently leads the SWE-Bench Pro leaderboard, signaling its intent to serve as the primary engine for autonomous coding agents. The model excels at breaking down ambiguous, multi-step problems, executing experiments, and self-correcting based on terminal feedback—capabilities that are critical for developers who want to run GLM-5.1 locally for secure, private development workflows.
The efficiency of GLM-5.1 stems from its Mixture of Experts (MoE) architecture. While the model contains 744B total parameters, it only activates 40B parameters per token during inference. This design provides the reasoning depth of a trillion-parameter class model while maintaining the inference speed and throughput of a much smaller dense model.
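To make the "active parameters" idea concrete, here is a toy top-k routing sketch. It is purely illustrative (tiny dimensions, random weights, NumPy) and is not GLM-5.1's actual router, but it shows why only the selected experts' weights participate in each token's forward pass.

```python
# Conceptual sketch of top-k Mixture of Experts routing (illustrative only;
# not GLM-5.1's real architecture). Only the chosen experts do any work.
import numpy as np

HIDDEN, N_EXPERTS, TOP_K = 64, 8, 2   # toy sizes; the real model is far larger

rng = np.random.default_rng(0)
router_w = rng.normal(size=(HIDDEN, N_EXPERTS))               # router projection
experts = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(N_EXPERTS)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router_w                                     # one score per expert
    top = np.argsort(scores)[-TOP_K:]                         # keep the k best experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum() # softmax over the winners
    # The remaining experts stay idle for this token, which is where the
    # "40B active out of 744B total" efficiency comes from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=HIDDEN)
print(moe_forward(token).shape)   # (64,) -- produced by just 2 of the 8 experts
```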
The 200k context window is a critical feature for engineers. It allows the model to ingest entire repositories or extensive documentation, facilitating the "long-horizon" tasks Z.ai emphasizes. Because only 40B parameters are active during the forward pass, the model achieves a higher tokens per second rate than dense models of similar total size, provided the hardware can accommodate the massive VRAM footprint required to store the full 744B weights.
GLM-5.1 is not a general-purpose chatbot; it is a specialized tool for complex reasoning and autonomous execution. Its training focuses on the "full loop" of engineering: planning, execution, result analysis, and iterative optimization.
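As a hedged sketch of what that loop can look like when driven locally, the snippet below wires the same Ollama endpoint into a bounded plan, execute, analyze-feedback, retry cycle. The helper name, task string, and iteration budget are all hypothetical; the point is only to illustrate the self-correction pattern, not a production agent.

```python
# Hypothetical plan-execute-analyze loop of the kind described above.
# ask_model() wraps the local Ollama API; the task and retry budget are placeholders.
import json
import subprocess
import urllib.request

def ask_model(prompt: str) -> str:
    body = json.dumps({"model": "glm-5.1", "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

task = "Write a Python one-liner that prints the first 10 square numbers."
attempt = ask_model(task)

for _ in range(3):                                   # bounded self-correction loop
    result = subprocess.run(["python", "-c", attempt],
                            capture_output=True, text=True, timeout=30)
    if result.returncode == 0:
        print(result.stdout)
        break
    # Feed the terminal error back so the model can revise its own output.
    attempt = ask_model(f"This code failed:\n{attempt}\nError:\n{result.stderr}\n"
                        "Return only a corrected one-liner.")
```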
Running a 744B parameter model locally is a significant hardware challenge. Even with MoE efficiency, the primary bottleneck is VRAM capacity, as the entire 744B model must be loaded into memory to avoid extreme latencies.
To run this model, you must choose a quantization level based on your available VRAM. Running the full model in FP16 would require roughly 1.5 TB of memory (744B parameters at 2 bytes each), which is impractical for most local setups.
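As a rough guide, a small helper like the one below can pick the highest-quality format from the table above that fits your card while leaving headroom for the KV cache. The sizes are copied from the table; the headroom figure is an assumption, not a measured value.

```python
# Pick the largest quantization from the table above that fits a VRAM budget.
# Sizes come from the table; adjust them if the published GGUF files differ.
QUANT_SIZES_GB = {
    "Q2_K": 79.3, "Q4_K_M": 87.7, "Q5_K_M": 91.7,
    "Q6_K": 96.5, "Q8_0": 106.5, "FP16": 144.5,
}

def best_fit(vram_gb: float, headroom_gb: float = 4.0) -> str | None:
    """Return the highest-quality format that still leaves room for the KV cache."""
    usable = vram_gb - headroom_gb
    fitting = [(size, fmt) for fmt, size in QUANT_SIZES_GB.items() if size <= usable]
    return max(fitting)[1] if fitting else None

print(best_fit(96))   # e.g. 96 GB available -> 'Q5_K_M'
```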
Once you have chosen a quantization, start the model with `ollama run glm-5.1`. Note that Ollama will automatically offload as many layers to the GPU as your available VRAM allows, keeping the rest in system RAM.

GLM-5.1 occupies the ultra-large-scale open-weights category. Its primary competitors are DeepSeek-V3 and Llama 3.1 405B.
For developers building local autonomous agents, GLM-5.1 is currently the highest-performing open-weights option available, provided you have the enterprise-grade hardware required to host it.