
A massive 744B parameter open-weights MoE model integrating DeepSeek Sparse Attention, built for long-term planning and resource management.
Copy and paste this command to start running the model locally.
```
ollama run glm-5
```
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 79.3 GB | Low |
| Q4_K_M (Recommended) | 87.7 GB | Good |
| Q5_K_M | 91.7 GB | Very Good |
| Q6_K | 96.5 GB | Excellent |
| Q8_0 | 106.5 GB | Near Perfect |
| FP16 | 144.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Rating | Speed | VRAM Used |
|---|---|---|---|---|
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 44.1 tok/s | 87.7 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 73.4 tok/s | 87.7 GB |
| SuperMicro Super AI Station | SuperMicro | SS | 65.2 tok/s | 87.7 GB |
| Gigabyte W775-V10-L01 | Gigabyte | SS | 65.2 tok/s | 87.7 GB |
| Google Cloud TPU v5p | Google | BB | 25.4 tok/s | 87.7 GB |
GLM-5 is Z.ai’s flagship 744B parameter Mixture-of-Experts (MoE) model, engineered specifically for long-horizon agentic tasks and complex systems engineering. Released under the MIT license, it represents a significant push into the ultra-large-scale open-weights category, competing directly with frontier models like DeepSeek-V3 and Llama 3.1 405B. By leveraging a sparse architecture, GLM-5 attempts to bridge the gap between massive knowledge capacity and practical inference efficiency.
The model is a successor to the GLM-4 series, scaling the training data to 28.5T tokens. Its primary value proposition lies in its ability to handle "agentic engineering"—tasks that require the model to not just write snippets of code, but to reason through entire repositories, manage resource allocation, and execute multi-step planning. For developers looking to run GLM-5 locally, the model offers a high-intelligence alternative to proprietary APIs, provided they have the specialized hardware required to host a 744B parameter weights file.
GLM-5 utilizes a Mixture-of-Experts (MoE) architecture that totals 744 billion parameters, yet only activates 40 billion parameters per token during inference. This disparity is critical for local practitioners: while the VRAM requirements are dictated by the full 744B parameter count, the inference speed (tokens per second) is more comparable to a 40B-50B dense model.
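To make the speed side of that trade-off concrete, the sketch below computes a bandwidth-bound ceiling on decode speed from the active parameter count alone. The bits-per-weight and memory-bandwidth figures are illustrative assumptions, not measurements, and real throughput will be noticeably lower once KV-cache reads and kernel overheads are included.

```python
# Rough, bandwidth-bound estimate of decode speed for an MoE model.
# Illustrative only: the bandwidth figure and bits-per-weight are assumptions,
# and real throughput is lower due to KV-cache reads, kernel overhead, etc.

def decode_tokens_per_second(active_params: float,
                             bits_per_weight: float,
                             mem_bandwidth_gb_per_s: float) -> float:
    # Each generated token must stream roughly the active weights once.
    bytes_per_token = active_params * bits_per_weight / 8
    return mem_bandwidth_gb_per_s * 1e9 / bytes_per_token

ACTIVE = 40e9  # GLM-5 activates ~40B parameters per token

# Ceiling at ~4.5 bits/weight on a hypothetical accelerator with ~4.8 TB/s of HBM bandwidth.
print(f"~{decode_tokens_per_second(ACTIVE, 4.5, 4800):.0f} tok/s theoretical ceiling")
```

This is why the table above shows double-digit token rates on data-center GPUs: the hardware only has to stream the 40B active parameters per token, not the full 744B.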
A standout technical feature is the integration of DeepSeek Sparse Attention (DSA). This mechanism is designed to reduce the computational overhead of the model’s 200,000-token context window. By optimizing how the model attends to distant tokens, DSA allows for more efficient processing of massive codebases or long technical documents without the quadratic attention cost and ballooning KV cache typically seen in standard dense transformers.
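To see why attention over a 200K-token window is expensive in the first place, here is a rough estimate of the KV cache a standard dense transformer would hold at that context length. The layer and head dimensions are hypothetical placeholders (GLM-5’s exact configuration is not given here); the point is only the order of magnitude that DSA and related techniques have to contend with.

```python
# Illustrative KV-cache estimate for a 200K-token context under standard dense
# attention. The layer/head dimensions below are placeholders, NOT GLM-5's
# actual configuration, which is not specified here.

def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    # 2x for the key and value tensors stored at every layer.
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_elem / 1e9

# Hypothetical configuration for illustration only.
print(f"{kv_cache_gb(200_000, n_layers=80, n_kv_heads=8, head_dim=128):.1f} GB")
```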
Z.ai utilized a custom asynchronous reinforcement learning infrastructure called "slime" to improve training throughput. This resulted in a model with high "intelligence efficiency," meaning it achieves better reasoning and coding performance per FLOP compared to its predecessors. The 200k context length is fully functional for retrieval and reasoning, making it a viable candidate for local RAG (Retrieval-Augmented Generation) over large private datasets.
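If you do point GLM-5 at a local RAG pipeline, the main practical task is budgeting retrieved chunks against the 200K-token window. The sketch below is a minimal illustration under stated assumptions: it uses a crude 4-characters-per-token estimate rather than the model’s real tokenizer, and the function and parameter names are hypothetical.

```python
# Minimal sketch of packing retrieved chunks into a 200K-token context for a
# local RAG setup. The 4-chars-per-token heuristic is a rough assumption; use
# the model's actual tokenizer for precise budgeting.

def pack_context(chunks: list[str], budget_tokens: int = 200_000,
                 reserve_for_answer: int = 8_000) -> str:
    packed, used = [], 0
    for chunk in chunks:                  # chunks assumed pre-sorted by relevance
        est_tokens = len(chunk) // 4 + 1  # crude token estimate
        if used + est_tokens > budget_tokens - reserve_for_answer:
            break
        packed.append(chunk)
        used += est_tokens
    return "\n\n---\n\n".join(packed)

context = pack_context(["first retrieved chunk ...", "second retrieved chunk ..."])
```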
GLM-5 is tuned for high-logic, low-fluff outputs. It excels in environments where the model must act as an autonomous agent rather than a simple chatbot.
Running a 744B parameter model locally is a significant engineering challenge that moves beyond standard consumer hardware. To run GLM-5 locally, you must account for the massive footprint of the weights.
The primary bottleneck is VRAM. Even with aggressive quantization, a 744B model is too large for a single or even dual RTX 4090 setup.
For most practitioners, Q4_K_M (4-bit) is the target. Using anything lower than 3-bit quantization on an MoE model often leads to "expert collapse," where the routing logic degrades, and the model loses its reasoning edge. If you are constrained by VRAM, it is often better to run a smaller dense model (like Llama 3.1 70B) at high precision than GLM-5 at 2-bit.
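As a rule of thumb, you can pick the quantization level mechanically from the VRAM you actually have. The helper below is a small sketch that walks the footprint table shown earlier, from highest to lowest quality; per the caveat above, treat a Q2_K result as a signal to switch to a smaller model rather than a recommendation.

```python
# Small helper that picks the highest-quality quantization that fits in the
# available VRAM, using the footprint figures from the table above.

QUANT_TABLE = [            # (format, VRAM required in GB), from the table above
    ("FP16",   144.5),
    ("Q8_0",   106.5),
    ("Q6_K",    96.5),
    ("Q5_K_M",  91.7),
    ("Q4_K_M",  87.7),
    ("Q2_K",    79.3),     # below 3-bit risks "expert collapse" on MoE models
]

def pick_quant(available_vram_gb: float) -> str | None:
    for fmt, required in QUANT_TABLE:      # ordered best quality first
        if required <= available_vram_gb:
            return fmt
    return None                            # model will not fit at all

print(pick_quant(96.0))   # -> "Q5_K_M"
```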
The simplest entry point is Ollama: run `ollama run glm-5`. Note that Ollama will automatically attempt to offload layers to your GPU and spill the remainder into system RAM, but performance will drop below 1 token/second if the model spills into system DDR4/DDR5 memory.

GLM-5 sits in the "Ultra-Large" category. Its most direct competitors are DeepSeek-V3 and Llama 3.1 405B.
For developers building local autonomous agents or private "Vibe Coding" environments, GLM-5 is currently the most capable MIT-licensed model available in the 700B+ parameter class.
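Once the model is pulled, agents and coding tools can talk to it programmatically. Below is a minimal sketch of hitting the local Ollama HTTP endpoint, assuming the default localhost:11434 address and that the model is tagged `glm-5` as in the command above; the prompt and context-window option are illustrative.

```python
# Minimal sketch of querying a locally running GLM-5 through Ollama's HTTP API.
# Assumes Ollama is serving on its default port and the model tag is "glm-5".
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "glm-5",
        "prompt": "Outline a migration plan for splitting this monolith into services.",
        "stream": False,
        # Raise the context window for long-document / repository-scale prompts.
        "options": {"num_ctx": 32768},
    },
    timeout=600,
)
print(response.json()["response"])
```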