
The foundational model of the GLM-4 series, unifying reasoning, coding, and tool use in a massive 355B-parameter framework.
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 45.1 GB | Low |
| Q4_K_M (Recommended) | 51.8 GB | Good |
| Q5_K_M | 55.0 GB | Very Good |
| Q6_K | 58.9 GB | Excellent |
| Q8_0 | 66.9 GB | Near Perfect |
| FP16 | 97.3 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Quality Level | Speed | VRAM Used |
|---|---|---|---|---|
| Google Cloud TPU v5p | Google | SS | 42.9 tok/s | 51.8 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 52.0 tok/s | 51.8 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 74.6 tok/s | 51.8 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 124.3 tok/s | 51.8 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 31.7 tok/s | 51.8 GB |
| SuperMicro Super AI Station | SuperMicro | SS | 110.3 tok/s | 51.8 GB |
| Gigabyte W775-V10-L01 | Gigabyte | SS | 110.3 tok/s | 51.8 GB |
GLM-4.5 is a large-scale Mixture-of-Experts (MoE) foundation model developed by Z.ai. With a total parameter count of 355B, it is designed to compete with the industry's largest proprietary models while remaining accessible via an MIT license. Unlike dense models of similar scale, GLM-4.5 utilizes a sparse architecture that activates only 32B parameters during any single forward pass, significantly reducing the computational overhead for inference without sacrificing the reasoning depth associated with high parameter counts.
This model is specifically engineered for "agentic" workflows, focusing on multi-step reasoning, complex code generation, and tool invocation. It bridges the gap between general-purpose chat models and specialized reasoning engines, making it a primary candidate for developers building local autonomous agents or sophisticated RAG (Retrieval-Augmented Generation) pipelines.
The defining characteristic of GLM-4.5 is its MoE (Mixture-of-Experts) architecture. While the model sits at a massive 355B parameters, the 32B active parameters mean that its inference latency is more comparable to a medium-sized dense model than a monolithic 300B+ model. This efficiency makes it possible to run GLM-4.5 locally on high-end workstation hardware that would otherwise struggle with dense models of this magnitude.
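As a rough illustration of that efficiency gap, the sketch below compares per-token compute for a fully dense 355B model against GLM-4.5's 32B active parameters, using the common ~2 FLOPs-per-parameter approximation (a rule of thumb, not a measured figure for this model):

```python
# Back-of-the-envelope: per-token compute for dense vs. MoE inference.
# Parameter counts come from the text above; the 2-FLOPs-per-parameter
# approximation is a standard simplification, not a benchmark of GLM-4.5.
TOTAL_PARAMS = 355e9     # every expert must still be resident in memory
ACTIVE_PARAMS = 32e9     # parameters actually touched per forward pass

def flops_per_token(params: float) -> float:
    """Approximate forward-pass FLOPs per generated token (~2 * params)."""
    return 2 * params

dense_cost = flops_per_token(TOTAL_PARAMS)
moe_cost = flops_per_token(ACTIVE_PARAMS)
print(f"Dense 355B     : {dense_cost / 1e12:.1f} TFLOPs/token")
print(f"MoE, 32B active: {moe_cost / 1e12:.1f} TFLOPs/token")
print(f"Per-token compute ratio: {dense_cost / moe_cost:.1f}x")
```

The memory footprint, however, is still governed by the full 355B parameters, which is why the VRAM discussion below matters so much.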
The 128k context window is a critical feature for practitioners. It allows for the ingestion of entire code repositories or lengthy technical documentation, which is essential for the model’s primary use cases in coding and complex reasoning. The MIT license is a notable differentiator, providing significantly more freedom for secondary development and commercial deployment compared to the more restrictive licenses found in the Llama or Mistral families.
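A quick way to check whether a given repository actually fits inside that window is to count tokens with the model's own tokenizer. The sketch below assumes the tokenizer is published on Hugging Face under the zai-org/GLM-4.5 repo id; verify the identifier for your setup.

```python
# Hypothetical pre-flight check: does a project's source fit in the 128k window?
# The Hugging Face repo id below is an assumption -- substitute the tokenizer
# you actually deploy.
from pathlib import Path
from transformers import AutoTokenizer

CONTEXT_LIMIT = 128_000
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.5")  # assumed repo id

def repo_token_count(root: str, suffix: str = ".py") -> int:
    """Sum the tokenized length of every matching file under `root`."""
    total = 0
    for path in Path(root).rglob(f"*{suffix}"):
        total += len(tokenizer.encode(path.read_text(errors="ignore")))
    return total

tokens = repo_token_count("./my_project")
print(f"{tokens} tokens ({tokens / CONTEXT_LIMIT:.0%} of the 128k window)")
```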
GLM-4.5 is positioned as a foundation model for "ARC" (Agentic, Reasoning, and Coding). In practice, this translates to superior performance in tasks that require logical branching and precise instruction following.
The model features a native "Thinking Mode," which allows it to perform internal chain-of-thought processing before delivering a final answer. This is particularly effective for mathematical problem solving and scientific computing. On benchmarks like GPQA Diamond and MATH 500, GLM-4.5 has demonstrated performance that rivals top-tier proprietary models, making it a viable local alternative for high-stakes analytical tasks.
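If you serve the model behind an OpenAI-compatible endpoint (vLLM or SGLang, as discussed below), Thinking Mode is typically toggled through the chat template. The exact keyword argument used here is an assumption; check the model card and your server's documentation for the switch it actually exposes.

```python
# Sketch: toggling Thinking Mode through an OpenAI-compatible local server.
# The base URL, served model name, and `enable_thinking` kwarg are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="glm-4.5",  # whatever name your server registered for the weights
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},  # assumed kwarg
)
print(response.choices[0].message.content)
```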
Z.ai has optimized GLM-4.5 for tool invocation and web browsing. For developers building agents, this means the model is less likely to hallucinate function arguments and more capable of handling multi-turn tool interactions. It is natively compatible with agent frameworks like Claude Code and Roo Code.
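A minimal tool-calling round trip against a locally served GLM-4.5 might look like the sketch below; the endpoint, served model name, and the weather function are illustrative assumptions, but the request shape is the standard OpenAI tools schema that vLLM and SGLang expose.

```python
# Sketch: single-turn tool invocation via the OpenAI-compatible tools schema.
# Endpoint, model name, and the get_weather function are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.5",
    messages=[{"role": "user", "content": "Do I need an umbrella in Berlin today?"}],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))  # e.g. get_weather {'city': 'Berlin'}
```

In a real agent loop, you would execute the returned function, append its output as a `tool` message, and call the endpoint again until the model produces a final answer.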
GLM-4.5 excels in "Vibe Coding" (UI/UX generation) and core software engineering. It produces modern, clean code for web front-ends and handles complex terminal-based tasks with high accuracy. Because it was trained on a diverse multilingual corpus, it maintains high performance in both English and Chinese coding environments.
Running a 355B parameter model locally is a significant hardware undertaking. Even with the MoE architecture's efficiency in compute, the VRAM requirements are dictated by the total parameter count, as the entire model must typically reside in memory.
The "weights-in-memory" rule applies here. At 16-bit precision (FP16), the model would require over 700GB of VRAM, which is impractical for local setups. Practitioners must use quantization to bring this model within reach of workstation hardware.
The quickest way to run GLM-4.5 locally is via Ollama, which handles the GGUF quantizations and MoE routing logic automatically. For those seeking maximum throughput, vLLM or SGLang are preferred, as they support FP8 versions of the model and optimized kernels for the MoE architecture. In a multi-GPU environment, expect between 5 and 15 tokens per second depending on your quantization level and interconnect speed.
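For a quick first run, the Ollama Python client wraps everything in a single call; the model tag below is an assumption, so confirm the tag actually published for this model before pulling it.

```python
# Minimal local chat via the Ollama Python client.
# The "glm4.5" tag is an assumption -- confirm it with `ollama list` or the
# Ollama model library before use.
import ollama

response = ollama.chat(
    model="glm4.5",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    options={"num_ctx": 32768},  # shrink the context window to fit available VRAM
)
print(response["message"]["content"])
```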
When evaluating whether to run GLM-4.5 locally, it is most often compared against other high-parameter open-weight models.
Llama 3.1 405B is a dense model, meaning every parameter is active during inference. While Llama 405B may offer slightly higher general knowledge saturation, GLM-4.5's MoE architecture makes it significantly faster in terms of tokens per second on comparable hardware. Furthermore, GLM-4.5’s MIT license is less restrictive for large-scale commercial applications than Meta’s Community License.
DeepSeek-V3 is another prominent MoE model. While DeepSeek often leads in pure coding benchmarks, GLM-4.5 is frequently cited for better "agentic" alignment—specifically its ability to follow complex system prompts and handle multi-step tool use without drifting. GLM-4.5 also offers a more robust "Thinking Mode" out of the box for users who need explicit chain-of-thought reasoning.
For users who cannot meet the 200GB+ VRAM requirement, the GLM-4.5-Air variant is the logical alternative. It features 106B total parameters (12B active) and can fit into roughly 64-80 GB of VRAM at 4-bit quantization, making it runnable on a single A100 (80GB) or a dual RTX 4090 setup. While the "Air" version is more efficient, the full GLM-4.5 provides a noticeable jump in "Humanity’s Last Exam" (HLE) and other graduate-level reasoning benchmarks.