
An optimized 358B parameter MoE model featuring Interleaved and Preserved Thinking for stabilized multi-step task execution.
Copy and paste this command to start running the model locally:

ollama run glm-4.7

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 45.9 GB | Low |
| Q4_K_M (Recommended) | 52.6 GB | Good |
| Q5_K_M | 55.8 GB | Very Good |
| Q6_K | 59.7 GB | Excellent |
| Q8_0 | 67.7 GB | Near Perfect |
| FP16 | 98.1 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| Google Cloud TPU v5p | Google | SS | 42.3 tok/s | 52.6 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 51.3 tok/s | 52.6 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 73.4 tok/s | 52.6 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 122.4 tok/s | 52.6 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 31.2 tok/s | 52.6 GB |
| SuperMicro Super AI Station | SuperMicro | SS | 108.6 tok/s | 52.6 GB |
| Gigabyte W775-V10-L01 | Gigabyte | SS | 108.6 tok/s | 52.6 GB |
GLM-4.7, developed by Z.ai, is a flagship 358B parameter Mixture-of-Experts (MoE) model designed for high-end local orchestration and complex agentic workflows. While its total parameter count is massive, the MoE architecture activates only 32B parameters during any single forward pass, positioning it as a direct competitor to other large-scale open-weight models like DeepSeek-V3 and Llama-3-405B. It is released under the permissive MIT license, making it a viable candidate for commercial local deployments where data privacy and custom fine-tuning are priorities.
What distinguishes GLM-4.7 from its predecessors and contemporaries is its focus on "Interleaved and Preserved Thinking." This approach stabilizes multi-step task execution, allowing the model to "think before acting" when integrated into agentic frameworks. For practitioners, this translates to higher reliability in long-running tasks such as multi-file software engineering, complex mathematical reasoning, and autonomous tool use.
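The practical effect of preserving thinking is simplest to see at the conversation-state level: intermediate reasoning is kept in the running message history rather than thrown away between steps. The sketch below illustrates that general pattern only; it is not Z.ai's internal implementation, and `call_model` is a hypothetical stand-in for any local chat call.

```python
# Schematic sketch of the "preserved thinking" pattern: the model's intermediate
# reasoning is kept in the running message history instead of being discarded
# between agent steps, so later steps can build on earlier conclusions.
# This illustrates the general idea only, not Z.ai's internal implementation;
# `call_model` is a hypothetical stand-in for any local chat call.
from typing import Callable, Dict, List

Message = Dict[str, str]

def run_step(history: List[Message], user_msg: str,
             call_model: Callable[[List[Message]], str]) -> str:
    history.append({"role": "user", "content": user_msg})
    reply = call_model(history)  # reply may include an explicit reasoning segment
    # Keep the full reply, reasoning included, so the next step sees it.
    history.append({"role": "assistant", "content": reply})
    return reply

history: List[Message] = []  # persists across every step of the task
```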
GLM-4.7 uses a sparse Mixture-of-Experts (MoE) architecture. Of its 358B total parameters, only 32B are active per token during inference. This offers a specific advantage for local deployments: you get the knowledge density and reasoning depth of a 300B+ model with per-token compute, and therefore tokens-per-second throughput, closer to that of a 30B-70B dense model.
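Sparse activation is easiest to picture with a toy top-k router: only a handful of expert networks run for each token while the rest stay idle. The expert count, top-k value, and dimensions below are illustrative only and do not describe GLM-4.7's actual router or configuration.

```python
# Toy top-k MoE routing in NumPy: only k experts run per token, the rest are skipped.
# Expert count, k, and dimensions are illustrative, not GLM-4.7's real configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))            # gating network
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                                    # score every expert
    chosen = np.argsort(logits)[-top_k:]                     # indices of the k best experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    # Only the chosen experts are evaluated; this is where the compute saving comes from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,)
```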
Key technical specifications include:

- Total parameters: 358B (sparse Mixture-of-Experts)
- Active parameters per token: 32B
- Context window: 128k tokens
- License: MIT
The 128k context window is particularly robust, supporting a maximum output capacity of 128,000 tokens. This makes GLM-4.7 suitable for "Long-Context Reasoning" (LCR) tasks, such as analyzing entire codebases or dense technical documentation, without the performance degradation often seen in smaller models when the KV cache fills up.
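One practical way to exploit that window is to pack an entire small codebase into a single prompt. The sketch below uses a crude 4-characters-per-token heuristic and a Python-file filter purely as assumptions; adjust both for your tokenizer and project.

```python
# Rough sketch: pack a small codebase into one long-context prompt while staying
# under a token budget. The ~4 chars/token heuristic, the 128k budget, and the
# *.py filter are assumptions; adjust for your tokenizer and deployment.
from pathlib import Path

MAX_TOKENS = 128_000
CHARS_PER_TOKEN = 4  # crude estimate

def build_codebase_prompt(root: str, question: str) -> str:
    budget = MAX_TOKENS * CHARS_PER_TOKEN
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*.py")):
        chunk = f"\n### {path}\n{path.read_text(errors='ignore')}"
        if used + len(chunk) > budget:
            break                       # stop before overflowing the context window
        parts.append(chunk)
        used += len(chunk)
    return "".join(parts) + f"\n\nQuestion: {question}\n"
```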
GLM-4.7 is optimized for high-logic, high-precision workloads. It is not merely a chat model; it is designed to function as the "brain" of an AI agent.
The model shows significant gains in what the developers call "Vibe Coding"—the ability to generate cleaner, modern UI/UX code with accurate layouts and sizing. On the SWE-bench Verified metric, GLM-4.7 scores 73.8%, indicating a high proficiency in generating real-world software patches. It is specifically optimized for use within agent frameworks like Claude Code, Cline, and Roo Code.
With a 95.7% accuracy on AIME 2025 and an 85.7% on GPQA-Diamond, GLM-4.7 competes at the absolute frontier of mathematical and graduate-level scientific reasoning. It is an ideal choice for local RAG (Retrieval-Augmented Generation) systems in STEM fields where logical consistency is non-negotiable.
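A local RAG pipeline around GLM-4.7 only needs a retrieval step in front of the chat call. The sketch below embeds passages with a separate local embedding model served by Ollama and ranks them by cosine similarity; the embedding model name (nomic-embed-text) and endpoint usage are assumptions, and any local embedding model would work.

```python
# Minimal local retrieval sketch for a STEM RAG pipeline: embed passages with a
# local embedding model served by Ollama, rank by cosine similarity, and pass
# the top hits to GLM-4.7 as context. The model name "nomic-embed-text" is an
# example, not a requirement.
import numpy as np
import requests

def embed(text: str) -> np.ndarray:
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return np.array(r.json()["embedding"])

def top_k(query: str, passages: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    scored = []
    for p in passages:
        v = embed(p)
        scored.append((float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v)), p))
    return [p for _, p in sorted(scored, reverse=True)[:k]]
```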
The model features stabilized function-calling capabilities. It excels at "Terminal Bench" tasks, where it must navigate a command-line interface to solve multi-step problems. Its ability to maintain state across long sequences of tool calls makes it a primary candidate for local autonomous agents that need to interact with local file systems or APIs.
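A tool-using agent around the model boils down to a loop: ask for an action, execute it, append the result to the history, repeat. The sketch below uses a hand-rolled JSON action protocol rather than any framework's native tool calling, the tool names are examples, and `call_model` is again a hypothetical stand-in for a local chat call.

```python
# Sketch of a multi-step tool loop: the model replies with a JSON action, the
# tool result is appended to the history, and the loop repeats until it answers
# directly. This hand-rolled JSON protocol is illustrative only; agent frameworks
# handle parsing and error recovery more robustly. `call_model` is hypothetical.
import json
import subprocess
from pathlib import Path

TOOLS = {
    "run_shell": lambda arg: subprocess.run(arg, shell=True, capture_output=True,
                                            text=True).stdout,
    "read_file": lambda arg: Path(arg).read_text(),
}

def agent_loop(task: str, call_model, max_steps: int = 10) -> str:
    history = [{"role": "user", "content":
                task + '\nReply with JSON: {"tool": name, "arg": value} or {"answer": text}.'}]
    for _ in range(max_steps):
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply})
        action = json.loads(reply)
        if "answer" in action:
            return action["answer"]
        result = TOOLS[action["tool"]](action["arg"])          # execute the requested tool
        history.append({"role": "user", "content": f"Tool result:\n{result}"})
    return "step limit reached"
```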
Running a 358B parameter model locally is a significant hardware challenge. While the active parameters are low, you must still fit the total parameter weight into VRAM to avoid the massive performance hit of system RAM offloading.
The total VRAM required depends heavily on the quantization level. For the 358B weights, the quantization table above ranges from roughly 46 GB at Q2_K up to about 98 GB at FP16, with the recommended Q4_K_M build landing near 53 GB.
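A small helper makes the choice mechanical: given a VRAM budget, pick the highest-quality format that still fits. The figures are copied from the quantization table above and are estimates, so this is a convenience sketch rather than a guarantee.

```python
# Pick the highest-quality GGUF format that fits a given VRAM budget, using the
# estimates from the quantization table above (values in GB; convenience sketch only).
QUANT_VRAM_GB = {
    "Q2_K": 45.9, "Q4_K_M": 52.6, "Q5_K_M": 55.8,
    "Q6_K": 59.7, "Q8_0": 67.7, "FP16": 98.1,
}

def best_fit(vram_gb: float) -> str | None:
    fitting = [(need, fmt) for fmt, need in QUANT_VRAM_GB.items() if need <= vram_gb]
    return max(fitting)[1] if fitting else None

print(best_fit(80))  # -> Q8_0
print(best_fit(40))  # -> None: offloading or a lower-bit quant would be required
```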
The fastest way to test GLM-4.7 on your local machine is via Ollama. If you have the requisite VRAM, you can pull the model directly:
ollama run glm-4.7
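Once the model is pulled, you can also drive it programmatically. The sketch below assumes a default local Ollama server on port 11434 and the same glm-4.7 tag used above; it is a minimal example, not an official client.

```python
# Minimal chat call against a local Ollama server (default port 11434).
# Assumes the glm-4.7 tag pulled with `ollama run glm-4.7` above.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "glm-4.7",
        "messages": [
            {"role": "user",
             "content": "Summarize the tradeoffs of MoE models for local inference."}
        ],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```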
For users with limited hardware, look for GGUF quants at aggressive low-bit levels (such as IQ2_XS or Q3_K_S) to fit within smaller VRAM envelopes, though expect a noticeable drop in reasoning accuracy.
GLM-4.7 occupies the "Ultra-Large" category of open-weight models.
Choose GLM-4.7 if you are building complex, multi-step agents and have the hardware overhead to support a 200GB+ VRAM footprint. If you are limited to a single or dual GPU setup, consider a smaller dense model or a more aggressive quantization of GLM-4.7.