
An optimized 358B parameter MoE model featuring Interleaved and Preserved Thinking for stabilized multi-step task execution.
Copy and paste this command to start running the model locally:

ollama run glm-4.7

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 45.9 GB | Low |
| Q4_K_M (Recommended) | 52.6 GB | Good |
| Q5_K_M | 55.8 GB | Very Good |
| Q6_K | 59.7 GB | Excellent |
| Q8_0 | 67.7 GB | Near Perfect |
| FP16 | 98.1 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| Google Cloud TPU v5p | Google | SS | 42.3 tok/s | 52.6 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 51.3 tok/s | 52.6 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 73.4 tok/s | 52.6 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 122.4 tok/s | 52.6 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 31.2 tok/s | 52.6 GB |
| SuperMicro Super AI Station | SuperMicro | SS | 108.6 tok/s | 52.6 GB |
| Gigabyte W775-V10-L01 | Gigabyte | SS | 108.6 tok/s | 52.6 GB |
GLM-4.7, developed by Z.ai, is a flagship 358B parameter Mixture-of-Experts (MoE) model designed for high-end local orchestration and complex agentic workflows. While its total parameter count is massive, the MoE architecture activates only 32B parameters during any single forward pass, positioning it as a direct competitor to other large-scale open-weight models like DeepSeek-V3 and Llama-3-405B. It is released under the permissive MIT license, making it a viable candidate for commercial local deployments where data privacy and custom fine-tuning are priorities.
What distinguishes GLM-4.7 from its predecessors and contemporaries is its focus on "Interleaved and Preserved Thinking." This approach stabilizes multi-step task execution, allowing the model to "think before acting" when integrated into agentic frameworks. For practitioners, this translates to higher reliability in long-running tasks such as multi-file software engineering, complex mathematical reasoning, and autonomous tool use.
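The practical effect of preserving thinking is simplest to see at the conversation-state level: intermediate reasoning is kept in the running message history rather than thrown away between steps. The sketch below illustrates that general pattern only; it is not Z.ai's internal implementation, and `call_model` is a hypothetical stand-in for any local chat call.

```python
# Schematic sketch of the "preserved thinking" pattern: the model's intermediate
# reasoning is kept in the running message history instead of being discarded
# between agent steps, so later steps can build on earlier conclusions.
# This illustrates the general idea only, not Z.ai's internal implementation;
# `call_model` is a hypothetical stand-in for any local chat call.
from typing import Callable, Dict, List

Message = Dict[str, str]

def run_step(history: List[Message], user_msg: str,
             call_model: Callable[[List[Message]], str]) -> str:
    history.append({"role": "user", "content": user_msg})
    reply = call_model(history)  # reply may include an explicit reasoning segment
    # Keep the full reply, reasoning included, so the next step sees it.
    history.append({"role": "assistant", "content": reply})
    return reply

history: List[Message] = []  # persists across every step of the task
```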
GLM-4.7 uses a sparse Mixture-of-Experts (MoE) architecture. Of its 358B total parameters, only 32B are active per token during inference. This offers a specific advantage for local deployments: you get the knowledge density and reasoning depth of a 300B+ model with per-token compute, and therefore tokens-per-second throughput, closer to that of a 30B-70B dense model.
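Sparse activation is easiest to picture with a toy top-k router: only a handful of expert networks run for each token while the rest stay idle. The expert count, top-k value, and dimensions below are illustrative only and do not describe GLM-4.7's actual router or configuration.

```python
# Toy top-k MoE routing in NumPy: only k experts run per token, the rest are skipped.
# Expert count, k, and dimensions are illustrative, not GLM-4.7's real configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))            # gating network
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                                    # score every expert
    chosen = np.argsort(logits)[-top_k:]                     # indices of the k best experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    # Only the chosen experts are evaluated; this is where the compute saving comes from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,)
```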
Key technical specifications include:

- Total parameters: 358B (sparse Mixture-of-Experts)
- Active parameters per token: 32B
- Context window: 128k tokens
- License: MIT
The 128k context window is particularly robust, supporting a maximum output capacity of 128,000 tokens. This makes GLM-4.7 suitable for "Long-Context Reasoning" (LCR) tasks, such as analyzing entire codebases or dense technical documentation, without the performance degradation often seen in smaller models when the KV cache fills up.
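One practical way to exploit that window is to pack an entire small codebase into a single prompt. The sketch below uses a crude 4-characters-per-token heuristic and a Python-file filter purely as assumptions; adjust both for your tokenizer and project.

```python
# Rough sketch: pack a small codebase into one long-context prompt while staying
# under a token budget. The ~4 chars/token heuristic, the 128k budget, and the
# *.py filter are assumptions; adjust for your tokenizer and deployment.
from pathlib import Path

MAX_TOKENS = 128_000
CHARS_PER_TOKEN = 4  # crude estimate

def build_codebase_prompt(root: str, question: str) -> str:
    budget = MAX_TOKENS * CHARS_PER_TOKEN
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*.py")):
        chunk = f"\n### {path}\n{path.read_text(errors='ignore')}"
        if used + len(chunk) > budget:
            break                       # stop before overflowing the context window
        parts.append(chunk)
        used += len(chunk)
    return "".join(parts) + f"\n\nQuestion: {question}\n"
```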
GLM-4.7 is optimized for high-logic, high-precision workloads. It is not merely a chat model; it is designed to function as the "brain" of an AI agent.
The model shows significant gains in what the developers call "Vibe Coding"—the ability to generate cleaner, modern UI/UX code with accurate layouts and sizing. On the SWE-bench Verified metric, GLM-4.7 scores 73.8%, indicating a high proficiency in generating real-world software patches. It is specifically optimized for use within agent frameworks like Claude Code, Cline, and Roo Code.
With a 95.7% accuracy on AIME 2025 and an 85.7% on GPQA-Diamond, GLM-4.7 competes at the absolute frontier of mathematical and graduate-level scientific reasoning. It is an ideal choice for local RAG (Retrieval-Augmented Generation) systems in STEM fields where logical consistency is non-negotiable.
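A local RAG pipeline around GLM-4.7 only needs a retrieval step in front of the chat call. The sketch below embeds passages with a separate local embedding model served by Ollama and ranks them by cosine similarity; the embedding model name (nomic-embed-text) and endpoint usage are assumptions, and any local embedding model would work.

```python
# Minimal local retrieval sketch for a STEM RAG pipeline: embed passages with a
# local embedding model served by Ollama, rank by cosine similarity, and pass
# the top hits to GLM-4.7 as context. The model name "nomic-embed-text" is an
# example, not a requirement.
import numpy as np
import requests

def embed(text: str) -> np.ndarray:
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return np.array(r.json()["embedding"])

def top_k(query: str, passages: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    scored = []
    for p in passages:
        v = embed(p)
        scored.append((float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v)), p))
    return [p for _, p in sorted(scored, reverse=True)[:k]]
```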
The model features stabilized function-calling capabilities. It excels at "Terminal Bench" tasks, where it must navigate a command-line interface to solve multi-step problems. Its ability to maintain state across long sequences of tool calls makes it a primary candidate for local autonomous agents that need to interact with local file systems or APIs.
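A tool-using agent around the model boils down to a loop: ask for an action, execute it, append the result to the history, repeat. The sketch below uses a hand-rolled JSON action protocol rather than any framework's native tool calling, the tool names are examples, and `call_model` is again a hypothetical stand-in for a local chat call.

```python
# Sketch of a multi-step tool loop: the model replies with a JSON action, the
# tool result is appended to the history, and the loop repeats until it answers
# directly. This hand-rolled JSON protocol is illustrative only; agent frameworks
# handle parsing and error recovery more robustly. `call_model` is hypothetical.
import json
import subprocess
from pathlib import Path

TOOLS = {
    "run_shell": lambda arg: subprocess.run(arg, shell=True, capture_output=True,
                                            text=True).stdout,
    "read_file": lambda arg: Path(arg).read_text(),
}

def agent_loop(task: str, call_model, max_steps: int = 10) -> str:
    history = [{"role": "user", "content":
                task + '\nReply with JSON: {"tool": name, "arg": value} or {"answer": text}.'}]
    for _ in range(max_steps):
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply})
        action = json.loads(reply)
        if "answer" in action:
            return action["answer"]
        result = TOOLS[action["tool"]](action["arg"])          # execute the requested tool
        history.append({"role": "user", "content": f"Tool result:\n{result}"})
    return "step limit reached"
```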
Running a 358B parameter model locally is a significant hardware challenge. While the active parameters are low, you must still fit the total parameter weight into VRAM to avoid the massive performance hit of system RAM offloading.
The total VRAM required depends heavily on the quantization level. For the 358B weights, the quantization table above ranges from roughly 46 GB at Q2_K up to about 98 GB at FP16, with the recommended Q4_K_M build landing near 53 GB.
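A small helper makes the choice mechanical: given a VRAM budget, pick the highest-quality format that still fits. The figures are copied from the quantization table above and are estimates, so this is a convenience sketch rather than a guarantee.

```python
# Pick the highest-quality GGUF format that fits a given VRAM budget, using the
# estimates from the quantization table above (values in GB; convenience sketch only).
QUANT_VRAM_GB = {
    "Q2_K": 45.9, "Q4_K_M": 52.6, "Q5_K_M": 55.8,
    "Q6_K": 59.7, "Q8_0": 67.7, "FP16": 98.1,
}

def best_fit(vram_gb: float) -> str | None:
    fitting = [(need, fmt) for fmt, need in QUANT_VRAM_GB.items() if need <= vram_gb]
    return max(fitting)[1] if fitting else None

print(best_fit(80))  # -> Q8_0
print(best_fit(40))  # -> None: offloading or a lower-bit quant would be required
```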
The fastest way to test GLM-4.7 on your local machine is via Ollama. If you have the requisite VRAM, you can pull the model directly:
ollama run glm-4.7
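Once the model is pulled, you can also drive it programmatically. The sketch below assumes a default local Ollama server on port 11434 and the same glm-4.7 tag used above; it is a minimal example, not an official client.

```python
# Minimal chat call against a local Ollama server (default port 11434).
# Assumes the glm-4.7 tag pulled with `ollama run glm-4.7` above.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "glm-4.7",
        "messages": [
            {"role": "user",
             "content": "Summarize the tradeoffs of MoE models for local inference."}
        ],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```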
For users with limited hardware, look for GGUF quants at aggressive low-bit levels (such as IQ2_XS or Q3_K_S) to fit within smaller VRAM envelopes, though expect a noticeable drop in reasoning accuracy.
GLM-4.7 occupies the "Ultra-Large" category of open-weight models.
Choose GLM-4.7 if you are building complex, multi-step agents and have the hardware overhead to support a 200GB+ VRAM footprint. If you are limited to a single or dual GPU setup, consider a smaller dense model or a more aggressive quantization of GLM-4.7.