
The foundational model of the GLM-4 series, unifying reasoning, coding, and tool use in a massive 355B-parameter framework.
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 45.1 GB | Low |
| Q4_K_M (Recommended) | 51.8 GB | Good |
| Q5_K_M | 55.0 GB | Very Good |
| Q6_K | 58.9 GB | Excellent |
| Q8_0 | 66.9 GB | Near Perfect |
| FP16 | 97.3 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Quality Level | Speed | VRAM Used |
|---|---|---|---|---|
| Google Cloud TPU v5p | Google | SS | 42.9 tok/s | 51.8 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 52.0 tok/s | 51.8 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 74.6 tok/s | 51.8 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 124.3 tok/s | 51.8 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 31.7 tok/s | 51.8 GB |
| SuperMicro Super AI Station | SuperMicro | SS | 110.3 tok/s | 51.8 GB |
| Gigabyte W775-V10-L01 | Gigabyte | SS | 110.3 tok/s | 51.8 GB |
GLM-4.5 is a large-scale Mixture-of-Experts (MoE) foundation model developed by Z.ai. With a total parameter count of 355B, it is designed to compete with the industry's largest proprietary models while remaining accessible via an MIT license. Unlike dense models of similar scale, GLM-4.5 utilizes a sparse architecture that activates only 32B parameters during any single forward pass, significantly reducing the computational overhead for inference without sacrificing the reasoning depth associated with high parameter counts.
This model is specifically engineered for "agentic" workflows, focusing on multi-step reasoning, complex code generation, and tool invocation. It bridges the gap between general-purpose chat models and specialized reasoning engines, making it a primary candidate for developers building local autonomous agents or sophisticated RAG (Retrieval-Augmented Generation) pipelines.
The defining characteristic of GLM-4.5 is its MoE (Mixture-of-Experts) architecture. While the model sits at a massive 355B parameters, the 32B active parameters mean that its inference latency is more comparable to a medium-sized dense model than a monolithic 300B+ model. This efficiency makes it possible to run GLM-4.5 locally on high-end workstation hardware that would otherwise struggle with dense models of this magnitude.
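As a rough illustration of that efficiency gap, the sketch below compares per-token compute for a fully dense 355B model against GLM-4.5's 32B active parameters, using the common ~2 FLOPs-per-parameter approximation (a rule of thumb, not a measured figure for this model):

```python
# Back-of-the-envelope: per-token compute for dense vs. MoE inference.
# Parameter counts come from the text above; the 2-FLOPs-per-parameter
# approximation is a standard simplification, not a benchmark of GLM-4.5.
TOTAL_PARAMS = 355e9     # every expert must still be resident in memory
ACTIVE_PARAMS = 32e9     # parameters actually touched per forward pass

def flops_per_token(params: float) -> float:
    """Approximate forward-pass FLOPs per generated token (~2 * params)."""
    return 2 * params

dense_cost = flops_per_token(TOTAL_PARAMS)
moe_cost = flops_per_token(ACTIVE_PARAMS)
print(f"Dense 355B     : {dense_cost / 1e12:.1f} TFLOPs/token")
print(f"MoE, 32B active: {moe_cost / 1e12:.1f} TFLOPs/token")
print(f"Per-token compute ratio: {dense_cost / moe_cost:.1f}x")
```

The memory footprint, however, is still governed by the full 355B parameters, which is why the VRAM discussion below matters so much.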
The 128k context window is a critical feature for practitioners. It allows for the ingestion of entire code repositories or lengthy technical documentation, which is essential for the model’s primary use cases in coding and complex reasoning. The MIT license is a notable differentiator, providing significantly more freedom for secondary development and commercial deployment compared to the more restrictive licenses found in the Llama or Mistral families.
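A quick way to check whether a given repository actually fits inside that window is to count tokens with the model's own tokenizer. The sketch below assumes the tokenizer is published on Hugging Face under the zai-org/GLM-4.5 repo id; verify the identifier for your setup.

```python
# Hypothetical pre-flight check: does a project's source fit in the 128k window?
# The Hugging Face repo id below is an assumption -- substitute the tokenizer
# you actually deploy.
from pathlib import Path
from transformers import AutoTokenizer

CONTEXT_LIMIT = 128_000
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.5")  # assumed repo id

def repo_token_count(root: str, suffix: str = ".py") -> int:
    """Sum the tokenized length of every matching file under `root`."""
    total = 0
    for path in Path(root).rglob(f"*{suffix}"):
        total += len(tokenizer.encode(path.read_text(errors="ignore")))
    return total

tokens = repo_token_count("./my_project")
print(f"{tokens} tokens ({tokens / CONTEXT_LIMIT:.0%} of the 128k window)")
```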
GLM-4.5 is positioned as a foundation model for "ARC" (Agentic, Reasoning, and Coding). In practice, this translates to superior performance in tasks that require logical branching and precise instruction following.
The model features a native "Thinking Mode," which allows it to perform internal chain-of-thought processing before delivering a final answer. This is particularly effective for mathematical problem solving and scientific computing. On benchmarks like GPQA Diamond and MATH 500, GLM-4.5 has demonstrated performance that rivals top-tier proprietary models, making it a viable local alternative for high-stakes analytical tasks.
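If you serve the model behind an OpenAI-compatible endpoint (vLLM or SGLang, as discussed below), Thinking Mode is typically toggled through the chat template. The exact keyword argument used here is an assumption; check the model card and your server's documentation for the switch it actually exposes.

```python
# Sketch: toggling Thinking Mode through an OpenAI-compatible local server.
# The base URL, served model name, and `enable_thinking` kwarg are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="glm-4.5",  # whatever name your server registered for the weights
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},  # assumed kwarg
)
print(response.choices[0].message.content)
```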
Z.ai has optimized GLM-4.5 for tool invocation and web browsing. For developers building agents, this means the model is less likely to hallucinate function arguments and more capable of handling multi-turn tool interactions. It is natively compatible with agent frameworks like Claude Code and Roo Code.
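A minimal tool-calling round trip against a locally served GLM-4.5 might look like the sketch below; the endpoint, served model name, and the weather function are illustrative assumptions, but the request shape is the standard OpenAI tools schema that vLLM and SGLang expose.

```python
# Sketch: single-turn tool invocation via the OpenAI-compatible tools schema.
# Endpoint, model name, and the get_weather function are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.5",
    messages=[{"role": "user", "content": "Do I need an umbrella in Berlin today?"}],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))  # e.g. get_weather {'city': 'Berlin'}
```

In a real agent loop, you would execute the returned function, append its output as a `tool` message, and call the endpoint again until the model produces a final answer.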
GLM-4.5 excels in "Vibe Coding" (UI/UX generation) and core software engineering. It produces modern, clean code for web front-ends and handles complex terminal-based tasks with high accuracy. Because it was trained on a diverse multilingual corpus, it maintains high performance in both English and Chinese coding environments.
Running a 355B parameter model locally is a significant hardware undertaking. Even with the MoE architecture's efficiency in compute, the VRAM requirements are dictated by the total parameter count, as the entire model must typically reside in memory.
The "weights-in-memory" rule applies here. At 16-bit precision (FP16), the model would require over 700GB of VRAM, which is impractical for local setups. Practitioners must use quantization to bring this model within reach of workstation hardware.
The quickest way to run GLM-4.5 locally is via Ollama, which handles the GGUF quantizations and MoE routing logic automatically. For those seeking maximum throughput, vLLM or SGLang are preferred, as they support FP8 versions of the model and optimized kernels for the MoE architecture. In a multi-GPU environment, expect between 5 and 15 tokens per second depending on your quantization level and interconnect speed.
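For a quick first run, the Ollama Python client wraps everything in a single call; the model tag below is an assumption, so confirm the tag actually published for this model before pulling it.

```python
# Minimal local chat via the Ollama Python client.
# The "glm4.5" tag is an assumption -- confirm it with `ollama list` or the
# Ollama model library before use.
import ollama

response = ollama.chat(
    model="glm4.5",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    options={"num_ctx": 32768},  # shrink the context window to fit available VRAM
)
print(response["message"]["content"])
```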
When evaluating whether to run GLM-4.5 locally, it is most often compared against other high-parameter open-weight models.
Llama 3.1 405B is a dense model, meaning every parameter is active during inference. While Llama 405B may offer slightly higher general knowledge saturation, GLM-4.5's MoE architecture makes it significantly faster in terms of tokens per second on comparable hardware. Furthermore, GLM-4.5’s MIT license is less restrictive for large-scale commercial applications than Meta’s Community License.
DeepSeek-V3 is another prominent MoE model. While DeepSeek often leads in pure coding benchmarks, GLM-4.5 is frequently cited for better "agentic" alignment—specifically its ability to follow complex system prompts and handle multi-step tool use without drifting. GLM-4.5 also offers a more robust "Thinking Mode" out of the box for users who need explicit chain-of-thought reasoning.
For users who cannot meet the 200GB+ VRAM requirement, the GLM-4.5-Air variant is the logical alternative. It features 106B total parameters (12B active) and can fit into roughly 64-80 GB of VRAM at 4-bit quantization, making it runnable on a single A100 (80GB) or a dual RTX 4090 setup. While the "Air" version is more efficient, the full GLM-4.5 provides a noticeable jump in "Humanity’s Last Exam" (HLE) and other graduate-level reasoning benchmarks.