
A 355B-parameter frontier-scale model heavily optimized for polished UI generation, front-end development, and 200K-token context processing.
Copy and paste this command to start running the model locally:

```
ollama run glm-4.6:cloud
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 63.5 GB | Low |
| Q4_K_M (Recommended) | 70.3 GB | Good |
| Q5_K_M | 73.5 GB | Very Good |
| Q6_K | 77.3 GB | Excellent |
| Q8_0 | 85.3 GB | Near Perfect |
| FP16 | 115.7 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| NVIDIA B200 GPU | NVIDIA | S | 91.7 tok/s | 70.3 GB |
| SuperMicro Super AI Station | SuperMicro | S | 81.3 tok/s | 70.3 GB |
| Gigabyte W775-V10-L01 | Gigabyte | S | 81.3 tok/s | 70.3 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | S | 55.0 tok/s | 70.3 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | S | 38.4 tok/s | 70.3 GB |
| Google Cloud TPU v5p | Google | S | 31.7 tok/s | 70.3 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | A | 23.4 tok/s | 70.3 GB |
GLM-4.6 is a frontier-scale Mixture-of-Experts (MoE) model developed by Z.ai, designed to compete with top-tier proprietary models in reasoning, coding, and complex agentic workflows. With a total parameter count of 355B, it represents a significant scale-up in capability from previous iterations, specifically optimized for high-fidelity front-end development and deep reasoning tasks. Unlike dense models of similar scale, GLM-4.6 utilizes an MoE architecture that activates only 32B parameters per token, making it a viable candidate for local deployment on high-end workstation hardware or multi-GPU nodes.
Released under the MIT license, GLM-4.6 offers a permissive alternative to closed-source models like Claude 3.5 Sonnet and GPT-4o. It is engineered for practitioners who require a massive 200,000-token context window for processing entire codebases or long-form technical documentation. The model excels in "visually polished" UI generation, making it a preferred choice for developers using agentic coding tools like Cline, Roo Code, or Kilo Code who want to run their backend locally to maintain privacy and reduce API latency.
The defining technical characteristic of GLM-4.6 is its 355B Mixture-of-Experts (MoE) architecture. In this setup, the model contains 355 billion total parameters, but only 32 billion are active during any single forward pass. This provides a "best of both worlds" scenario for local practitioners: the model possesses the vast knowledge base and reasoning depth of a 300B+ parameter model, but exhibits the inference latency and compute requirements closer to a 30B-40B dense model.
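A rough back-of-envelope sketch makes the trade-off concrete: per-token compute scales with the *active* parameters, while weight memory scales with the *total*. The figures below are estimates that ignore KV cache and runtime overhead.

```python
# Back-of-envelope: why a 355B MoE behaves like a ~32B model for compute,
# but not for memory. Rough estimates only.

TOTAL_PARAMS = 355e9    # every expert must be resident in VRAM/RAM
ACTIVE_PARAMS = 32e9    # parameters actually used per forward pass

# Per-token compute scales with ACTIVE parameters (~2 FLOPs per parameter)
gflops_per_token = 2 * ACTIVE_PARAMS / 1e9
print(f"Compute per token: ~{gflops_per_token:.0f} GFLOPs")   # ~64 GFLOPs

# Weight memory scales with TOTAL parameters (16 bits per weight here)
fp16_gb = TOTAL_PARAMS * 2 / 1e9
print(f"FP16 weight memory: ~{fp16_gb:.0f} GB")               # ~710 GB
```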
Key architectural specs include:
- Total parameters: 355B (Mixture-of-Experts)
- Active parameters per token: 32B
- Context window: 200,000 tokens
- License: MIT
The expanded 200K context window is a significant upgrade over the 128K limit of GLM-4.5. For local engineers, this enables "Needle In A Haystack" retrieval across massive datasets and allows for complex RAG (Retrieval-Augmented Generation) implementations where the entire retrieved context can fit within the KV cache.
GLM-4.6 is positioned as a specialist in technical and creative execution. While it maintains generalist capabilities, its training focus has clearly shifted toward the "Agentic" era of AI.
One of the most specific claims by Z.ai is GLM-4.6’s ability to generate "visually polished" front-end pages. While many LLMs can write functional React or Tailwind code, GLM-4.6 is tuned for aesthetic coherence and modern UI/UX patterns. This makes it an ideal backend for local web development agents.
The model supports "thinking mode" for complex reasoning and is optimized for tool-calling. In benchmark tests like GPQA and AIME 2025, it shows high proficiency in graduate-level scientific reasoning and competition-level mathematics. For local practitioners, this means the model can be reliably integrated into loops where it must interact with local file systems, compilers, or web search tools.
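As an illustration of what such a tool-calling loop might look like locally, here is a minimal sketch using the `ollama` Python client (`pip install ollama`). The model tag (`glm-4.6`), the tool name, and the file path are assumptions for illustration, not part of the official release; check your local tags with `ollama list`.

```python
# Minimal sketch of a local tool-calling loop with the ollama client.
# Assumes an Ollama server is running and a GLM-4.6 tag is pulled.
import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical local tool
        "description": "Read a file from the local workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = ollama.chat(
    model="glm-4.6",  # assumed tag; adjust to your local install
    messages=[{"role": "user", "content": "Summarize src/main.py"}],
    tools=tools,
)

# If the model decided to call a tool, inspect the requested calls,
# execute them locally, and feed the results back in a follow-up message.
for call in (response.message.tool_calls or []):
    print(call.function.name, call.function.arguments)
```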
With 200K tokens, GLM-4.6 can ingest several large source files simultaneously. This is critical for:
- Reviewing or refactoring an entire codebase in a single pass
- Analyzing long-form technical documentation
- RAG pipelines where the full retrieved context fits in the KV cache (see the sketch after this list)
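A minimal sketch of packing several source files into one long-context request via Ollama's HTTP API follows. The file paths and the `glm-4.6` model tag are assumptions; it also assumes a local Ollama server on the default port.

```python
# Pack multiple source files into a single long-context prompt and
# send it to a locally running Ollama server.
import pathlib
import requests

files = ["src/app.py", "src/db.py", "src/api.py"]  # hypothetical paths
context = "\n\n".join(
    f"--- {p} ---\n{pathlib.Path(p).read_text()}" for p in files
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "glm-4.6",  # assumed tag; adjust to your local install
        "prompt": f"{context}\n\nReview these modules for concurrency bugs.",
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])
```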
Running a 355B model locally is a significant hardware challenge, even with an MoE architecture. While the compute (active parameters) is efficient, the memory (total parameters) is not. You must fit the weights for all 355B parameters into VRAM or System RAM to avoid massive performance degradation.
To run GLM-4.6, your primary bottleneck is VRAM. Because the model is 355B parameters, a 16-bit (FP16) deployment would require over 700GB of VRAM—well beyond consumer reach. Quantization is mandatory.
For most practitioners, Q4_K_M is the "gold standard" for balancing intelligence and size. However, if you are running on a consumer multi-GPU setup, EXL2 or GGUF (via llama.cpp) at 3.0-bpw to 3.5-bpw is often the sweet spot to keep the model within 144GB of VRAM (6x 3090/4090).
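The arithmetic behind these targets is simple: weight memory scales linearly with bits per weight. A quick sanity-check sketch (real GGUF/EXL2 files add per-block scales and metadata, so treat these as lower bounds):

```python
# Rough weight-memory estimates for a 355B-parameter model at
# different bit widths. Actual quantized files carry extra overhead.
TOTAL_PARAMS = 355e9

for label, bits in [("FP16", 16), ("8-bit", 8),
                    ("3.5 bpw", 3.5), ("3.0 bpw", 3.0)]:
    gb = TOTAL_PARAMS * bits / 8 / 1e9
    print(f"{label:>8}: ~{gb:,.0f} GB")

# FP16   : ~710 GB  -> the "over 700GB" figure above
# 3.0 bpw: ~133 GB  -> why 3.0-3.5 bpw targets ~144GB of VRAM (6x 24GB)
```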
On a Mac Studio M2 Ultra, you can expect roughly 5-8 tokens per second at 4-bit quantization. On a multi-GPU Linux build (8x 4090) using vLLM or Aphrodite Engine, speeds can reach 15-25 tokens per second due to the MoE efficiency and high memory bandwidth.
The quickest way to run GLM-4.6 locally is via Ollama:

```
ollama run glm4.6
```

(Note: check for specific community tags for different quantization levels.)

GLM-4.6 occupies a unique space between "medium" MoE models like Mixtral 8x7B and "massive" models like DeepSeek-V3 or Grok-1.
For developers building local-first AI agents or engineers needing a high-reasoning model that respects an MIT license, GLM-4.6 is currently one of the most capable 300B+ class models available for local deployment.