
A 355B-parameter frontier-scale model heavily optimized for polished UI generation, front-end development, and 200K-token context processing.
Copy and paste this command to start running the model locally:

```
ollama run glm-4.6:cloud
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 63.5 GB | Low |
| Q4_K_M (Recommended) | 70.3 GB | Good |
| Q5_K_M | 73.5 GB | Very Good |
| Q6_K | 77.3 GB | Excellent |
| Q8_0 | 85.3 GB | Near Perfect |
| FP16 | 115.7 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| NVIDIA B200 GPU | NVIDIA | S | 91.7 tok/s | 70.3 GB |
| SuperMicro Super AI Station | SuperMicro | S | 81.3 tok/s | 70.3 GB |
| Gigabyte W775-V10-L01 | Gigabyte | S | 81.3 tok/s | 70.3 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | S | 55.0 tok/s | 70.3 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | S | 38.4 tok/s | 70.3 GB |
| Google Cloud TPU v5p | Google | S | 31.7 tok/s | 70.3 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | A | 23.4 tok/s | 70.3 GB |
GLM-4.6 is a frontier-scale Mixture-of-Experts (MoE) model developed by Z.ai, designed to compete with top-tier proprietary models in reasoning, coding, and complex agentic workflows. With a total parameter count of 355B, it represents a significant scale-up in capability from previous iterations, specifically optimized for high-fidelity front-end development and deep reasoning tasks. Unlike dense models of similar scale, GLM-4.6 utilizes an MoE architecture that activates only 32B parameters per token, making it a viable candidate for local deployment on high-end workstation hardware or multi-GPU nodes.
Released under the MIT license, GLM-4.6 offers a permissive alternative to closed-source models like Claude 3.5 Sonnet and GPT-4o. It is engineered for practitioners who require a massive 200,000-token context window for processing entire codebases or long-form technical documentation. The model excels in "visually polished" UI generation, making it a preferred choice for developers using agentic coding tools like Cline, Roo Code, or Kilo Code who want to run their backend locally to maintain privacy and reduce API latency.
The defining technical characteristic of GLM-4.6 is its 355B Mixture-of-Experts (MoE) architecture. In this setup, the model contains 355 billion total parameters, but only 32 billion are active during any single forward pass. This provides a "best of both worlds" scenario for local practitioners: the model possesses the vast knowledge base and reasoning depth of a 300B+ parameter model, but exhibits the inference latency and compute requirements closer to a 30B-40B dense model.
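A rough back-of-envelope sketch makes the trade-off concrete: per-token compute scales with the *active* parameters, while weight memory scales with the *total*. The figures below are estimates that ignore KV cache and runtime overhead.

```python
# Back-of-envelope: why a 355B MoE behaves like a ~32B model for compute,
# but not for memory. Rough estimates only.

TOTAL_PARAMS = 355e9    # every expert must be resident in VRAM/RAM
ACTIVE_PARAMS = 32e9    # parameters actually used per forward pass

# Per-token compute scales with ACTIVE parameters (~2 FLOPs per parameter)
gflops_per_token = 2 * ACTIVE_PARAMS / 1e9
print(f"Compute per token: ~{gflops_per_token:.0f} GFLOPs")   # ~64 GFLOPs

# Weight memory scales with TOTAL parameters (16 bits per weight here)
fp16_gb = TOTAL_PARAMS * 2 / 1e9
print(f"FP16 weight memory: ~{fp16_gb:.0f} GB")               # ~710 GB
```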
Key architectural specs include:
- Total parameters: 355B (Mixture-of-Experts)
- Active parameters per token: 32B
- Context window: 200,000 tokens
- License: MIT
The expanded 200K context window is a significant upgrade over the 128K limit of GLM-4.5. For local engineers, this enables "Needle In A Haystack" retrieval across massive datasets and allows for complex RAG (Retrieval-Augmented Generation) implementations where the entire retrieved context can fit within the KV cache.
GLM-4.6 is positioned as a specialist in technical and creative execution. While it maintains generalist capabilities, its training focus has clearly shifted toward the "Agentic" era of AI.
One of the most specific claims by Z.ai is GLM-4.6’s ability to generate "visually polished" front-end pages. While many LLMs can write functional React or Tailwind code, GLM-4.6 is tuned for aesthetic coherence and modern UI/UX patterns. This makes it an ideal backend for local web development agents.
The model supports "thinking mode" for complex reasoning and is optimized for tool-calling. In benchmark tests like GPQA and AIME 2025, it shows high proficiency in graduate-level scientific reasoning and competition-level mathematics. For local practitioners, this means the model can be reliably integrated into loops where it must interact with local file systems, compilers, or web search tools.
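As an illustration of what such a tool-calling loop might look like locally, here is a minimal sketch using the `ollama` Python client (`pip install ollama`). The model tag (`glm-4.6`), the tool name, and the file path are assumptions for illustration, not part of the official release; check your local tags with `ollama list`.

```python
# Minimal sketch of a local tool-calling loop with the ollama client.
# Assumes an Ollama server is running and a GLM-4.6 tag is pulled.
import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical local tool
        "description": "Read a file from the local workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = ollama.chat(
    model="glm-4.6",  # assumed tag; adjust to your local install
    messages=[{"role": "user", "content": "Summarize src/main.py"}],
    tools=tools,
)

# If the model decided to call a tool, inspect the requested calls,
# execute them locally, and feed the results back in a follow-up message.
for call in (response.message.tool_calls or []):
    print(call.function.name, call.function.arguments)
```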
With 200K tokens, GLM-4.6 can ingest several large source files simultaneously. This is critical for:
- Reviewing or refactoring an entire codebase in a single pass
- Analyzing long-form technical documentation
- RAG pipelines where the full retrieved context fits in the KV cache (see the sketch after this list)
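A minimal sketch of packing several source files into one long-context request via Ollama's HTTP API follows. The file paths and the `glm-4.6` model tag are assumptions; it also assumes a local Ollama server on the default port.

```python
# Pack multiple source files into a single long-context prompt and
# send it to a locally running Ollama server.
import pathlib
import requests

files = ["src/app.py", "src/db.py", "src/api.py"]  # hypothetical paths
context = "\n\n".join(
    f"--- {p} ---\n{pathlib.Path(p).read_text()}" for p in files
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "glm-4.6",  # assumed tag; adjust to your local install
        "prompt": f"{context}\n\nReview these modules for concurrency bugs.",
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])
```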
Running a 355B model locally is a significant hardware challenge, even with an MoE architecture. While the compute (active parameters) is efficient, the memory (total parameters) is not. You must fit the weights for all 355B parameters into VRAM or System RAM to avoid massive performance degradation.
To run GLM-4.6, your primary bottleneck is VRAM. Because the model is 355B parameters, a 16-bit (FP16) deployment would require over 700GB of VRAM—well beyond consumer reach. Quantization is mandatory.
For most practitioners, Q4_K_M is the "gold standard" for balancing intelligence and size. However, if you are running on a consumer multi-GPU setup, EXL2 or GGUF (via llama.cpp) at 3.0-bpw to 3.5-bpw is often the sweet spot to keep the model within 144GB of VRAM (6x 3090/4090).
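The arithmetic behind these targets is simple: weight memory scales linearly with bits per weight. A quick sanity-check sketch (real GGUF/EXL2 files add per-block scales and metadata, so treat these as lower bounds):

```python
# Rough weight-memory estimates for a 355B-parameter model at
# different bit widths. Actual quantized files carry extra overhead.
TOTAL_PARAMS = 355e9

for label, bits in [("FP16", 16), ("8-bit", 8),
                    ("3.5 bpw", 3.5), ("3.0 bpw", 3.0)]:
    gb = TOTAL_PARAMS * bits / 8 / 1e9
    print(f"{label:>8}: ~{gb:,.0f} GB")

# FP16   : ~710 GB  -> the "over 700GB" figure above
# 3.0 bpw: ~133 GB  -> why 3.0-3.5 bpw targets ~144GB of VRAM (6x 24GB)
```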
On a Mac Studio M2 Ultra, you can expect roughly 5-8 tokens per second at 4-bit quantization. On a multi-GPU Linux build (8x 4090) using vLLM or Aphrodite Engine, speeds can reach 15-25 tokens per second due to the MoE efficiency and high memory bandwidth.
The quickest way to run GLM-4.6 locally is via Ollama:

```
ollama run glm4.6
```

(Note: check for specific community tags for different quantization levels.)

GLM-4.6 occupies a unique space between "medium" MoE models like Mixtral 8x7B and "massive" models like DeepSeek-V3 or Grok-1.
For developers building local-first AI agents or engineers needing a high-reasoning model that respects an MIT license, GLM-4.6 is currently one of the most capable 300B+ class models available for local deployment.