
NVIDIA's rack-scale AI supercomputer connecting 72 B200 GPUs and 36 Grace CPUs via NVLink 5 at 1.8 TB/s per GPU. The building block for frontier model training at hyperscale data centers.
The NVIDIA GB200 NVL72 Rack System represents the current ceiling of compute density for AI infrastructure. It is not a component; it is a rack-scale supercomputer designed to function as a single logical GPU. By integrating 72 B200 Blackwell GPUs and 36 Grace CPUs into a unified fabric via NVLink 5, NVIDIA has created a system specifically engineered to solve the memory wall and interconnect bottlenecks that plague trillion-parameter model training and real-time inference.
In the hierarchy of NVIDIA GPUs for AI development, the NVL72 sits at the absolute top of the enterprise tier. While a single H100 or B200 might suffice for fine-tuning or small-scale inference, the GB200 NVL72 is built for organizations deploying frontier-class models like Llama 3.1 405B or DeepSeek-V3 at massive scale. It competes primarily with custom hyperscaler silicon (like Google’s TPU v5p) and AMD’s Instinct MI325X clusters, but maintains a distinct lead in software ecosystem maturity and interconnect bandwidth.
The defining metric of the NVIDIA GB200 NVL72 Rack System for AI is its aggregate memory capacity and bandwidth. With 13,824 GB of aggregate HBM3e, the system eliminates the need, for the majority of workloads, to shard massive models across multiple physical nodes over comparatively slow InfiniBand or Ethernet links. Instead, the entire 13.8 TB pool is accessible via NVLink 5 at a staggering 1.8 TB/s per GPU.
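To see what that pool buys you, here is a back-of-envelope sketch of how much of the 13.8 TB a frontier model's weights actually consume. The parameter counts and byte widths are illustrative assumptions, not vendor-published figures.

```python
# Rough sketch: how much of the NVL72's unified HBM3e pool do model
# weights consume? Parameter counts and precisions are illustrative.

NVL72_HBM_GB = 13_824  # 72 GPUs x 192 GB HBM3e

def model_weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for weights alone, in GB (using 1 GB = 1e9 bytes for simplicity)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# Llama 3.1 405B at FP8 (1 byte/param) vs. a hypothetical 1.8T-parameter
# model at FP4 (0.5 bytes/param)
for name, params_b, bpp in [("405B @ FP8", 405, 1.0),
                            ("1.8T @ FP4", 1_800, 0.5)]:
    gb = model_weight_gb(params_b, bpp)
    print(f"{name}: {gb:,.0f} GB weights "
          f"({gb / NVL72_HBM_GB:.1%} of the 13.8 TB pool)")
```

Even a 1.8-trillion-parameter model's weights occupy well under 10% of the pool at FP4, leaving the bulk of memory for KV caches and activations.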
For practitioners, the GB200 NVL72's inference performance is driven by the transition to FP4 precision. The Blackwell architecture's second-generation Transformer Engine manages precision dynamically, and NVIDIA claims up to a 30x inference speedup over the H100 for LLM workloads. That gain is what makes the system compelling when total cost of ownership (TCO) is measured per million tokens.
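A minimal TCO sketch makes the per-million-token framing concrete. Every input below (system price, amortization window, power draw, electricity cost, aggregate throughput) is an illustrative assumption, not a measured or vendor-published figure.

```python
# Back-of-envelope cost per million tokens for a rack-scale system.
# All inputs are hypothetical assumptions for illustration only.

def cost_per_million_tokens(system_price_usd: float,
                            amort_years: float,
                            power_kw: float,
                            usd_per_kwh: float,
                            tokens_per_sec: float) -> float:
    """Amortized capex plus electricity, divided by token throughput."""
    hours = amort_years * 365 * 24
    capex_per_hour = system_price_usd / hours
    opex_per_hour = power_kw * usd_per_kwh
    tokens_per_hour = tokens_per_sec * 3600
    return (capex_per_hour + opex_per_hour) / tokens_per_hour * 1e6

# Hypothetical: $3M rack, 4-year amortization, 120 kW draw, $0.10/kWh,
# 1M aggregate tokens/sec at FP4
print(f"${cost_per_million_tokens(3e6, 4, 120, 0.10, 1e6):.3f} per 1M tokens")
```

Under these assumptions the cost lands at a few cents per million tokens; the point of the exercise is that throughput dominates the denominator, which is exactly where the claimed FP4 speedup pays off.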
The GB200 NVL72 is the definitive hardware for running trillion-plus-parameter frontier models. While local LLM enthusiasts might look at individual cards, this rack system is designed for the most demanding agentic workflows and massive-scale deployments.
While the system supports FP16 and BF16, its VRAM is best utilized at FP4 or FP8 for large language models: lower precision yields significantly higher throughput (tokens/sec) for real-time AI agents. For long-context tasks (128k+ tokens), the 13.8 TB of HBM3e accommodates massive KV caches, enabling complex multi-turn reasoning without the performance degradation seen on smaller systems.
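The KV-cache claim can be sized with simple arithmetic. The model shape below is the published Llama 3.1 405B configuration (126 layers, 8 KV heads via GQA, head dimension 128); treat the result as an estimate, not a serving-engine measurement.

```python
# Sketch of KV-cache sizing for long-context serving.
# Model dimensions are the published Llama 3.1 405B shape.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache: 2 (K and V) x layers x kv_heads x head_dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

per_seq = kv_cache_gb(n_layers=126, n_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{per_seq:.1f} GB of KV cache per 128k-token sequence at FP16")
print(f"~{13_824 * 0.5 / per_seq:,.0f} concurrent 128k-token sequences "
      f"if half the 13.8 TB pool is reserved for KV cache")
```

At roughly 66 GB of FP16 KV cache per 128k-token sequence, a single 80 GB GPU is nearly exhausted by one request, while the NVL72's pooled memory sustains on the order of a hundred such sequences concurrently.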
This is not a system for "local" deployment in the home-office sense; it is built for on-premises deployment within private enterprise data centers or sovereign AI clouds.
For those looking for NVIDIA GPUs to run AI models locally at a workstation level, the NVIDIA RTX 6000 Ada or upcoming Blackwell-based PCIe cards are more appropriate. The NVL72 is strictly for hyperscale and high-density enterprise environments.
When evaluating the NVIDIA GB200 NVL72 Rack System vs. AMD Instinct MI300X/MI325X, the primary differentiator is the interconnect. While AMD offers impressive raw VRAM and memory bandwidth per OAM module, NVIDIA’s NVLink Switch System in the NVL72 allows all 72 GPUs to communicate as if they were a single unit. This drastically reduces the "all-reduce" overhead during training and the latency during inference for MoE models.
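A quick model of ring all-reduce shows why the intra-rack link dominates training step time. The sketch below uses the per-GPU bandwidths quoted in this article, a hypothetical 50 GB gradient payload, and ignores latency and protocol overhead, so the numbers are lower bounds rather than benchmarks.

```python
# Sketch: all-reduce time over fast vs. slow links. A ring all-reduce
# moves ~2*(N-1)/N of the payload across each link. Payload size is an
# illustrative assumption; bandwidth figures are from the article.

def allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce across n_gpus."""
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic / link_gb_per_s

payload = 50.0  # GB of gradients per step (hypothetical)
nvlink5 = allreduce_seconds(payload, 72, 1800)  # NVL72 domain, NVLink 5
ib400 = allreduce_seconds(payload, 72, 50)      # 400 Gb/s InfiniBand = 50 GB/s
print(f"NVLink 5 domain: {nvlink5 * 1000:.1f} ms  vs  400G IB: {ib400 * 1000:.0f} ms")
```

The roughly 36x gap in per-step communication time is the "single logical GPU" argument in miniature: the same collective that takes tens of milliseconds inside the NVLink domain takes seconds over an inter-node fabric.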
| Feature | NVIDIA GB200 NVL72 | AMD Instinct MI325X Cluster | NVIDIA H100 (8-GPU HGX) |
| :--- | :--- | :--- | :--- |
| Total VRAM | 13,824 GB | Variable (256 GB per GPU) | 640 GB |
| Interconnect Speed | 1.8 TB/s (NVLink 5) | 896 GB/s (Infinity Fabric) | 900 GB/s (NVLink 4) |
| Best For | Trillion+ Parameter Models | High-throughput FP16/BF16 | Small-Mid Scale Training |
| Cooling | Liquid (Required) | Air/Liquid | Air/Liquid |
| Architecture | Blackwell (FP4 optimized) | CDNA 3 | Hopper |
The GB200 NVL72 is the clear choice for practitioners who cannot afford the latency penalties of inter-node InfiniBand. For AI development at the absolute frontier of what is possible in 2025, the NVL72 is the industry standard. While the roughly $3 million price tag is steep, the efficiency gains in tokens-per-watt and tokens-per-dollar for trillion-parameter models make it the most viable path for serious AI infrastructure.


