
NVIDIA's Hopper-architecture data center GPU with 80GB HBM3. The industry standard for large-scale AI training and inference, powering most of the world's frontier AI models.
The NVIDIA H100 SXM5 80GB is the definitive silicon for the current era of generative AI. Built on the Hopper (GH100) architecture and manufactured on the TSMC 4N process, this data center GPU is engineered specifically to accelerate the Transformer-based architectures that power modern LLMs. While consumer cards focus on rasterization, the H100 is built for high-throughput tensor calculations and massive memory bandwidth, making it the industry standard for both frontier model training and high-concurrency inference.
Positioned as the flagship of NVIDIA's data center GPU lineup, the H100 SXM5 is a high-density compute module designed for integration into HGX boards. It competes primarily with the AMD Instinct MI300X and NVIDIA's own Blackwell-series successors. For AI engineers and researchers, the H100 SXM5 80GB represents the most stable, well-supported, and highest-performance environment available for AI development, benefiting from a decade of CUDA optimization and the introduction of the dedicated Transformer Engine.
The H100 SXM5 is defined by its ability to move data. While its 989.4 TFLOPS of FP16 performance is impressive, the real-world bottleneck for LLM inference is often memory bandwidth. The H100 utilizes 80GB of HBM3 memory, delivering a massive 3350 GB/s of bandwidth. This allows for significantly higher tokens per second compared to the PCIe variant of the H100 or consumer-grade cards like the RTX 4090, which tops out at 1008 GB/s.
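As a back-of-the-envelope illustration of why bandwidth, not TFLOPS, usually caps decode speed: each generated token has to stream the active weights from memory, so single-stream throughput is roughly bandwidth divided by weight bytes. The sketch below assumes a ~43 GB footprint for a 4-bit 70B model; it is an estimate, not a benchmark.

```python
# Rough upper bound for single-stream (batch size 1) decode throughput:
# every generated token must stream the model weights from memory once.
H100_SXM5_BW_GBPS = 3350   # HBM3 bandwidth quoted above
RTX_4090_BW_GBPS = 1008    # GDDR6X bandwidth for comparison

def decode_tokens_per_s(weight_gb: float, bandwidth_gbps: float) -> float:
    """Bandwidth-bound ceiling: tokens/s ~= bandwidth / bytes of weights read per token."""
    return bandwidth_gbps / weight_gb

# Assumed ~43 GB of weights for a 4-bit quantized 70B model
print(decode_tokens_per_s(43, H100_SXM5_BW_GBPS))  # ~78 tok/s theoretical ceiling
print(decode_tokens_per_s(43, RTX_4090_BW_GBPS))   # ~23 tok/s theoretical ceiling
```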
Key technical specifications for AI workloads include:

- 80GB of HBM3 memory with 3350 GB/s of bandwidth
- 989.4 TFLOPS of FP16 Tensor Core performance, plus a Transformer Engine with native FP8 support
- Fourth-generation NVLink at 900 GB/s for multi-GPU scaling
- A 700W TDP in the SXM5 form factor
The 700W TDP is a critical consideration for anyone planning local LLM deployments on the H100 SXM5 80GB. Unlike PCIe cards, the SXM5 form factor requires specialized server chassis with robust cooling and power delivery systems. However, this power draw is offset by its efficiency: the H100 can deliver up to 30x the performance of the previous-generation A100 in certain Transformer-based inference tasks.
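Because that 700W budget shapes rack and cooling planning, operators typically watch draw in software as well. Here is a minimal sketch using the NVML Python bindings, assuming the nvidia-ml-py package and a recent NVIDIA driver are installed:

```python
# Query live power draw and the enforced power limit for GPU 0 via NVML.
# Requires the nvidia-ml-py package (import name: pynvml) and an NVIDIA driver.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0           # milliwatts -> watts
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # milliwatts -> watts
print(f"GPU 0: {draw_w:.0f} W of {limit_w:.0f} W limit")

pynvml.nvmlShutdown()
```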
The H100 SXM5's 80GB of VRAM supports a wide range of large language model deployment scenarios, from single-card inference to massive multi-GPU clusters. When evaluating hardware for running 70B-parameter models at FP16 or 405B-parameter models at Q4 (multi-GPU), the H100 is the baseline.
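A quick way to reason about those requirements is to estimate weight memory as parameter count times bytes per parameter, plus a margin for the KV cache and runtime buffers. The rough sketch below treats the 10% overhead and the quantization levels as assumptions, not measurements:

```python
import math

# Back-of-the-envelope VRAM estimate: weights plus a rough overhead margin
# for KV cache, activations, and runtime buffers (the 10% figure is an assumption).
def estimate_vram_gb(params_billion: float, bits_per_param: float, overhead: float = 0.10) -> float:
    weights_gb = params_billion * bits_per_param / 8  # billions of params -> GB of weights
    return weights_gb * (1 + overhead)

H100_VRAM_GB = 80

for label, params_billion, bits in [("70B @ FP16", 70, 16), ("70B @ Q4", 70, 4), ("405B @ Q4", 405, 4)]:
    need = estimate_vram_gb(params_billion, bits)
    print(f"{label}: ~{need:.0f} GB -> {math.ceil(need / H100_VRAM_GB)}x H100 80GB")
```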
On a single H100 80GB, you can comfortably run:

- 70B-class dense models (e.g., Llama 2 70B) in 4-bit quantization, at roughly 43 GB of VRAM
- Quantized mixture-of-experts models such as Mixtral 8x22B Instruct
- Small and mid-size models (up to roughly 30B parameters) at FP16, with headroom for long contexts and batching

The benchmark table at the end of this page illustrates these footprints.
For the largest frontier models, the 900 GB/s NVLink interconnect is the H100's "killer feature."
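In practice that interconnect is exercised through tensor parallelism across the eight GPUs of an HGX board. A minimal sketch using vLLM follows; the checkpoint name and parallelism degree are illustrative assumptions, not a prescribed setup:

```python
# Tensor-parallel inference across 8 NVLink-connected H100s with vLLM.
# Weights are sharded across the GPUs, and NVLink carries the per-layer all-reduces.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # illustrative large checkpoint
    tensor_parallel_size=8,                          # one shard per H100 on the HGX board
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain NVLink in one paragraph."], params)
print(outputs[0].outputs[0].text)
```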
The "sweet spot" for this hardware is often FP8 or AWQ 4-bit quantization. Using NVIDIA's TensorRT-LLM, practitioners can leverage the Transformer Engine to achieve FP8 precision, which offers nearly identical accuracy to FP16 but with a 2x boost in NVIDIA H100 SXM5 80GB tokens per second.
The H100 SXM5 is not a "hobbyist" card in the traditional sense; it is a production-grade tool for those building at scale.
When choosing the best NVIDIA GPUs for running AI models locally or in a private cloud, the H100 SXM5 is often compared to the NVIDIA A100 80GB and the AMD Instinct MI300X.
The A100 was the previous king of the data center. While it also has 80GB of VRAM, the H100 offers:

- HBM3 in place of HBM2e, lifting memory bandwidth from roughly 2 TB/s to 3.35 TB/s
- A dedicated Transformer Engine with native FP8 support, which the A100 lacks
- Roughly 3x the FP16 Tensor Core throughput
- Fourth-generation NVLink at 900 GB/s, up from 600 GB/s
The MI300X is a formidable competitor in the NVIDIA vs AMD for AI inference debate, pairing 192GB of HBM3 with even higher raw memory bandwidth; NVIDIA's counterweight remains the maturity of the CUDA and TensorRT-LLM software stack.
For practitioners looking for the best AI GPU for agent training and high-scale inference, the NVIDIA H100 SXM5 80GB remains the benchmark by which all other AI hardware is measured. Its combination of HBM3 bandwidth, Transformer Engine acceleration, and mature software stack makes it the premier choice for 2025's most demanding agentic and LLM workloads.
| Model | Developer | Parameters | Rating | Throughput | VRAM |
|---|---|---|---|---|---|
| Llama 2 70B Chat | Meta | 70B | SS | 62.1 tok/s | 43.4 GB |
| | | 70B | SS | 59.0 tok/s | 45.7 GB |
| Mixtral 8x22B Instruct | Mistral AI | 141B (39B active) | SS | 61.9 tok/s | 43.6 GB |
| Qwen3.5-397B-A17B | Alibaba Cloud (Qwen) | 397B (17B active) | SS | 58.6 tok/s | 46.0 GB |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | SS | 45.1 tok/s | 59.8 GB |
| DeepSeek-R1 | DeepSeek | 671B (37B active) | SS | 45.1 tok/s | 59.8 GB |
| DeepSeek-V3.1 | DeepSeek | 671B (37B active) | SS | 45.1 tok/s | 59.8 GB |
| DeepSeek-V3.2 | DeepSeek | 685B (37B active) | SS | 45.1 tok/s | 59.8 GB |
| Kimi K2 Instruct | Moonshot AI | 1000B (32B active) | SS | 52.0 tok/s | 51.8 GB |
| Qwen3-235B-A22B | Alibaba Cloud (Qwen) | 235B (22B active) | SS | 74.2 tok/s | 36.3 GB |
| Gemma 3 27B IT | Google | 27B | SS | 61.6 tok/s | 43.8 GB |
| Qwen3-32B | Alibaba Cloud (Qwen) | 32.8B | SS | 50.0 tok/s | 53.9 GB |
| Mistral Small 3 24B | Mistral AI | 24B | SS | 69.2 tok/s | 39.0 GB |
| LLaMA 65B | Meta | 65B | SS | 68.7 tok/s | 39.3 GB |
| Qwen3.5-122B-A10B | Alibaba Cloud (Qwen) | 122B (10B active) | SS | 98.9 tok/s | 27.3 GB |
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | SS | 237.3 tok/s | 11.4 GB |
| Mistral Large 3 675B | Mistral AI | 675B (41B active) | SS | 40.7 tok/s | 66.3 GB |
| Falcon 40B Instruct | Technology Innovation Institute | 40B | SS | 110.7 tok/s | 24.4 GB |
| Qwen3.5-9B | Alibaba Cloud (Qwen) | 9B | SS | 109.6 tok/s | 24.6 GB |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | SS | 244.9 tok/s | 11.0 GB |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | SS | 316.1 tok/s | 8.5 GB |
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | SS | 500.7 tok/s | 5.4 GB |
| Llama 2 13B Chat | Meta | 13B | AA | 318.5 tok/s | 8.5 GB |
| | | 8B | AA | 202.3 tok/s | 13.3 GB |
| | | 8B | AA | 476.1 tok/s | 5.7 GB |

