
Google's seventh-generation Tensor Processing Unit, delivering 4,614 FP8 TFLOPS and 192 GB of HBM3e per chip. Scales to 9,216-chip superpods producing 42.5 FP8 EFLOPS. Purpose-built for large-scale AI inference. Cloud-only via Google Cloud Platform.
Google TPU v7 (Ironwood) is the seventh-generation Tensor Processing Unit, built exclusively for inference at data-center scale. Unlike earlier TPUs that split their focus between training and serving, Ironwood is architected for one job: turning a 70-billion-parameter model into a low-latency API endpoint. You can't buy the chip (Google keeps every wafer in-house), but you can rent slices of 4 to 9,216 chips through Google Cloud Platform. In practice, that means 192 GB of HBM3e and 4.6 FP8 PFLOPS per socket, liquid-cooled, wired into a 3-D torus that pushes 9.6 Tbps chip-to-chip. If your bottleneck is "how many concurrent users can hit Llama 3.1 70B before P99 latency explodes," Ironwood is the closest thing to a turnkey answer.
The competitive set is thin: NVIDIA H100/H200/B100 for GPU-centric shops, AWS Inferentia2 for cost-at-all-costs, and AMD MI300X if you want 192 GB in your own server (an OAM module, not a PCIe card). Ironwood's pitch is simple: higher memory bandwidth (7.37 TB/s vs. 3.35 TB/s on the H100 SXM and 4.8 TB/s on the H200) and 2× the perf/Watt of the prior-gen Trillium, so Google can offer lower per-token cost while keeping the QoS bar high.
Per-chip numbers that matter:
- 4,614 TFLOPS of peak FP8 compute
- 192 GB of HBM3e at 7.37 TB/s
- 9.6 Tbps of inter-chip interconnect (ICI) bandwidth
- Liquid cooling as standard
Pod-scale ceiling:
- 9,216 chips per superpod
- 42.5 FP8 EFLOPS of aggregate compute
- ~1.77 PB of pooled HBM3e (9,216 × 192 GB)
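To make the scaling story concrete, here is a minimal sketch of multi-chip sharding in JAX, the framework the TPU stack is built around. The 2×2 mesh, axis names, and matrix shapes are illustrative placeholders, not Ironwood-specific values.

```python
# A minimal sketch of sharding a weight matrix across a TPU slice.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange 4 chips of a slice into a 2x2 logical mesh (illustrative).
devices = np.array(jax.devices()[:4]).reshape(2, 2)
mesh = Mesh(devices, axis_names=("data", "model"))

# Split the weight matrix column-wise across the "model" axis, so each
# chip holds half the columns in its own HBM.
w = jnp.zeros((8192, 8192), dtype=jnp.bfloat16)
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    # XLA inserts the inter-chip collectives (over the ICI links) for us.
    return jnp.dot(x, w)
```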
Quantization impact: FP8 is native; BF16 and INT8 paths are one compiler flag away. Google's XLA compiler automatically fuses MoE top-k routing into the sparse cores, so Mixtral-8×22B runs at the same per-token energy as a dense 7B model.
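As a rough illustration of the FP8 path, the JAX sketch below casts a matmul's inputs to the E4M3 FP8 dtype and accumulates in BF16. It assumes a recent JAX release that exposes float8 dtypes; real serving stacks add per-tensor scaling, which is omitted here, and the shapes are arbitrary.

```python
import jax
import jax.numpy as jnp

kx, kw = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (16, 4096), dtype=jnp.bfloat16)
w = jax.random.normal(kw, (4096, 4096), dtype=jnp.bfloat16)

@jax.jit
def fp8_matmul(x, w):
    x8 = x.astype(jnp.float8_e4m3fn)   # 1 byte per element
    w8 = w.astype(jnp.float8_e4m3fn)
    # On hardware with native FP8 units, XLA lowers this to the FP8 path.
    return jnp.dot(x8, w8, preferred_element_type=jnp.bfloat16)

y = fp8_matmul(x, w)   # FP8 compute, BF16 output
```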
VRAM rule of thumb: 1 byte per parameter in FP8, 0.5 bytes at 4-bit, 0.25 bytes at 2-bit. With 192 GB you get (weights only; KV cache and activations come on top):
- ~192B parameters in FP8
- ~384B parameters at 4-bit
- ~768B parameters at 2-bit
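The same arithmetic as a tiny Python helper, for sizing models against a chip's HBM:

```python
# Back-of-envelope weight-memory helper implementing the rule of thumb
# above. Plain arithmetic; KV cache and activations are not included.
BYTES_PER_PARAM = {"fp8": 1.0, "4-bit": 0.5, "2-bit": 0.25}
HBM_GB = 192  # per Ironwood chip

def weight_gb(params_billion: float, fmt: str) -> float:
    """GB of HBM needed for the weights alone."""
    return params_billion * BYTES_PER_PARAM[fmt]

for fmt, bpp in BYTES_PER_PARAM.items():
    print(f"{fmt}: up to ~{HBM_GB / bpp:.0f}B params per chip "
          f"(a 70B model needs {weight_gb(70, fmt):.0f} GB)")
```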
Sweet spot: 4-bit weight-only quantization on 70B-class models keeps quality within 0.3% of BF16 on MMLU while doubling throughput. Long-context jobs (≥200k tokens) stay in HBM without paging, so latency variance drops to sub-millisecond.
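For intuition, here is a generic sketch of symmetric, per-output-channel 4-bit weight-only quantization: weights round into [-7, 7] offline, then dequantize to BF16 at matmul time. This illustrates the standard technique the paragraph describes, not Google's serving implementation.

```python
import jax.numpy as jnp

def quantize_int4(w):
    # One scale per output channel; the floor avoids divide-by-zero columns.
    scale = jnp.maximum(jnp.max(jnp.abs(w), axis=0, keepdims=True) / 7.0, 1e-8)
    q = jnp.clip(jnp.round(w / scale), -7, 7).astype(jnp.int8)  # 4-bit range, int8 storage
    return q, scale

def int4_matmul(x, q, scale):
    # Dequantize on the fly; activations and accumulation stay in BF16.
    return jnp.dot(x, q.astype(jnp.bfloat16) * scale.astype(jnp.bfloat16))
```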
You choose Ironwood when your daily bill is measured in billions of tokens, not dollars per hour.
Not for you if you need on-prem or edge: Ironwood never leaves Google's liquid-cooled halls. If you want a 192 GB accelerator in your own box, look at AMD MI300X or the upcoming NVIDIA B100.
Closest alternatives:
- NVIDIA H100 80 GB SXM: the default for GPU-centric shops, with 3.35 TB/s of HBM bandwidth to Ironwood's 7.37 TB/s.
- AWS Inferentia2 (Inf2 instances): the cost-at-all-costs option if you are already inside AWS.
Bottom line: Ironwood is the fastest path to serve the largest open-weight models today, provided you’re willing to live inside Google’s cloud.
| Model | Developer | Parameters | Mode | Throughput | Memory |
|---|---|---|---|---|---|
| | | 70B | SS | 52.7 tok/s | 112.8 GB |
| Nvidia Nemotron 3 Super | NVIDIA | 120B (12B active) | SS | 57.4 tok/s | 103.5 GB |
| DeepSeek-V4-Flash | DeepSeek | 284B (13B active) | SS | 53.0 tok/s | 112.0 GB |
| Llama 4 Maverick | Meta | 400B (17B active) | SS | 40.6 tok/s | 146.4 GB |
| GLM-5 | Z.ai | 744B (40B active) | SS | 67.7 tok/s | 87.7 GB |
| GLM-5.1 | Z.ai | 744B (40B active) | SS | 67.7 tok/s | 87.7 GB |
| Kimi K2.6 | Moonshot AI | 1000B (32B active) | SS | 68.9 tok/s | 86.2 GB |
| Kimi K2 Instruct 0905 | Moonshot AI | 1000B (32B active) | SS | 70.2 tok/s | 84.6 GB |
| Kimi K2 Thinking | Moonshot AI | 1000B (32B active) | SS | 70.2 tok/s | 84.6 GB |
| Kimi K2.5 | Moonshot AI | 1000B (32B active) | SS | 70.2 tok/s | 84.6 GB |
| Falcon 180B | Technology Innovation Institute | 180B | SS | 55.1 tok/s | 107.8 GB |
| GLM-4.6 | Z.ai | 355B (32B active) | SS | 84.6 tok/s | 70.3 GB |
| Gemma 4 31B IT | Google | 31B | SS | 72.5 tok/s | 82.0 GB |
| Mistral Large 3 675B | Mistral AI | 675B (41B active) | SS | 89.7 tok/s | 66.3 GB |
| Qwen3.6-27B | Alibaba Cloud | 27B | SS | 81.6 tok/s | 72.8 GB |
| Qwen3.5-27B | Alibaba Cloud (Qwen) | 27B | SS | 81.6 tok/s | 72.8 GB |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | SS | 99.3 tok/s | 59.8 GB |
| DeepSeek-R1 | DeepSeek | 671B (37B active) | SS | 99.3 tok/s | 59.8 GB |
| DeepSeek-V3.1 | DeepSeek | 671B (37B active) | SS | 99.3 tok/s | 59.8 GB |
| DeepSeek-V3.2 | DeepSeek | 685B (37B active) | SS | 99.3 tok/s | 59.8 GB |
| GLM-4.7 | Z.ai | 358B (32B active) | SS | 112.9 tok/s | 52.6 GB |
| GLM-4.5 | Z.ai | 355B (32B active) | SS | 114.6 tok/s | 51.8 GB |
| Kimi K2 Instruct | Moonshot AI | 1000B (32B active) | SS | 114.6 tok/s | 51.8 GB |
| Qwen3.5-397B-A17B | Alibaba Cloud (Qwen) | 397B (17B active) | SS | 129.1 tok/s | 46.0 GB |