Google's high-performance TPU for large-scale training, with 2x the FLOPS and memory of v5e. Available in pods up to 8,960 chips for frontier model training.
The Google Cloud TPU v5p is Google's most powerful purpose-built AI accelerator to date. Designed for training and serving frontier-scale models, it is a significant step up from the cost-optimized v5e: where the v5e targets efficiency, the v5p is engineered for raw performance, offering twice the FLOPS and twice the memory capacity.
For engineers and researchers, the TPU v5p is the primary alternative to the NVIDIA H100/H200 ecosystem. It is not a "local" chip in the traditional desktop sense; you cannot buy one for a workstation. However, for teams building agentic workflows or deploying large-scale inference servers, it serves as the backbone for private, "local-style" deployments within the Google Cloud Platform (GCP) ecosystem. It is the premier choice for organizations that need to scale beyond single-node constraints into pods of up to 8,960 chips.
The technical profile of the TPU v5p is defined by its massive memory ceiling and interconnect throughput. For large language models, the 95GB of HBM2e per chip is the standout metric, putting it in direct competition with the NVIDIA H100 (80GB) and H200 (141GB).
In AI workloads, memory bandwidth is often the primary bottleneck for inference, specifically during the auto-regressive decoding phase (token generation). At 2,765 GB/s of HBM bandwidth, the v5p moves even the largest weights to the compute units fast enough to sustain high tokens-per-second counts. Furthermore, the 4,800 GB/s of inter-chip interconnect bandwidth is critical for frontier-scale models, where model parallelism is required to split a single model across hundreds or thousands of chips.
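To make the bandwidth-bound claim concrete, here is a back-of-the-envelope Python sketch. It assumes the simple roofline model in which each decoded token streams the full weights from HBM once, and it ignores KV-cache traffic, batching, and compute overlap:

```python
HBM_BANDWIDTH_GBPS = 2765  # TPU v5p HBM bandwidth, GB/s (from the spec above)

def decode_toks_per_sec_ceiling(params_billions: float, bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed: bandwidth / weight bytes."""
    weight_gb = params_billions * bytes_per_param
    return HBM_BANDWIDTH_GBPS / weight_gb

print(decode_toks_per_sec_ceiling(70, 2))  # 70B in BF16: ~19.8 tok/s
print(decode_toks_per_sec_ceiling(70, 1))  # 70B in INT8: ~39.5 tok/s
```

This ceiling applies per stream; real serving stacks batch many requests, so aggregate throughput is far higher.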
The 95GB VRAM capacity changes the math for model deployment. While a 95GB accelerator is often considered the "sweet spot" for 70B-parameter models, the TPU v5p goes much further by scaling out over its high-speed pod interconnects.
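A quick capacity check makes the "sweet spot" claim concrete; the 10 GB overhead for activations and runtime buffers below is an illustrative assumption, not a measured figure:

```python
HBM_GB = 95  # per-chip HBM on TPU v5p

def fits_on_one_chip(params_billions: float, bytes_per_param: float,
                     overhead_gb: float = 10.0) -> bool:
    """Do the weights plus a rough overhead fit in one chip's HBM?"""
    return params_billions * bytes_per_param + overhead_gb <= HBM_GB

print(fits_on_one_chip(70, 2))    # BF16: 140 GB of weights, needs multiple chips
print(fits_on_one_chip(70, 1))    # INT8: 70 GB, fits with room to spare
print(fits_on_one_chip(70, 0.5))  # 4-bit: 35 GB, ample headroom for KV cache
```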
While many practitioners reach for 4-bit quantization (GGUF/EXL2) on local hardware, the TPU v5p is designed for higher precision: most users will run models in BF16 or INT8. Its 918 TOPS of INT8 performance allows for large throughput gains when serving quantized weights in production. For long-context tasks (128k+ tokens), the 95GB of HBM leaves room for larger KV caches, reducing the need for aggressive quantization that might degrade model "reasoning" capabilities.
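Rough KV-cache arithmetic shows why that headroom matters at 128k context. The model shape below (80 layers, 8 grouped-query KV heads, head dimension 128, roughly a 70B-class configuration) is assumed for illustration:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: a key and a value vector per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# One 131,072-token sequence in BF16 for the assumed GQA shape:
print(kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=131072))  # ~42.9 GB
```

Alongside 4-bit weights (~35 GB), that cache fits within a single chip's 95 GB; with INT8 weights (~70 GB), the same sequence would already force a multi-chip slice.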
The Google Cloud TPU v5p is not for "local LLM" hobbyists running a single workstation in a home office. It is designed for TPU-based AI development at the enterprise and research level.
For teams running frontier-scale models across pods, the v5p is the gold standard. Agentic workflows often require multiple model calls in parallel (planning, tool use, reflection). The v5p's ability to be partitioned into "slices" allows teams to run several high-performance models on a single interconnect fabric, minimizing the latency between agent steps.
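Slice provisioning itself happens at the GCP level, but within a slice the same partitioning idea shows up in JAX. A minimal sketch, assuming a hypothetical 8-device host and arbitrary 2x2 mesh shapes, of dedicating device subsets to different agent-stage models:

```python
import numpy as np
import jax
from jax.sharding import Mesh

devices = jax.devices()  # all TPU devices visible to this process

# Hypothetical split: half the devices for the planner model, half for tool use,
# so both stages stay resident on the same interconnect fabric.
planner_mesh = Mesh(np.array(devices[:4]).reshape(2, 2), axis_names=("data", "model"))
tool_mesh = Mesh(np.array(devices[4:8]).reshape(2, 2), axis_names=("data", "model"))
```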
If your application requires serving thousands of users simultaneously, the v5p's inference performance scales better than almost any other hardware. Optical Circuit Switching (OCS) lets you reconfigure pod topology on the fly, optimizing for either latency (small batches) or throughput (large batches).
While inference is a strong suit, the "p" in v5p stands for performance, and training is where it shines. If you are fine-tuning a Llama 3.1 70B model on a proprietary dataset, the v5p provides a more seamless scaling path than traditional GPU clusters, thanks to the integrated XLA (Accelerated Linear Algebra) compiler that optimizes graph execution.
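The XLA advantage is easiest to see in JAX, where a whole training step is traced once and compiled into a single fused TPU program. A minimal sketch with a toy linear model and synthetic data, not a Llama fine-tuning recipe:

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@jax.jit  # XLA compiles the full step (grad + update) into one fused program
def train_step(params, x, y, lr=1e-3):
    grads = jax.grad(loss_fn)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.zeros((512, 1)), "b": jnp.zeros((1,))}
x, y = jnp.ones((32, 512)), jnp.ones((32, 1))
params = train_step(params, x, y)  # first call compiles; later calls reuse the binary
```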
Choosing hardware for AI agents in 2025 largely comes down to a decision between the NVIDIA ecosystem and Google's TPU ecosystem.
For practitioners who want to run AI models "locally" within a cloud-based VPC on Google TPUs, the v5p is the definitive choice for 2025. It bridges the gap between experimental development and massive-scale production deployment, providing the memory capacity and interconnect speed necessary for the next generation of agentic AI.
| Model | Developer | Parameters | Tier | Throughput | Memory |
|---|---|---|---|---|---|
| Kimi K2 Instruct | Moonshot AI | 1000B (32B active) | SS | 42.9 tok/s | 51.8 GB |
| Qwen3.5-397B-A17B | Alibaba Cloud (Qwen) | 397B (17B active) | SS | 48.4 tok/s | 46.0 GB |
| | | 70B | SS | 48.7 tok/s | 45.7 GB |
| Llama 2 70B Chat | Meta | 70B | SS | 51.3 tok/s | 43.4 GB |
| Mixtral 8x22B Instruct | Mistral AI | 141B (39B active) | SS | 51.1 tok/s | 43.6 GB |
| Qwen3-32B | Alibaba Cloud (Qwen) | 32.8B | SS | 41.3 tok/s | 53.9 GB |
| Gemma 3 27B IT | Google | 27B | SS | 50.8 tok/s | 43.8 GB |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | SS | 37.2 tok/s | 59.8 GB |
| DeepSeek-R1 | DeepSeek | 671B (37B active) | SS | 37.2 tok/s | 59.8 GB |
| DeepSeek-V3.1 | DeepSeek | 671B (37B active) | SS | 37.2 tok/s | 59.8 GB |
| DeepSeek-V3.2 | DeepSeek | 685B (37B active) | SS | 37.2 tok/s | 59.8 GB |
| Qwen3-235B-A22B | Alibaba Cloud (Qwen) | 235B (22B active) | SS | 61.3 tok/s | 36.3 GB |
| Mistral Small 3 24B | Mistral AI | 24B | SS | 57.1 tok/s | 39.0 GB |
| LLaMA 65B | Meta | 65B | SS | 56.7 tok/s | 39.3 GB |
| Qwen3.5-122B-A10B | Alibaba Cloud (Qwen) | 122B (10B active) | SS | 81.6 tok/s | 27.3 GB |
| Mistral Large 3 675B | Mistral AI | 675B (41B active) | SS | 33.6 tok/s | 66.3 GB |
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | SS | 195.9 tok/s | 11.4 GB |
| Qwen3.5-27B | Alibaba Cloud (Qwen) | 27B | SS | 30.6 tok/s | 72.8 GB |
| Falcon 40B Instruct | Technology Innovation Institute | 40B | SS | 91.4 tok/s | 24.4 GB |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | SS | 202.1 tok/s | 11.0 GB |
| Qwen3.5-9B | Alibaba Cloud (Qwen) | 9B | SS | 90.5 tok/s | 24.6 GB |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | SS | 260.9 tok/s | 8.5 GB |
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | SS | 413.3 tok/s | 5.4 GB |
| Llama 2 13B Chat | Meta | 13B | AA | 262.9 tok/s | 8.5 GB |
| | | 8B | AA | 167.0 tok/s | 13.3 GB |