Google's latest-generation TPU, codenamed Trillium, offers significant improvements in performance per watt over the v5e and is designed for both training and inference at scale.
The Google Cloud TPU v6e, codenamed Trillium, represents Google’s sixth-generation custom AI accelerator. Unlike consumer-grade GPUs, Trillium is a purpose-built ASIC designed specifically to accelerate the linear algebra operations that define modern transformer architectures. Positioned as a direct competitor to NVIDIA’s H100 and Blackwell architectures in the data center, the v6e is engineered to handle the massive compute requirements of the latest frontier models.
While practitioners often look for the best hardware for local AI agents in 2025, the TPU v6e is a cloud-native resource that functions as a "virtual local" environment through Google Cloud’s Vertex AI and GKE. It is built for teams that have outgrown single-node workstations and require high-throughput training and inference for Large Language Models (LLMs). With a focus on performance-per-watt and massive interconnectivity, Trillium is designed to scale from a single chip to 256-chip pods, making it a primary choice for enterprise-grade AI development.
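To make the single-chip-to-pod scaling concrete, here is a minimal JAX sketch that enumerates the TPU chips visible to a host and shards a matrix multiply across them. It assumes a Cloud TPU v6e VM with the `jax[tpu]` package installed; the mesh shape and array sizes are illustrative, not a fixed v6e topology.

```python
# Enumerate TPU chips and shard a matmul across them with a 1-D device mesh.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = jax.devices()          # one entry per TPU chip visible to this host
print(f"Visible TPU chips: {len(devices)}")

# Build a 1-D mesh over all chips and shard the batch dimension across it.
mesh = Mesh(np.array(devices), axis_names=("data",))
sharding = NamedSharding(mesh, P("data", None))

x = jax.device_put(jnp.ones((len(devices) * 128, 4096)), sharding)
w = jnp.ones((4096, 4096))       # weights stay replicated on every chip

y = jnp.dot(x, w)                # XLA partitions the matmul across the mesh
print(y.shape)
```

The same mesh abstraction extends from a single host to a full 256-chip pod slice; only the device count changes, not the program.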
The technical leap from the previous v5e generation is substantial. The Google Cloud TPU v6e (Trillium) AI inference performance is driven by a roughly 4.7x increase in peak compute performance per chip compared to its predecessor. For engineers, this translates to significantly lower latency during the prefill stage of LLM inference and higher throughput during decoding.
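The prefill/decode distinction matters because the two phases stress the chip differently, and a toy sketch shows the shape of each workload: prefill is one large batched matmul (compute-bound), while decode is many small matmuls, one per generated token (bandwidth-bound). The sizes and timings below are illustrative only, not a v6e benchmark.

```python
# Toy illustration of prefill (one big matmul) vs. decode (many small ones).
import time
import jax
import jax.numpy as jnp

w = jnp.ones((4096, 4096), dtype=jnp.bfloat16)
matmul = jax.jit(lambda a, b: jnp.dot(a, b))

prompt = jnp.ones((2048, 4096), dtype=jnp.bfloat16)   # 2048 prompt tokens
token = jnp.ones((1, 4096), dtype=jnp.bfloat16)       # one decode step

matmul(prompt, w).block_until_ready()   # warm up: trigger compilation
start = time.perf_counter()
matmul(prompt, w).block_until_ready()   # prefill: compute-bound
prefill = time.perf_counter() - start

matmul(token, w).block_until_ready()    # warm up the decode-shaped program
start = time.perf_counter()
for _ in range(64):                     # decode: sequential, bandwidth-bound
    matmul(token, w).block_until_ready()
decode = time.perf_counter() - start
print(f"prefill {prefill*1e3:.1f} ms, 64 decode steps {decode*1e3:.1f} ms")
```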
The Google Cloud TPU v6e (Trillium) memory subsystem is built for high-density large language model deployments: each chip pairs its compute with high-bandwidth memory (HBM) rather than the GDDR VRAM found on consumer GPUs. Because TPUs rely on the XLA (Accelerated Linear Algebra) compiler, they are exceptionally efficient at running models written in JAX, PyTorch, or TensorFlow.
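To see where XLA enters the picture, here is a minimal sketch using `jax.jit`: JAX traces the Python function once and hands the whole graph to XLA, which fuses the ops into a single TPU program. The shapes and the attention-style computation are illustrative.

```python
# `jax.jit` hands the traced graph to XLA, which fuses it into one TPU kernel.
import jax
import jax.numpy as jnp

@jax.jit
def attention_scores(q, k):
    # Scaled dot-product scores followed by softmax: the kind of fused
    # matmul-plus-elementwise program XLA emits for the TPU matrix units.
    scores = jnp.einsum("qd,kd->qk", q, k) / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1)

q = jnp.ones((128, 64))
k = jnp.ones((256, 64))
probs = attention_scores(q, k)   # first call compiles; later calls reuse it
print(probs.shape)               # (128, 256)
```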
While GPUs often rely on 4-bit or 8-bit quantization (GGUF/EXL2) to fit models into VRAM, TPUs are optimized for BF16 (bfloat16). The v6e hardware runs BF16 at native speeds, providing a better quality-to-speed tradeoff than heavily compressed models on consumer hardware. For teams seeking the best AI chip for local-style deployment via cloud-based API endpoints, the v6e delivers superior precision-weighted throughput.
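A minimal sketch of the BF16 path, assuming JAX on a TPU VM: cast weights and activations to bfloat16 and let the matrix units run at full rate, accumulating in FP32. The layer dimensions are illustrative.

```python
# BF16 matmul with FP32 accumulation, the usual TPU-friendly recipe.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 4096), dtype=jnp.bfloat16)      # activations
w = jax.random.normal(key, (4096, 11008), dtype=jnp.bfloat16)  # weights

@jax.jit
def ffn_layer(x, w):
    # Accumulate the BF16 product in float32 to preserve accuracy.
    y = jnp.dot(x, w, preferred_element_type=jnp.float32)
    return jax.nn.gelu(y).astype(jnp.bfloat16)

out = ffn_layer(x, w)
print(out.dtype)   # bfloat16
```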
The TPU v6e is not a general-purpose processor; it is a laser-focused AI accelerator.
When evaluating Google TPUs for AI development, the primary comparison is usually with NVIDIA’s data center offerings.
For practitioners deciding on the best Google TPUs for running AI models, the v6e (Trillium) is the current price-performance leader for enterprise-scale inference. While it lacks the plug-and-play local accessibility of an RTX 3090 or 4090, its ability to scale to 256 chips per pod and its 1.6 TB/s of per-chip HBM bandwidth make it the superior choice for professional AI deployment and high-load agentic systems.
| Model | Developer | Params | Tier | Throughput (tok/s) | Memory (GB) |
|---|---|---|---|---|---|
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | SS | 116.2 | 11.4 |
| Falcon 40B Instruct | Technology Innovation Institute | 40B | SS | 54.2 | 24.4 |
| Qwen3.5-9B | Alibaba Cloud (Qwen) | 9B | SS | 53.7 | 24.6 |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | SS | 119.9 | 11.0 |
| | | 8B | SS | 99.0 | 13.3 |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | SS | 154.7 | 8.5 |
| Llama 2 13B Chat | Meta | 13B | SS | 155.9 | 8.5 |
| Qwen3.5-122B-A10B | Alibaba Cloud (Qwen) | 122B (10B active) | SS | 48.4 | 27.3 |
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | SS | 245.1 | 5.4 |
| | | 8B | AA | 233.1 | 5.7 |
| Gemma 4 E4B IT | Google | 4B | AA | 190.9 | 6.9 |
| Gemma 3 4B IT | Google | 4B | AA | 190.9 | 6.9 |
| Mistral 7B Instruct | Mistral AI | 7B | AA | 206.4 | 6.4 |
| Llama 2 7B Chat | Meta | 7B | AA | 275.7 | 4.8 |
| Gemma 4 E2B IT | Google | 2B | AA | 356.0 | 3.7 |
| Mistral Small 3 24B | Mistral AI | 24B | FF | 33.9 | 39.0 |
| Gemma 3 27B IT | Google | 27B | FF | 30.1 | 43.8 |
| Qwen3.5-27B | Alibaba Cloud (Qwen) | 27B | FF | 18.1 | 72.8 |
| Gemma 4 31B IT | Google | 31B | FF | 16.1 | 82.0 |
| Qwen3-32B | Alibaba Cloud (Qwen) | 32.8B | FF | 24.5 | 53.9 |
| LLaMA 65B | Meta | 65B | FF | 33.6 | 39.3 |
| Llama 2 70B Chat | Meta | 70B | FF | 30.4 | 43.4 |
| | | 70B | FF | 28.9 | 45.7 |
| | | 70B | FF | 11.7 | 112.8 |