Google's latest-generation TPU, codenamed Trillium, offers significant improvements in performance per watt over the v5e and is designed for both training and inference at scale.
The Google Cloud TPU v6e, codenamed Trillium, represents Google’s sixth-generation custom AI accelerator. Unlike consumer-grade GPUs, Trillium is a purpose-built ASIC designed specifically to accelerate the linear algebra operations that define modern transformer architectures. Positioned as a direct competitor to NVIDIA’s H100 and Blackwell architectures in the data center, the v6e is engineered to handle the massive compute requirements of the latest frontier models.
While practitioners often look for the best hardware for local AI agents in 2025, the TPU v6e is a cloud-native resource that functions as a "virtual local" environment through Google Cloud’s Vertex AI and GKE. It is built for teams that have outgrown single-node workstations and require high-throughput training and inference for Large Language Models (LLMs). With a focus on performance-per-watt and massive interconnectivity, Trillium is designed to scale from a single chip to 256-chip pods, making it a primary choice for enterprise-grade AI development.
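To make the single-chip-to-pod scaling concrete, here is a minimal JAX sketch that enumerates the TPU chips visible to a host and shards a matrix multiply across them. It assumes a Cloud TPU v6e VM with the `jax[tpu]` package installed; the mesh shape and array sizes are illustrative, not a fixed v6e topology.

```python
# Enumerate TPU chips and shard a matmul across them with a 1-D device mesh.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = jax.devices()          # one entry per TPU chip visible to this host
print(f"Visible TPU chips: {len(devices)}")

# Build a 1-D mesh over all chips and shard the batch dimension across it.
mesh = Mesh(np.array(devices), axis_names=("data",))
sharding = NamedSharding(mesh, P("data", None))

x = jax.device_put(jnp.ones((len(devices) * 128, 4096)), sharding)
w = jnp.ones((4096, 4096))       # weights stay replicated on every chip

y = jnp.dot(x, w)                # XLA partitions the matmul across the mesh
print(y.shape)
```

The same mesh abstraction extends from a single host to a full 256-chip pod slice; only the device count changes, not the program.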
The technical leap from the previous v5e generation is substantial. The Google Cloud TPU v6e (Trillium) AI inference performance is driven by a roughly 4.7x increase in peak compute performance per chip compared to its predecessor. For engineers, this translates to significantly lower latency during the prefill stage of LLM inference and higher throughput during decoding.
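The prefill/decode distinction matters because the two phases stress the chip differently, and a toy sketch shows the shape of each workload: prefill is one large batched matmul (compute-bound), while decode is many small matmuls, one per generated token (bandwidth-bound). The sizes and timings below are illustrative only, not a v6e benchmark.

```python
# Toy illustration of prefill (one big matmul) vs. decode (many small ones).
import time
import jax
import jax.numpy as jnp

w = jnp.ones((4096, 4096), dtype=jnp.bfloat16)
matmul = jax.jit(lambda a, b: jnp.dot(a, b))

prompt = jnp.ones((2048, 4096), dtype=jnp.bfloat16)   # 2048 prompt tokens
token = jnp.ones((1, 4096), dtype=jnp.bfloat16)       # one decode step

matmul(prompt, w).block_until_ready()   # warm up: trigger compilation
start = time.perf_counter()
matmul(prompt, w).block_until_ready()   # prefill: compute-bound
prefill = time.perf_counter() - start

matmul(token, w).block_until_ready()    # warm up the decode-shaped program
start = time.perf_counter()
for _ in range(64):                     # decode: sequential, bandwidth-bound
    matmul(token, w).block_until_ready()
decode = time.perf_counter() - start
print(f"prefill {prefill*1e3:.1f} ms, 64 decode steps {decode*1e3:.1f} ms")
```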
The Google Cloud TPU v6e (Trillium) memory subsystem is built for high-density large language model deployments: each chip pairs its compute with high-bandwidth memory (HBM) rather than the GDDR VRAM found on consumer GPUs. Because TPUs rely on the XLA (Accelerated Linear Algebra) compiler, they are exceptionally efficient at running models written in JAX, PyTorch, or TensorFlow.
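To see where XLA enters the picture, here is a minimal sketch using `jax.jit`: JAX traces the Python function once and hands the whole graph to XLA, which fuses the ops into a single TPU program. The shapes and the attention-style computation are illustrative.

```python
# `jax.jit` hands the traced graph to XLA, which fuses it into one TPU kernel.
import jax
import jax.numpy as jnp

@jax.jit
def attention_scores(q, k):
    # Scaled dot-product scores followed by softmax: the kind of fused
    # matmul-plus-elementwise program XLA emits for the TPU matrix units.
    scores = jnp.einsum("qd,kd->qk", q, k) / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1)

q = jnp.ones((128, 64))
k = jnp.ones((256, 64))
probs = attention_scores(q, k)   # first call compiles; later calls reuse it
print(probs.shape)               # (128, 256)
```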
While GPUs often rely on 4-bit or 8-bit quantization (GGUF/EXL2) to fit models into VRAM, TPUs are optimized for BF16 (bfloat16). The v6e hardware runs BF16 at native speeds, providing a better quality-to-speed tradeoff than heavily compressed models on consumer hardware. For teams seeking the best AI chip for local-style deployment via cloud-based API endpoints, the v6e delivers superior precision-weighted throughput.
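A minimal sketch of the BF16 path, assuming JAX on a TPU VM: cast weights and activations to bfloat16 and let the matrix units run at full rate, accumulating in FP32. The layer dimensions are illustrative.

```python
# BF16 matmul with FP32 accumulation, the usual TPU-friendly recipe.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 4096), dtype=jnp.bfloat16)      # activations
w = jax.random.normal(key, (4096, 11008), dtype=jnp.bfloat16)  # weights

@jax.jit
def ffn_layer(x, w):
    # Accumulate the BF16 product in float32 to preserve accuracy.
    y = jnp.dot(x, w, preferred_element_type=jnp.float32)
    return jax.nn.gelu(y).astype(jnp.bfloat16)

out = ffn_layer(x, w)
print(out.dtype)   # bfloat16
```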
The TPU v6e is not a general-purpose processor; it is a laser-focused AI accelerator.
When evaluating Google TPUs for AI development, the primary comparison is usually with NVIDIA’s data center offerings.
For practitioners deciding on the best Google TPUs for running AI models, the v6e (Trillium) is the current price-performance leader for enterprise-scale inference. While it lacks the plug-and-play local accessibility of an RTX 3090 or 4090, its ability to scale to 256 chips per pod and its 1.6 TB/s of per-chip HBM bandwidth make it the superior choice for professional AI deployment and high-load agentic systems.
| Model | Developer | Params | Tier | Throughput (tok/s) | Memory (GB) |
|---|---|---|---|---|---|
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | SS | 116.2 | 11.4 |
| Falcon 40B Instruct | Technology Innovation Institute | 40B | SS | 54.2 | 24.4 |
| Qwen3.5-9B | Alibaba Cloud (Qwen) | 9B | SS | 53.7 | 24.6 |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | SS | 119.9 | 11.0 |
| | | 8B | SS | 99.0 | 13.3 |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | SS | 154.7 | 8.5 |
| Llama 2 13B Chat | Meta | 13B | SS | 155.9 | 8.5 |
| Qwen3.5-122B-A10B | Alibaba Cloud (Qwen) | 122B (10B active) | SS | 48.4 | 27.3 |
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | SS | 245.1 | 5.4 |
| | | 8B | AA | 233.1 | 5.7 |
| Gemma 4 E4B IT | Google | 4B | AA | 190.9 | 6.9 |
| Gemma 3 4B IT | Google | 4B | AA | 190.9 | 6.9 |
| Mistral 7B Instruct | Mistral AI | 7B | AA | 206.4 | 6.4 |
| Llama 2 7B Chat | Meta | 7B | AA | 275.7 | 4.8 |
| Gemma 4 E2B IT | Google | 2B | AA | 356.0 | 3.7 |
| Mistral Small 3 24B | Mistral AI | 24B | FF | 33.9 | 39.0 |
| Gemma 3 27B IT | Google | 27B | FF | 30.1 | 43.8 |
| Qwen3.5-27B | Alibaba Cloud (Qwen) | 27B | FF | 18.1 | 72.8 |
| Gemma 4 31B IT | Google | 31B | FF | 16.1 | 82.0 |
| Qwen3-32B | Alibaba Cloud (Qwen) | 32.8B | FF | 24.5 | 53.9 |
| LLaMA 65B | Meta | 65B | FF | 33.6 | 39.3 |
| Llama 2 70B Chat | Meta | 70B | FF | 30.4 | 43.4 |
| | | 70B | FF | 28.9 | 45.7 |
| | | 70B | FF | 11.7 | 112.8 |