AI Accelerator Specifications
Complete specifications for custom AI accelerators: Google TPU v5p/v5e/v4, AWS Trainium2/Inferentia2, Cerebras WSE-3, and Groq LPU.
AI Accelerator Comparison: Which Is Best for Your Workload?
| Accelerator | Memory | Bandwidth | Peak Performance | Architecture | Best For |
|---|---|---|---|---|---|
| Google TPU v5p | 95 GB HBM2e | 2.76 TB/s | 459 TFLOPS BF16 | SparseCore + MXU | Large-scale training |
| Google TPU v5e | 16 GB HBM2e | 819 GB/s | — | MXU | Cost-efficient inference |
| AWS Trainium2 | 96 GB HBM | 2.4 TB/s | ~500 TFLOPS BF16 | NeuronCore-v3 | AWS-native training |
| AWS Inferentia2 | 32 GB HBM2e | 2.4 TB/s | — | NeuronCore-v2 | AWS-native inference |
| Cerebras WSE-3 | 44 GB SRAM | 21 PB/s on-chip | — | Wafer-scale | Ultra-large model training |
| Groq LPU | 230 MB SRAM | 80 TB/s on-chip | — | TSP | Low-latency inference |
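A first practical filter when choosing from the table above is whether the model's weights even fit in a chip's memory. The sketch below is a rough back-of-envelope, assuming BF16 weights (2 bytes per parameter) and ignoring activations, optimizer state, and KV cache, which add substantially more in practice; the memory figures come from the comparison table.

```python
# Rough sketch: minimum chips needed just to hold model weights in BF16.
# Ignores activations, optimizer state, and KV cache (real deployments
# need more). Memory capacities taken from the comparison table above.

ACCEL_MEMORY_GB = {
    "TPU v5p": 95,
    "TPU v5e": 16,
    "Trainium2": 96,
    "Inferentia2": 32,
}

def chips_for_weights(params_billions, memory_gb, bytes_per_param=2):
    """Minimum chips to hold the raw weights (ceiling division)."""
    weights_gb = params_billions * bytes_per_param  # 1B params * 2 B = 2 GB
    return int(-(-weights_gb // memory_gb))

for name, mem in ACCEL_MEMORY_GB.items():
    print(f"70B weights on {name}: {chips_for_weights(70, mem)} chips")
```

For a 70B-parameter model (140 GB of BF16 weights), this lands at 2 TPU v5p or Trainium2 chips as the bare minimum, versus 9 TPU v5e chips.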
Google TPU Specifications
What are Google TPU v5p specifications?
TPU v5p delivers 459 TFLOPS BF16 per chip, with pods scaling to 8,960 chips over Google's ICI (inter-chip interconnect). Available exclusively on Google Cloud.
| Specification | TPU v5p | TPU v5e | TPU v4 |
|---|---|---|---|
| Memory per Chip | 95 GB HBM2e | 16 GB HBM2e | 32 GB HBM2e |
| Memory Bandwidth | 2.76 TB/s | 819 GB/s | 1.2 TB/s |
| Peak BF16 | 459 TFLOPS | — | 275 TFLOPS |
| Max Pod Size | 8,960 chips | 256 chips | 4,096 chips |
| Availability | Google Cloud only | Google Cloud only | Google Cloud only |
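The per-chip and pod-size figures above combine into an aggregate peak number. The sketch below is simple arithmetic on the table's values; peak TFLOPS is a theoretical ceiling, not sustained training throughput.

```python
# Back-of-envelope: aggregate peak BF16 compute of a full TPU v5p pod,
# using the per-chip figure and max pod size from the table above.
chips = 8960              # max v5p pod size
tflops_per_chip = 459     # peak BF16 per chip

pod_pflops = chips * tflops_per_chip / 1000  # TFLOPS -> PFLOPS
print(f"Full v5p pod peak BF16: {pod_pflops:,.0f} PFLOPS (~4.1 EFLOPS)")
```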
AWS Custom Silicon: Trainium2 and Inferentia2
What are AWS Trainium2 specifications?
AWS Trainium2 is typically 20–40% cheaper than comparable P5 (H100) instances for equivalent training throughput. Requires the AWS Neuron SDK.
| Specification | Trainium2 | Inferentia2 |
|---|---|---|
| Compute Cores | NeuronCore-v3 | 2x NeuronCore-v2 |
| Memory | 96 GB HBM | 32 GB HBM2e |
| Memory Bandwidth | 2.4 TB/s | 2.4 TB/s |
| Supported Precisions | FP32, BF16, FP8 | FP32, BF16, FP8 |
| Instance Type | trn2.48xlarge (16 chips) | inf2.48xlarge (12 chips) |
Source: AWS Trainium product page
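Since capacity is sold per instance rather than per chip, the aggregate accelerator memory of each instance type is the number that matters for sizing. The sketch below multiplies out the chip counts and per-chip memory from the table above.

```python
# Aggregate accelerator memory per instance, from the table above:
# trn2.48xlarge carries 16 Trainium2 chips, inf2.48xlarge 12 Inferentia2 chips.
instances = {
    "trn2.48xlarge": (16, 96),  # (chips, GB HBM per chip)
    "inf2.48xlarge": (12, 32),
}

for name, (chips, gb_per_chip) in instances.items():
    print(f"{name}: {chips * gb_per_chip} GB aggregate accelerator memory")
```

That gives 1,536 GB on a trn2.48xlarge and 384 GB on an inf2.48xlarge, before any overhead for activations or KV cache.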
Cerebras WSE-3: Wafer-Scale AI Training
Cerebras WSE-3: 4 trillion transistors, 900,000 AI cores, 21 PB/s on-chip bandwidth. It keeps the entire model in on-chip SRAM, eliminating the HBM bandwidth bottleneck.
| Specification | WSE-3 | WSE-2 |
|---|---|---|
| Transistors | 4 trillion | 2.6 trillion |
| AI Cores | 900,000 | 850,000 |
| On-Chip Memory | 44 GB SRAM | 40 GB SRAM |
| On-Chip Bandwidth | 21 PB/s | 20 PB/s |
| Process Node | TSMC 5nm | TSMC 7nm |
| External Memory | Up to 1.5 TB (MemoryX) | — |
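To see why on-chip SRAM sidesteps the HBM bottleneck, it helps to put the bandwidth figures on the same scale. The comparison below uses TPU v5p's 2.76 TB/s HBM2e bandwidth from this document as the HBM baseline; any HBM-class accelerator would give a similar order of magnitude.

```python
# Compare WSE-3's on-chip SRAM bandwidth to a representative HBM
# accelerator (TPU v5p's 2.76 TB/s, from the comparison table).
wse3_bw_tbs = 21_000  # 21 PB/s expressed in TB/s
hbm_bw_tbs = 2.76     # TPU v5p HBM2e bandwidth

ratio = wse3_bw_tbs / hbm_bw_tbs
print(f"WSE-3 on-chip bandwidth is roughly {ratio:,.0f}x HBM bandwidth")
```

Three to four orders of magnitude more bandwidth is what lets weight traffic stay on-wafer instead of crossing an HBM interface.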
Groq LPU: Fastest LLM Inference
Groq LPU achieves 500+ tokens/second for Llama 2 70B on GroqCloud via its deterministic TSP (Tensor Streaming Processor) architecture.
| Specification | Value |
|---|---|
| Architecture | TSP (Tensor Streaming Processor) |
| On-Chip Memory | 230 MB SRAM |
| On-Chip Bandwidth | 80 TB/s |
| Execution | Deterministic (no stochastic latency) |
| LLM Throughput | 500+ tokens/sec (Llama 2 70B) |
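For interactive use, throughput is easier to reason about as per-token latency. The sketch below is a rough single-stream conversion using the figures quoted in this document; the 120 tokens/second H100 value is simply the midpoint of the ~85–155 range cited in the FAQ below, not a measured benchmark.

```python
# Throughput -> per-token latency, a rough single-stream conversion
# using the tokens/sec figures quoted in this document.
def ms_per_token(tokens_per_sec):
    return 1000 / tokens_per_sec

print(f"Groq LPU @ 500 tok/s:  {ms_per_token(500):.1f} ms/token")
# 120 tok/s is an assumed midpoint of the ~85-155 tok/s H100+vLLM range.
print(f"H100+vLLM @ 120 tok/s: {ms_per_token(120):.2f} ms/token")
```

At 500 tokens/second, each token arrives in 2 ms, and the deterministic execution model means that latency holds without jitter.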
Frequently Asked Questions
What is Google TPU v5p memory capacity?
Google TPU v5p has 95 GB HBM2e per chip, 2.76 TB/s bandwidth, 459 TFLOPS BF16. Pods scale to 8,960 chips. Google Cloud only.
What is the Groq LPU inference speed for Llama 70B?
Groq LPU achieves 500+ tokens/second for Llama 2 70B — vs ~85–155 tokens/second on H100/H200 with vLLM. Deterministic, jitter-free latency.
Is AWS Trainium cheaper than H100 for training?
AWS Trainium2 is typically 20–40% cheaper than P5 (H100) instances for equivalent training throughput on AWS. It requires migrating to the AWS Neuron SDK (typically 2–4 weeks for most PyTorch models).
TPU vs GPU: when should I use Google TPU instead of H100?
Use Google TPU when you are on Google Cloud, training with JAX/TensorFlow, or running large-scale pretraining (pods up to 8,960 chips). Use H100 for PyTorch workloads that need maximum ecosystem flexibility.