AI Accelerator Specifications: Google TPU, AWS Trainium, Cerebras WSE, Groq LPU

AI Accelerator Comparison: Which Is Best for Your Workload?

Google TPU v5p: best for large-scale training on Google Cloud (pods up to 8,960 chips). AWS Trainium2: 20–40% cheaper than H100 for training on AWS. Cerebras WSE-3: avoids the HBM bandwidth bottleneck by computing out of on-chip SRAM. Groq LPU: 500+ tokens/sec inference on Llama 2 70B.

Accelerator	Memory	Bandwidth	Peak Performance	Architecture	Best For
Google TPU v5p	95 GB HBM2e	2.76 TB/s	459 TFLOPS BF16	SparseCore + MXU	Large-scale training
Google TPU v5e	16 GB HBM2e	819 GB/s	—	MXU	Cost-efficient inference
AWS Trainium2	96 GB HBM	2.9 TB/s	~667 TFLOPS BF16 (dense)	NeuronCore-v3	AWS-native training
AWS Inferentia2	32 GB HBM2e	820 GB/s	—	NeuronCore-v2	AWS-native inference
Cerebras WSE-3	44 GB SRAM	21 PB/s on-chip	—	Wafer-scale	Ultra-large model training
Groq LPU	230 MB SRAM	80 TB/s on-chip	—	TSP	Low-latency inference
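
The bandwidth column matters for inference because generating one token of an LLM typically means reading the model weights once. As a rough sketch only, the snippet below divides memory bandwidth by model size to get an upper bound on single-stream decode speed. It assumes batch size 1, weights read exactly once per token, and no compute or interconnect limits. The function name and the 7B-parameter BF16 model are illustrative assumptions, not figures from the table.

```python
def max_decode_tokens_per_sec(bandwidth_bytes_per_s, params, bytes_per_param=2.0):
    """Upper bound on decode speed if every weight is read once per token."""
    model_bytes = params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Memory bandwidths taken from the table above, in bytes per second.
BANDWIDTH = {
    "Google TPU v5p": 2.76e12,  # 2.76 TB/s
    "Google TPU v5e": 819e9,    # 819 GB/s
}

# Hypothetical model: 7 billion parameters stored in BF16 (2 bytes each).
for name, bandwidth in BANDWIDTH.items():
    bound = max_decode_tokens_per_sec(bandwidth, params=7e9)
    print(f"{name}: at most {bound:.0f} tokens/s per stream")
```

Real throughput is lower than this bound, and batching changes the picture, so treat the result as a way to compare chips rather than as a prediction.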

Google TPU Specifications

What are Google TPU v5p specifications?

TPU v5p delivers 459 TFLOPS BF16 per chip, and pods scale to 8,960 chips over the inter-chip interconnect (ICI). It is available exclusively on Google Cloud.

Specification	TPU v5p	TPU v5e	TPU v4
Memory per Chip	95 GB HBM2e	16 GB HBM2e	32 GB HBM2e
Memory Bandwidth	2.76 TB/s	819 GB/s	1.2 TB/s
Peak BF16	459 TFLOPS	—	275 TFLOPS
Max Pod Size	8,960 chips	256 chips	4,096 chips
Availability	Google Cloud only	Google Cloud only	Google Cloud only
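
To put the per-chip and pod-size rows together, the short sketch below multiplies them out to get theoretical pod-level totals. It uses decimal units and peak figures only. Sustained throughput in practice is lower, and the helper name `pod_totals` is an illustrative choice.

```python
def pod_totals(chips, tflops_per_chip, hbm_gb_per_chip):
    """Return (peak EFLOPS, total HBM in TB) for a full pod, in decimal units."""
    peak_eflops = chips * tflops_per_chip / 1e6  # 1 EFLOPS = 1,000,000 TFLOPS
    hbm_tb = chips * hbm_gb_per_chip / 1e3       # 1 TB = 1,000 GB
    return peak_eflops, hbm_tb


# Values from the table above.
v5p_eflops, v5p_hbm_tb = pod_totals(chips=8_960, tflops_per_chip=459, hbm_gb_per_chip=95)
v4_eflops, v4_hbm_tb = pod_totals(chips=4_096, tflops_per_chip=275, hbm_gb_per_chip=32)

print(f"TPU v5p pod: {v5p_eflops:.2f} EFLOPS BF16 peak, {v5p_hbm_tb:.1f} TB HBM")
# prints: TPU v5p pod: 4.11 EFLOPS BF16 peak, 851.2 TB HBM
print(f"TPU v4 pod:  {v4_eflops:.2f} EFLOPS BF16 peak, {v4_hbm_tb:.1f} TB HBM")
```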

Source: Google Cloud TPU v5p documentation

AWS Custom Silicon: Trainium2 and Inferentia2

What are AWS Trainium2 specifications?

AWS Trainium2 instances are typically 20–40% cheaper than P5 (H100) instances for equivalent training throughput. Workloads must be ported to the AWS Neuron SDK.

Specification	Trainium2	Inferentia2
Compute Cores	NeuronCore-v3	2x NeuronCore-v2
Memory	96 GB HBM	32 GB HBM2e
Memory Bandwidth	2.9 TB/s	820 GB/s
Supported Precisions	FP32, BF16, FP8	FP32, BF16, FP8
Instance Type	trn2.48xlarge (16 chips)	inf2.48xlarge (12 chips)

Source: AWS Trainium product page

Cerebras WSE-3: Wafer-Scale AI Training

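Cerebras WSE-3 has 4 trillion transistors, 900,000 AI cores, and 21 PB/s of on-chip bandwidth. Models that fit in its 44 GB of SRAM stay entirely on-chip, avoiding the HBM bandwidth bottleneck; larger models keep their weights in external MemoryX memory.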

Specification	WSE-3	WSE-2
Transistors	4 trillion	2.6 trillion
AI Cores	900,000	850,000
On-Chip Memory	44 GB SRAM	40 GB SRAM
On-Chip Bandwidth	21 PB/s	20 PB/s
Process Node	TSMC 5nm	TSMC 7nm
External Memory	Up to 1.5 TB (MemoryX)	—
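
The On-Chip Memory and External Memory rows suggest a simple capacity check: where would a model's weights live? The sketch below is a back-of-the-envelope classification that counts weights only. Activations, gradients, and optimizer state are ignored, which is a significant simplification for training. The function name and example model sizes are hypothetical.

```python
WSE3_SRAM_GB = 44       # On-Chip Memory row
MEMORYX_MAX_GB = 1_500  # External Memory row (1.5 TB)


def weight_placement(params, bytes_per_param=2.0):
    """Classify where the weights alone would fit, by size in decimal GB."""
    weights_gb = params * bytes_per_param / 1e9
    if weights_gb <= WSE3_SRAM_GB:
        return "on-chip SRAM"
    if weights_gb <= MEMORYX_MAX_GB:
        return "external MemoryX"
    return "exceeds listed capacity"


# Hypothetical model sizes, stored in BF16 (2 bytes per parameter).
for params in (13e9, 70e9, 1e12):
    print(f"{params / 1e9:.0f}B parameters: {weight_placement(params)}")
```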

Groq LPU: Fastest LLM Inference

Groq LPU achieves 500+ tokens/second for Llama 2 70B on GroqCloud, using its deterministic Tensor Streaming Processor (TSP) architecture.

Specification	Value
Architecture	TSP (Tensor Streaming Processor)
On-Chip Memory	230 MB SRAM
On-Chip Bandwidth	80 TB/s
Execution	Deterministic (no stochastic latency)
LLM Throughput	500+ tokens/sec (Llama 2 70B)
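
With 230 MB of SRAM per chip, a 70B-parameter model cannot sit on a single LPU. The sketch below estimates how many chips would be needed just to hold the weights. It assumes the weights live entirely in on-chip SRAM and split evenly across chips, with no room reserved for the KV cache or activations. The source does not say how many chips GroqCloud actually uses, so this is an illustration of the arithmetic and not a deployment figure.

```python
import math

SRAM_BYTES_PER_CHIP = 230e6  # On-Chip Memory row: 230 MB


def chips_to_hold_weights(params, bytes_per_param):
    """Minimum chip count so that the weights alone fit in on-chip SRAM."""
    return math.ceil(params * bytes_per_param / SRAM_BYTES_PER_CHIP)


# Llama 2 70B at two assumed storage precisions.
print("8-bit weights: ", chips_to_hold_weights(70e9, 1), "chips")
print("16-bit weights:", chips_to_hold_weights(70e9, 2), "chips")
```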

Frequently Asked Questions

What is Google TPU v5p memory capacity?

Google TPU v5p has 95 GB HBM2e per chip, 2.76 TB/s bandwidth, 459 TFLOPS BF16. Pods scale to 8,960 chips. Google Cloud only.

What is the Groq LPU inference speed for Llama 70B?

Groq LPU achieves 500+ tokens/second for Llama 2 70B, versus roughly 85–155 tokens/second on H100/H200 with vLLM. Latency is deterministic and jitter-free.
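
To see what these rates mean for a user waiting on a response, the sketch below converts tokens per second into wall-clock time for a response of a given length. It assumes a steady decode rate and ignores prompt processing and time to first token. The 1,000-token response length is an arbitrary example, and 500 tokens/second is treated as the lower end of the "500+" claim.

```python
def generation_seconds(num_tokens, tokens_per_sec):
    """Time to emit num_tokens at a constant decode rate."""
    return num_tokens / tokens_per_sec


RESPONSE_TOKENS = 1_000  # hypothetical response length

# Rates quoted in the answer above, in tokens per second.
rates = {
    "Groq LPU (500)": 500,
    "H100/H200 with vLLM, high end (155)": 155,
    "H100/H200 with vLLM, low end (85)": 85,
}

for label, rate in rates.items():
    seconds = generation_seconds(RESPONSE_TOKENS, rate)
    print(f"{label}: {seconds:.1f} s, {500 / rate:.1f}x slower than 500 tokens/s")
```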

Is AWS Trainium cheaper than H100 for training?

AWS Trainium2 is typically 20–40% cheaper than P5 (H100) for equivalent training throughput on AWS. Moving to it requires migrating to the AWS Neuron SDK, which takes 2–4 weeks for most PyTorch models.
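
Whether the migration pays off depends on how quickly the savings recover the engineering time. The sketch below computes a break-even point from the 20–40% savings and 2–4 week migration ranges quoted above. The monthly spend and weekly engineering cost are placeholder numbers invented for illustration and should be replaced with your own.

```python
def break_even_months(monthly_h100_spend, savings_fraction, migration_weeks, weekly_eng_cost):
    """Months of savings needed to pay back the one-time migration cost."""
    migration_cost = migration_weeks * weekly_eng_cost
    monthly_savings = monthly_h100_spend * savings_fraction
    return migration_cost / monthly_savings


# Placeholder inputs (assumptions, not from the source).
MONTHLY_SPEND = 50_000   # current P5 (H100) training spend per month
WEEKLY_ENG_COST = 5_000  # cost of one engineer-week

best = break_even_months(MONTHLY_SPEND, 0.40, 2, WEEKLY_ENG_COST)
worst = break_even_months(MONTHLY_SPEND, 0.20, 4, WEEKLY_ENG_COST)
print(f"Break-even between {best:.1f} and {worst:.1f} months")
```

With these placeholder inputs the migration pays back within a few months, but a small training budget or a long port pushes the break-even point out considerably.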

TPU vs GPU: when should I use Google TPU instead of H100?

Use Google TPU when you are on Google Cloud and are training with JAX/TensorFlow or running large-scale pretraining (pods up to 8,960 chips). Use H100 for PyTorch workloads that require maximum flexibility.