# AI Infrastructure Index - Full Content for AI/LLM Consumption
# https://alpha-one-index.github.io/ai-infra-index/
# Maintained by Alpha One Index | MIT License | Last updated: 2026-03-01

---

# GPU SPECIFICATIONS

## NVIDIA Data Center GPUs

### NVIDIA B200 (Blackwell) - 2025
- Architecture: Blackwell (TSMC 4NP)
- GPU Memory: 192 GB HBM3e
- Memory Bandwidth: 8.0 TB/s
- FP32 Performance: 80 TFLOPS
- FP16 / BF16 Performance: 4,500 TFLOPS
- FP8 / FP4 Performance: 9,000 / 18,000 TFLOPS
- TDP: 1,000W
- Interconnect: NVLink 5.0 (1,800 GB/s)
- PCIe: Gen 6.0
- CUDA Cores: 21,760
- MSRP: ~$40,000

### NVIDIA H200 SXM - 2024
- Architecture: Hopper (TSMC 4N)
- GPU Memory: 141 GB HBM3e
- Memory Bandwidth: 4.8 TB/s
- FP16 / BF16 Performance: 989 TFLOPS (same as H100 - memory upgrade only)
- FP8 Performance: 1,979 TFLOPS
- TDP: 700W
- Interconnect: NVLink 4.0 (900 GB/s)
- CUDA Cores: 16,896
- MSRP: ~$35,000
- Key differences vs H100: 76% more memory (141 vs 80 GB), 43% more bandwidth (4.8 vs 3.35 TB/s)

### NVIDIA H100 SXM - 2023
- Architecture: Hopper (TSMC 4N)
- GPU Memory: 80 GB HBM3
- Memory Bandwidth: 3.35 TB/s
- FP32 Performance: 67 TFLOPS
- FP16 / BF16 Performance: 989 TFLOPS
- FP8 Performance: 1,979 TFLOPS
- TDP: 700W
- Interconnect: NVLink 4.0 (900 GB/s)
- PCIe: Gen 5.0
- CUDA Cores: 16,896
- MSRP: ~$30,000

### NVIDIA H100 PCIe - 2023
- GPU Memory: 80 GB HBM3
- Memory Bandwidth: 2.0 TB/s (vs 3.35 TB/s SXM)
- FP16 Performance: 756 TFLOPS (vs 989 SXM)
- TDP: 350W (vs 700W SXM)

### NVIDIA A100 SXM 80GB - 2020
- Architecture: Ampere (TSMC 7N)
- GPU Memory: 80 GB HBM2e
- Memory Bandwidth: 2.0 TB/s
- FP16 / BF16 Performance: 312 TFLOPS
- TDP: 400W
- Interconnect: NVLink 3.0 (600 GB/s)
- Status: EOL (February 2024)

### NVIDIA L40S - 2023
- Architecture: Ada Lovelace (TSMC 4N)
- GPU Memory: 48 GB GDDR6
- Memory Bandwidth: 864 GB/s
- FP16 Performance: 733 TFLOPS
- FP8 Performance: 1,466 TFLOPS
- TDP: 350W
- Best for: Cost-efficient inference for models <=13B parameters

## AMD Instinct Data Center GPUs

### AMD Instinct MI300X - 2024
- Architecture: CDNA 3 (TSMC 5nm + 6nm chiplet)
- GPU Memory: 192 GB HBM3
- Memory Bandwidth: 5.3 TB/s
- FP16 Performance: 1,307 TFLOPS
- FP8 Performance: 2,614 TFLOPS
- TDP: 750W
- Interconnect: Infinity Fabric (896 GB/s)
- MSRP: ~$15,000
- vs H100: 2.4x the memory, 1.58x the bandwidth, 1.32x the FP16 TFLOPS

### AMD Instinct MI325X
- GPU Memory: 256 GB HBM3e (highest-memory GPU in this index)
- TDP: 750W

## Intel Data Center GPUs

### Intel Gaudi 3 - 2025
- Memory: 128 GB HBM2e
- Memory Bandwidth: 3.7 TB/s
- BF16 Performance: 1,835 TFLOPS
- FP8 Performance: 3,670 TFLOPS
- TDP: 900W
- Networking: 24x 200GbE RoCE v2

### Intel Gaudi 2 - 2023
- Memory: 96 GB HBM2e
- Memory Bandwidth: 2.45 TB/s
- BF16 Performance: 432 TFLOPS
- TDP: 600W

---

# CLOUD GPU PRICING (USD per GPU-hour, March 2026)

## H100 SXM 80GB
- Vast.ai: $1.87-$3.50/hr
- GMI Cloud: $2.10/hr
- RunPod: $2.49/hr, spot $1.89/hr
- Lambda Labs: $2.99/hr
- Google Cloud A3-High: $3.67/hr, spot $2.25/hr
- AWS P5: $3.93/hr, spot $2.50/hr
- Azure ND H100 v5: $3.50-$5.00/hr
- CoreWeave HGX H100: $6.15/hr

## H200 141GB
- Lambda Labs: $3.29/hr
- GMI Cloud: $3.35/hr
- RunPod: $3.59/hr
- CoreWeave: $6.31/hr

## B200 192GB (Early 2026)
- Lambda Labs: $4.99/hr
- RunPod: $5.98/hr
- CoreWeave: $8.60/hr
- AWS P6: ~$14.00/hr

## A100 SXM 80GB
- Vast.ai: $0.80-$1.50/hr
- RunPod: $1.39/hr, spot $0.79/hr
- Lambda Labs: $1.79/hr
- AWS P4d: $2.75/hr

## H100 Price History
- Q4 2023: $8.00-$10.00/hr (hyperscalers), $4.00-$5.00/hr (specialist)
- Q2 2024: $6.50-$8.00/hr (hyperscalers)
- Q4 2024: $5.00-$6.50/hr (hyperscalers)
- Q2 2025: $3.50-$4.50/hr (after AWS's 44% price cut in June 2025)
- Q1 2026: $3.50-$4.00/hr (hyperscalers), $1.87-$3.00/hr (specialist)
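The hourly rates above can be combined with the per-GPU decode throughput figures listed under Inference Benchmarks later in this index to reproduce the Performance Per Dollar numbers. A minimal Python sketch; the helper names are illustrative (not part of the index), and the example inputs are drawn from this index's own tables:

```python
# Illustrative helpers (hypothetical names, not part of the index): convert a
# provider's hourly rate and a sustained decode throughput into tokens-per-dollar,
# the metric used in the "Performance Per Dollar" section below.

def tokens_per_dollar(tokens_per_sec: float, hourly_rate_usd: float) -> float:
    """Tokens generated per USD at a given throughput and GPU-hour price."""
    return tokens_per_sec * 3600 / hourly_rate_usd

def monthly_cost(hourly_rate_usd: float, gpus: int = 1, hours: float = 730) -> float:
    """Approximate cost of one month (~730 hours) of continuous on-demand use."""
    return hourly_rate_usd * gpus * hours

if __name__ == "__main__":
    # Throughput figures from the Llama 2 70B tokens/second table in this index.
    print(f"H200 @ Lambda: {tokens_per_dollar(155, 3.29):,.0f} tok/$")  # index lists ~169,600
    print(f"H100 @ Lambda: {tokens_per_dollar(110, 2.99):,.0f} tok/$")  # index lists ~132,400
    print(f"H100 @ AWS P5: {tokens_per_dollar(110, 3.93):,.0f} tok/$")  # index lists ~100,800
    print(f"8x H100 @ AWS P5, 1 month: ${monthly_cost(3.93, gpus=8):,.0f}")
```

Monthly figures assume continuous on-demand use; spot and reserved pricing (see the savings ranges under GPU Cost Optimization below) change the picture considerably.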
---

# AI ACCELERATOR SPECIFICATIONS

## Google TPU

### TPU v5p
- Memory: 95 GB HBM2e per chip, 2.76 TB/s
- BF16: 459 TFLOPS, INT8: 918 TOPS
- Max Pod: 8,960 chips

### TPU v5e
- Memory: 16 GB HBM2e, 819 GB/s
- Max Pod: 256 chips; best for cost-efficient inference

## AWS Custom Silicon

### Trainium2
- Memory: 96 GB HBM, 2.4 TB/s
- Instance: trn2.48xlarge (16 chips)
- Cost: 20-40% cheaper than H100 on AWS

### Inferentia2
- Memory: 32 GB HBM2e, 2.4 TB/s

## Cerebras WSE-3
- Transistors: 4 trillion, AI Cores: 900,000
- On-Chip Memory: 44 GB SRAM
- On-Chip Bandwidth: 21 PB/s
- External Memory: 1.5 TB up to 1.2 PB via MemoryX

## Groq LPU
- Architecture: TSP (Tensor Streaming Processor), On-Chip: 230 MB SRAM, 80 TB/s
- LLM throughput: 500+ tokens/sec (Llama 2 70B)
- Inference only (not for training)

---

# INFERENCE BENCHMARKS

## MLPerf v4.1 (November 2024) - Llama 2 70B Offline
- B200 SXM (8x): ~210 samples/sec
- H200 SXM (8x): 118.5 samples/sec
- H100 SXM (8x): 84.2 samples/sec
- TPU v5e (8x): 45.3 samples/sec
- A100 SXM (8x): 32.1 samples/sec
- Gaudi 2 (8x): 22.8 samples/sec

## Tokens/Second - Llama 2 70B
- H200 TensorRT-LLM FP16: ~155 tok/sec
- H100 TensorRT-LLM FP16: ~110 tok/sec
- H100 vLLM FP16: ~85 tok/sec
- A100 vLLM FP16: ~35 tok/sec

## Quantization (H100 SXM, Llama 2 70B)
- FP16: ~85 tok/sec, ~140 GB VRAM
- FP8: ~155 tok/sec, ~70 GB VRAM, <1% quality loss
- INT4 AWQ: ~195 tok/sec, ~35 GB VRAM, 1-2% quality loss

## Performance Per Dollar
- H200 Lambda $3.29/hr: ~169,600 tokens/dollar (best)
- H100 Lambda $2.99/hr: ~132,400 tokens/dollar
- H100 AWS $3.93/hr: ~100,800 tokens/dollar

---

# MODEL GPU SIZING GUIDE

## VRAM Requirements (FP16 Inference)
- Llama 3 8B: ~16 GB - 1x RTX 4090
- Llama 3 70B: ~140 GB - 2x H100 80GB or 1x H200
- Llama 3 405B: ~810 GB - 11x H100 minimum
- Mixtral 8x7B: ~94 GB - 2x A100 80GB
- DeepSeek V3 (671B): ~1,340 GB - 17x H100 minimum

## VRAM Formula
VRAM (GB) = (Parameters x bytes_per_param) / 1e9 x 1.2
- FP32=4 bytes, BF16/FP16=2 bytes, INT8=1 byte, INT4=0.5 bytes
- The 1.2 factor adds ~20% headroom for KV cache and activations

## Training Memory
- Full fine-tuning (AdamW): 18x parameter count in bytes
- QLoRA: ~0.5x parameter count

---

# NETWORKING & INTERCONNECTS

## NVLink
- NVLink 3.0 (A100): 600 GB/s
- NVLink 4.0 (H100/H200): 900 GB/s
- NVLink 5.0 (B200/GB200): 1,800 GB/s

## InfiniBand
- HDR: 200 Gb/s
- NDR: 400 Gb/s (H100/H200 clusters)
- XDR: 800 Gb/s (B200/GB200 clusters)

---

# AI TRAINING COSTS

## Historical Costs
- GPT-3 175B (2020): ~$4.6M
- LLaMA 2 70B (2023): ~$2.1M
- GPT-4 est. (2023): ~$63M
- LLaMA 3 70B (2024): ~$7.7M
- DeepSeek V3 (2024): ~$5.5M

## Formula
GPU_Hours = (6 x Parameters x Tokens) / (GPU_FLOPS x MFU x 3600)
Cost = GPU_Hours x Hourly_GPU_Rate

---

# GPU COST OPTIMIZATION

## Key Savings
- Right-sizing (H100 to L4 for 7B models): 60-70%
- FP8 quantization: ~50%
- Spot vs on-demand: 35-60%
- Reserved 1-year: 40-46%
- Provider arbitrage (AWS to RunPod): 36-52%

---

# BUY VS RENT

## TCO (H100, 85% utilization, 3-year)
- 8 GPUs: $12K/mo on-prem vs $57K/mo cloud (79% cheaper on-prem)
- 64 GPUs: $78K/mo vs $282K/mo (72% cheaper)

## When to Choose Cloud
- Utilization below 50%, horizon under 12 months, variable workloads

## When to Choose On-Premise
- Utilization above 70%, 24+ month horizon, annual cloud bill over $1M

---

Repository: https://github.com/alpha-one-index/ai-infra-index
Last updated: 2026-03-01
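Appendix: the VRAM formula from the Model GPU Sizing Guide and the training-cost formula from AI Training Costs, expressed as a minimal Python sketch. The function names and the example training run (token count, MFU, hourly rate) are illustrative assumptions, not figures from the index.

```python
# Sketch of the two formulas defined in this index. Helper names and the
# example inputs below are illustrative assumptions, not published figures.

def inference_vram_gb(params: float, bytes_per_param: float = 2.0) -> float:
    """VRAM (GB) = params * bytes_per_param / 1e9 * 1.2 (FP16=2, INT8=1, INT4=0.5)."""
    return params * bytes_per_param / 1e9 * 1.2

def training_cost_usd(params: float, tokens: float, gpu_flops: float,
                      mfu: float, hourly_rate: float):
    """GPU_Hours = (6 * params * tokens) / (gpu_flops * mfu * 3600); Cost = GPU_Hours * rate."""
    gpu_hours = (6 * params * tokens) / (gpu_flops * mfu * 3600)
    return gpu_hours, gpu_hours * hourly_rate

if __name__ == "__main__":
    # 70B model served in FP16: 168 GB once the 1.2 overhead factor is applied.
    print(f"70B FP16 inference: {inference_vram_gb(70e9):.0f} GB")
    # Illustrative training run: 70B params, 2T tokens, H100 BF16 (989e12 FLOPS),
    # 40% MFU, $2.99/hr -- assumptions chosen only to exercise the formula.
    hours, cost = training_cost_usd(70e9, 2e12, 989e12, 0.40, 2.99)
    print(f"GPU-hours: {hours:,.0f}  cost: ${cost:,.0f}")
```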