# AI Inference Benchmark Index
Performance benchmarks for AI inference. MLPerf v4.1 official results, tokens-per-second for Llama 2 70B/13B, GPT-J 6B, Stable Diffusion XL. Quantization impact and performance-per-dollar comparison.
## MLPerf Inference v4.1 — Official Results
MLPerf Inference v4.1 results were published by MLCommons in November 2024.
### Data Center Offline Scenario — Llama 2 70B
| Accelerator | Model | Samples/Second | System | Submitter |
|---|---|---|---|---|
| B200 SXM (8×) Best | Llama 2 70B | ~210 | DGX B200 | NVIDIA |
| H200 SXM (8×) | Llama 2 70B | 118.5 | HGX H200 | NVIDIA |
| H100 SXM (8×) | Llama 2 70B | 84.2 | DGX H100 | NVIDIA |
| TPU v5e (8×) | Llama 2 70B | 45.3 | Cloud TPU | Google |
| A100 SXM (8×) | Llama 2 70B | 32.1 | DGX A100 | NVIDIA |
Source: MLCommons official results, November 2024.
## What Is H100 Inference Speed in Tokens Per Second?
All figures below are decode throughput for Llama 2 70B generation at batch size 1.
| GPU | Framework | Tokens/sec | Batch Size | Precision |
|---|---|---|---|---|
| H200 SXM 141GB | TensorRT-LLM | ~155 | 1 | FP16 |
| H200 SXM 141GB | vLLM | ~120 | 1 | FP16 |
| H100 SXM 80GB | TensorRT-LLM | ~110 | 1 | FP16 |
| H100 SXM 80GB | vLLM | ~85 | 1 | FP16 |
| A100 SXM 80GB | vLLM | ~35 | 1 | FP16 |
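Throughput figures like these are typically obtained by timing a fixed-length generation. A minimal sketch of such a harness, with a hypothetical stand-in backend (`fake_backend`) in place of a real vLLM or TensorRT-LLM call:

```python
import time

def measure_decode_throughput(generate_fn, prompt, max_new_tokens=256):
    """Time one batch-1 generation and report decode tokens/sec.

    generate_fn stands in for any backend call (vLLM, TensorRT-LLM, etc.)
    and should return the number of tokens actually generated.
    """
    start = time.perf_counter()
    n_tokens = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Hypothetical backend for illustration: emits tokens after a fixed delay.
def fake_backend(prompt, max_new_tokens):
    time.sleep(0.01)  # simulate generation latency
    return max_new_tokens

tps = measure_decode_throughput(fake_backend, "Explain KV caching.", 256)
print(f"{tps:.0f} tokens/sec")
```

In practice, warm-up iterations and averaging over several runs are needed before comparing numbers across frameworks.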
## How Does Quantization Affect Inference Performance?
FP8 roughly doubles throughput vs FP16 (~85 → ~155 tokens/sec on H100) with under 1% quality degradation. INT4 (AWQ) shrinks Llama 2 70B to ~35 GB, fitting it on a single H100 80GB.
| Precision | Tokens/sec (H100 SXM) | Memory (Llama 2 70B) | Quality Loss |
|---|---|---|---|
| FP16 (baseline) | ~85 | ~140 GB (2 GPUs) | Baseline |
| FP8 (E4M3) | ~155 | ~70 GB (1 GPU) | <1% perplexity |
| INT8 (W8A8) | ~140 | ~70 GB (1 GPU) | ~1% perplexity |
| INT4 (GPTQ) | ~190 | ~35 GB (1 GPU) | 2–3% perplexity |
| INT4 (AWQ) | ~195 | ~35 GB (1 GPU) | 1–2% perplexity |
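The memory column follows directly from parameter count times bytes per weight. A quick check of the table's arithmetic (weights only; KV cache and activations add overhead on top):

```python
# Weight-memory footprint ≈ parameter count × bits per weight ÷ 8.
PARAMS = 70e9  # Llama 2 70B

def weight_memory_gb(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.0f} GB")
# FP16 → 140 GB, FP8/INT8 → 70 GB, INT4 → 35 GB, matching the table.
```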
## Which Cloud GPU Has the Best Performance Per Dollar?
| GPU (Cloud) | Hourly Rate | Tokens/sec | Tokens/Dollar | Relative Value |
|---|---|---|---|---|
| H200 (Lambda) | $3.29/hr | ~155 | ~169,600 | 1.28× |
| H100 (Lambda) | $2.99/hr | ~110 | ~132,400 | 1.0× (baseline) |
| H100 (AWS P5) | $3.93/hr | ~110 | ~100,800 | 0.76× |
| A100 (Lambda) | $1.79/hr | ~35 | ~70,400 | 0.53× |
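Tokens/Dollar in the table is tokens/sec × 3600 seconds ÷ hourly rate. The derivation, using the table's own figures:

```python
# Tokens per dollar = tokens/sec × 3600 s/hr ÷ hourly rate.
def tokens_per_dollar(tokens_per_sec, hourly_rate):
    return tokens_per_sec * 3600 / hourly_rate

gpus = {
    "H200 (Lambda)": (155, 3.29),
    "H100 (Lambda)": (110, 2.99),
    "H100 (AWS P5)": (110, 3.93),
    "A100 (Lambda)": (35, 1.79),
}
baseline = tokens_per_dollar(*gpus["H100 (Lambda)"])  # H100 Lambda = 1.0×
for name, (tps, rate) in gpus.items():
    tpd = tokens_per_dollar(tps, rate)
    print(f"{name}: {tpd:,.0f} tokens/$  ({tpd / baseline:.2f}x)")
```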
## Frequently Asked Questions
### How many tokens per second does H100 generate for Llama 2 70B?
NVIDIA H100 SXM: ~85 tokens/sec with vLLM, ~110 tokens/sec with TensorRT-LLM for Llama 2 70B at FP16 batch=1. FP8 quantization improves this to ~155 tokens/sec.
### Which inference framework is fastest: vLLM or TensorRT-LLM?
TensorRT-LLM delivers higher single-GPU throughput (~110 vs ~85 tokens/sec, about 29% faster). vLLM is better for multi-model serving and flexible deployment.
### How does H200 compare to H100 for inference speed?
H200 is ~40% faster than H100 for Llama 2 70B — 155 vs 110 tokens/sec with TensorRT-LLM. Speedup comes from 43% more bandwidth (4.8 vs 3.35 TB/s).
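Batch-1 decode is memory-bandwidth-bound, so the H200's speedup should roughly track its bandwidth advantage. A quick check using the figures above:

```python
# Bandwidth ratio vs observed speedup, H200 over H100 (figures from the
# FAQ answer above). At batch 1 these should roughly agree, since every
# weight is re-read from HBM for each generated token.
h100_bw, h200_bw = 3.35, 4.80           # memory bandwidth, TB/s
h100_tps, h200_tps = 110, 155           # tokens/sec, TensorRT-LLM FP16

bandwidth_ratio = h200_bw / h100_bw     # ~1.43
observed_speedup = h200_tps / h100_tps  # ~1.41
print(f"bandwidth ratio {bandwidth_ratio:.2f}x, "
      f"observed speedup {observed_speedup:.2f}x")
```

The close match (1.43× vs 1.41×) is consistent with decode throughput scaling with memory bandwidth rather than compute.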
### Is FP8 quantization worth using for inference?
Yes. FP8 delivers ~155 tokens/sec vs ~85 for FP16 (1.8× improvement) on H100, using only 70 GB — fits on a single GPU. Quality loss is <1% perplexity.