MLPerf Inference v4.1 — Official Results

MLPerf Inference v4.1 results, published by MLCommons, November 2024.

Data Center Offline Scenario — Llama 2 70B

Best result: NVIDIA B200 at ~210 samples/second. H200 achieves 118.5, about 40% faster than H100's 84.2. A100 achieves 32.1, roughly 2.6× slower than H100.
| Accelerator | Model | Samples/Second | System | Submitter |
|---|---|---|---|---|
| B200 SXM (8×) | Llama 2 70B | ~210 (best) | DGX B200 | NVIDIA |
| H200 SXM (8×) | Llama 2 70B | 118.5 | HGX H200 | NVIDIA |
| H100 SXM (8×) | Llama 2 70B | 84.2 | DGX H100 | NVIDIA |
| TPU v5e (8×) | Llama 2 70B | 45.3 | Cloud TPU | Google |
| A100 SXM (8×) | Llama 2 70B | 32.1 | DGX A100 | NVIDIA |
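The relative-performance claims above follow directly from the table's throughput numbers; a quick arithmetic check (values taken from the table, not remeasured):

```python
# Offline samples/sec for Llama 2 70B, from the MLPerf v4.1 table above
results = {"B200": 210, "H200": 118.5, "H100": 84.2, "TPU v5e": 45.3, "A100": 32.1}

h100 = results["H100"]
for name, sps in results.items():
    print(f"{name}: {sps / h100:.2f}x H100")  # e.g. H200 prints 1.41x
```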

Source: MLCommons official results, November 2024.

What Is H100 Inference Speed in Tokens Per Second?

The NVIDIA H100 SXM delivers ~85 tokens/sec with vLLM and ~110 tokens/sec with TensorRT-LLM for Llama 2 70B (FP16, batch=1). The H200 reaches 120–155 tokens/sec, and INT4 quantization pushes a single H100 to ~190–195 tokens/sec.
| GPU | Framework | Tokens/sec | Batch Size | Precision |
|---|---|---|---|---|
| H200 SXM 141GB | TensorRT-LLM | 155 | 1 | FP16 |
| H200 SXM 141GB | vLLM | ~120 | 1 | FP16 |
| H100 SXM 80GB | TensorRT-LLM | ~110 | 1 | FP16 |
| H100 SXM 80GB | vLLM | ~85 | 1 | FP16 |
| A100 SXM 80GB | vLLM | ~35 | 1 | FP16 |
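Tokens/sec figures depend on how generation is timed. A minimal framework-agnostic measurement sketch, where `generate` is a hypothetical callable standing in for a vLLM or TensorRT-LLM generation call (not a real API):

```python
import time

def tokens_per_second(generate, prompt: str, max_tokens: int) -> float:
    """Time one generation call and return decode throughput.

    `generate` is a hypothetical callable that returns the list of
    generated token ids; wrap your framework's generate call in it.
    """
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed
```

Batch=1 throughput isolates single-stream decode speed; production serving batches many requests, so aggregate tokens/sec across a server is far higher than these per-stream numbers.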

How Does Quantization Affect Inference Performance?

FP8 roughly doubles throughput vs FP16 with under 1% quality degradation. INT4 (AWQ) shrinks Llama 2 70B enough to fit on a single H100 80GB.

| Precision | Tokens/sec (H100 SXM) | Memory (Llama 2 70B) | Quality Loss |
|---|---|---|---|
| FP16 (baseline) | ~85 | ~140 GB (2 GPUs) | Baseline |
| FP8 (E4M3) | ~155 | ~70 GB (1 GPU) | <1% perplexity |
| INT8 (W8A8) | ~140 | ~70 GB (1 GPU) | ~1% perplexity |
| INT4 (GPTQ) | ~190 | ~35 GB (1 GPU) | 2–3% perplexity |
| INT4 (AWQ) | ~195 | ~35 GB (1 GPU) | 1–2% perplexity |
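The memory column is essentially parameter count times bytes per weight. A sketch of that back-of-envelope calculation (weights only; ignores KV cache and activation overhead):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight footprint of a dense model; excludes KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

# Llama 2 70B at the precisions in the table above
for name, bits in [("FP16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
    # prints 140 GB, 70 GB, 35 GB respectively
```

This is why FP8 is the threshold for single-GPU 70B serving: 70 GB of weights leaves headroom for KV cache on an 80 GB H100, while 140 GB of FP16 weights requires two GPUs.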

Which Cloud GPU Has the Best Performance Per Dollar?

Best performance-per-dollar: H200 on Lambda Labs ($3.29/hr) at ~169,600 tokens per dollar, 1.28× better than H100 on the same cloud (~132,400 tokens per dollar).
| GPU (Cloud) | Rate | Tokens/sec | Tokens/Dollar | Relative Value |
|---|---|---|---|---|
| H200 (Lambda) | $3.29/hr | ~155 | ~169,600 | 1.28× |
| H100 (Lambda) | $2.99/hr | ~110 | ~132,400 | 1.0× (baseline) |
| H100 (AWS P5) | $3.93/hr | ~110 | ~100,800 | 0.76× |
| A100 (Lambda) | $1.79/hr | ~35 | ~70,400 | 0.53× |
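The tokens-per-dollar column is throughput times seconds per hour, divided by the hourly rate. A sketch using the table's own numbers:

```python
def tokens_per_dollar(tokens_per_sec: float, hourly_rate: float) -> float:
    """Sustained generated tokens per dollar of GPU rental."""
    return tokens_per_sec * 3600 / hourly_rate

print(f"H200 on Lambda: {tokens_per_dollar(155, 3.29):,.0f}")  # 169,605
print(f"H100 on Lambda: {tokens_per_dollar(110, 2.99):,.0f}")  # 132,441
```

Note the same GPU can land at very different value points: the H100 rows differ only in hourly rate, which is enough to drop AWS P5 to 0.76× of Lambda's baseline.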

Frequently Asked Questions

How many tokens per second does H100 generate for Llama 2 70B?

NVIDIA H100 SXM: ~85 tokens/sec with vLLM, ~110 tokens/sec with TensorRT-LLM for Llama 2 70B at FP16 batch=1. FP8 quantization improves this to ~155 tokens/sec.

Which inference framework is fastest: vLLM or TensorRT-LLM?

TensorRT-LLM is fastest for single-GPU throughput (110 vs 85 tokens/sec, about 29% faster). vLLM is better for multi-model serving and flexible deployment.

How does H200 compare to H100 for inference speed?

H200 is ~40% faster than H100 for Llama 2 70B: 155 vs 110 tokens/sec with TensorRT-LLM. The speedup comes largely from 43% more memory bandwidth (4.8 vs 3.35 TB/s).
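Batch=1 decode is largely memory-bandwidth bound, which is why the bandwidth ratio tracks the observed speedup so closely; a quick comparison using the numbers above:

```python
bw_h200, bw_h100 = 4.8, 3.35    # memory bandwidth, TB/s
tps_h200, tps_h100 = 155, 110   # Llama 2 70B tokens/sec, TensorRT-LLM

print(f"bandwidth ratio:  {bw_h200 / bw_h100:.2f}x")   # 1.43x
print(f"observed speedup: {tps_h200 / tps_h100:.2f}x") # 1.41x
```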

Is FP8 quantization worth using for inference?

Yes. On H100, FP8 delivers ~155 tokens/sec vs ~85 for FP16 (a 1.8× improvement) while using only ~70 GB, so Llama 2 70B fits on a single GPU. Quality loss is under 1% perplexity.