MLPerf Inference v4.1 — Official Results

MLPerf Inference v4.1 results, published by MLCommons, November 2024.

Data Center Offline Scenario — Llama 2 70B

Best result: NVIDIA B200 at ~210 samples/second. H200 achieves 118.5, about 40% faster than H100's 84.2. A100 achieves 32.1, roughly 2.6× slower than H100.
| Accelerator | Model | Samples/Second | System | Submitter |
|---|---|---|---|---|
| B200 SXM (8×) | Llama 2 70B | ~210 (best) | DGX B200 | NVIDIA |
| H200 SXM (8×) | Llama 2 70B | 118.5 | HGX H200 | NVIDIA |
| H100 SXM (8×) | Llama 2 70B | 84.2 | DGX H100 | NVIDIA |
| TPU v5e (8×) | Llama 2 70B | 45.3 | Cloud TPU | Google |
| A100 SXM (8×) | Llama 2 70B | 32.1 | DGX A100 | NVIDIA |
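The relative-performance claims above follow directly from the table's throughput numbers; a quick arithmetic check (values taken from the table, not remeasured):

```python
# Offline samples/sec for Llama 2 70B, from the MLPerf v4.1 table above
results = {"B200": 210, "H200": 118.5, "H100": 84.2, "TPU v5e": 45.3, "A100": 32.1}

h100 = results["H100"]
for name, sps in results.items():
    print(f"{name}: {sps / h100:.2f}x H100")  # e.g. H200 prints 1.41x
```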

Source: MLCommons official results, November 2024.

What Is H100 Inference Speed in Tokens Per Second?

The NVIDIA H100 SXM delivers ~85 tokens/sec with vLLM and ~110 tokens/sec with TensorRT-LLM for Llama 2 70B (FP16, batch=1). The H200 reaches 120–155 tokens/sec, and INT4 quantization pushes a single H100 to ~190–195 tokens/sec.
| GPU | Framework | Tokens/sec | Batch Size | Precision |
|---|---|---|---|---|
| H200 SXM 141GB | TensorRT-LLM | 155 | 1 | FP16 |
| H200 SXM 141GB | vLLM | ~120 | 1 | FP16 |
| H100 SXM 80GB | TensorRT-LLM | ~110 | 1 | FP16 |
| H100 SXM 80GB | vLLM | ~85 | 1 | FP16 |
| A100 SXM 80GB | vLLM | ~35 | 1 | FP16 |
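Tokens/sec figures depend on how generation is timed. A minimal framework-agnostic measurement sketch, where `generate` is a hypothetical callable standing in for a vLLM or TensorRT-LLM generation call (not a real API):

```python
import time

def tokens_per_second(generate, prompt: str, max_tokens: int) -> float:
    """Time one generation call and return decode throughput.

    `generate` is a hypothetical callable that returns the list of
    generated token ids; wrap your framework's generate call in it.
    """
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed
```

Batch=1 throughput isolates single-stream decode speed; production serving batches many requests, so aggregate tokens/sec across a server is far higher than these per-stream numbers.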

How Does Quantization Affect Inference Performance?

FP8 roughly doubles throughput vs FP16 with under 1% quality degradation. INT4 (AWQ) shrinks Llama 2 70B enough to fit on a single H100 80GB.

| Precision | Tokens/sec (H100 SXM) | Memory (Llama 2 70B) | Quality Loss |
|---|---|---|---|
| FP16 (baseline) | ~85 | ~140 GB (2 GPUs) | Baseline |
| FP8 (E4M3) | ~155 | ~70 GB (1 GPU) | <1% perplexity |
| INT8 (W8A8) | ~140 | ~70 GB (1 GPU) | ~1% perplexity |
| INT4 (GPTQ) | ~190 | ~35 GB (1 GPU) | 2–3% perplexity |
| INT4 (AWQ) | ~195 | ~35 GB (1 GPU) | 1–2% perplexity |
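The memory column is essentially parameter count times bytes per weight. A sketch of that back-of-envelope calculation (weights only; ignores KV cache and activation overhead):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight footprint of a dense model; excludes KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

# Llama 2 70B at the precisions in the table above
for name, bits in [("FP16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
    # prints 140 GB, 70 GB, 35 GB respectively
```

This is why FP8 is the threshold for single-GPU 70B serving: 70 GB of weights leaves headroom for KV cache on an 80 GB H100, while 140 GB of FP16 weights requires two GPUs.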

Which Cloud GPU Has the Best Performance Per Dollar?

Best performance-per-dollar: H200 on Lambda Labs ($3.29/hr) at ~169,600 tokens per dollar, 1.28× better than H100 on the same cloud (~132,400 tokens per dollar).
| GPU (Cloud) | Rate | Tokens/sec | Tokens/Dollar | Relative Value |
|---|---|---|---|---|
| H200 (Lambda) | $3.29/hr | ~155 | ~169,600 | 1.28× |
| H100 (Lambda) | $2.99/hr | ~110 | ~132,400 | 1.0× (baseline) |
| H100 (AWS P5) | $3.93/hr | ~110 | ~100,800 | 0.76× |
| A100 (Lambda) | $1.79/hr | ~35 | ~70,400 | 0.53× |
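The tokens-per-dollar column is throughput times seconds per hour, divided by the hourly rate. A sketch using the table's own numbers:

```python
def tokens_per_dollar(tokens_per_sec: float, hourly_rate: float) -> float:
    """Sustained generated tokens per dollar of GPU rental."""
    return tokens_per_sec * 3600 / hourly_rate

print(f"H200 on Lambda: {tokens_per_dollar(155, 3.29):,.0f}")  # 169,605
print(f"H100 on Lambda: {tokens_per_dollar(110, 2.99):,.0f}")  # 132,441
```

Note the same GPU can land at very different value points: the H100 rows differ only in hourly rate, which is enough to drop AWS P5 to 0.76× of Lambda's baseline.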

Frequently Asked Questions

How many tokens per second does H100 generate for Llama 2 70B?

NVIDIA H100 SXM: ~85 tokens/sec with vLLM, ~110 tokens/sec with TensorRT-LLM for Llama 2 70B at FP16 batch=1. FP8 quantization improves this to ~155 tokens/sec.

Which inference framework is fastest: vLLM or TensorRT-LLM?

TensorRT-LLM is fastest for single-GPU throughput (110 vs 85 tokens/sec, about 29% faster). vLLM is better for multi-model serving and flexible deployment.

How does H200 compare to H100 for inference speed?

H200 is ~40% faster than H100 for Llama 2 70B: 155 vs 110 tokens/sec with TensorRT-LLM. The speedup comes largely from 43% more memory bandwidth (4.8 vs 3.35 TB/s).
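Batch=1 decode is largely memory-bandwidth bound, which is why the bandwidth ratio tracks the observed speedup so closely; a quick comparison using the numbers above:

```python
bw_h200, bw_h100 = 4.8, 3.35    # memory bandwidth, TB/s
tps_h200, tps_h100 = 155, 110   # Llama 2 70B tokens/sec, TensorRT-LLM

print(f"bandwidth ratio:  {bw_h200 / bw_h100:.2f}x")   # 1.43x
print(f"observed speedup: {tps_h200 / tps_h100:.2f}x") # 1.41x
```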

Is FP8 quantization worth using for inference?

Yes. On H100, FP8 delivers ~155 tokens/sec vs ~85 for FP16 (a 1.8× improvement) while using only ~70 GB, so Llama 2 70B fits on a single GPU. Quality loss is under 1% perplexity.