Model GPU Sizing Guide
How many GPUs does your model need? This guide covers GPU memory and compute requirements for training and deploying major AI models across all major GPU types, including LLaMA 3 (8B/70B/405B), Mixtral, GPT-4, and DeepSeek V3 at FP16, INT8, and INT4 precision.
Quick Reference: GPU Requirements by Model (FP16 Inference)
| Model | Parameters | VRAM Required (FP16) | Minimum GPU Config | Recommended Config |
|---|---|---|---|---|
| LLaMA 3 8B | 8B | ~16 GB | 1× RTX 4090 (24GB) | 1× A100 40GB |
| LLaMA 3 70B | 70B | ~140 GB | 2× H100 80GB | 2× H100 80GB or 1× H200 |
| LLaMA 3 405B | 405B | ~810 GB | 11× H100 80GB | 12× H100 80GB |
| Mistral 7B | 7B | ~14 GB | 1× RTX 4090 (24GB) | 1× A100 40GB |
| Mixtral 8×7B | 47B total (~13B active) | ~94 GB | 2× A100 80GB | 2× H100 80GB |
| Mixtral 8×22B | 141B total (~39B active) | ~282 GB | 4× H100 80GB | 4× H100 80GB |
| DeepSeek V3 | 671B | ~1,340 GB | 17× H100 80GB | 24× H100 80GB |
| Yi 34B | 34B | ~68 GB | 1× H100 80GB | 1× H100 80GB |
| Falcon 180B | 180B | ~360 GB | 5× H100 80GB | 8× H100 80GB |
| GPT-4 (est.) | ~1.8T | ~3,600 GB | 48× H100 80GB | 64× H100 80GB |
| Claude 3 Opus (est.) | ~2T | ~4,000 GB | 52× H100 80GB | 64× H100 80GB |
How Do You Calculate GPU Memory Requirements for an LLM?
VRAM (GB) = (Parameters × bytes_per_param) / 1e9
Precision bytes:
FP32 = 4 bytes per parameter
BF16 = 2 bytes per parameter
FP16 = 2 bytes per parameter
INT8 = 1 byte per parameter
INT4 = 0.5 bytes per parameter
NF4 = 0.5 bytes per parameter
Add ~20% overhead for KV cache, activations, and runtime buffers
Examples:
LLaMA 3 70B in FP16: (70e9 × 2) / 1e9 = 140 GB + 20% = ~168 GB
LLaMA 3 70B in INT4: (70e9 × 0.5) / 1e9 = 35 GB + 20% = ~42 GB
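The formula and examples above can be sketched as a small Python helper (the function name and the 20% default overhead are ours, chosen to match the examples; actual runtime overhead varies by serving framework):

```python
def vram_gb(params_b: float, bytes_per_param: float, overhead: float = 0.20) -> float:
    """Estimate inference VRAM in GB for a dense model.

    params_b: parameter count in billions
    bytes_per_param: 4 (FP32), 2 (FP16/BF16), 1 (INT8), 0.5 (INT4/NF4)
    overhead: fractional headroom for KV cache, activations, and runtime
    """
    weights_gb = params_b * bytes_per_param  # (params × bytes) / 1e9, params in billions
    return weights_gb * (1 + overhead)

# LLaMA 3 70B, matching the worked examples above
print(vram_gb(70, 2.0))   # FP16
print(vram_gb(70, 0.5))   # INT4
```

The same call reproduces the quick-reference table: `vram_gb(8, 2.0)` gives ~19 GB for LLaMA 3 8B in FP16, which is why a 24 GB RTX 4090 is listed as the minimum rather than a 16 GB card.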
Context Length Impact on VRAM
KV cache memory scales with sequence length. Long contexts add significant VRAM beyond model weights:
| Model | Context Length | KV Cache (batch=1) |
|---|---|---|
| LLaMA 3 8B | 8K tokens | ~0.5 GB |
| LLaMA 3 8B | 128K tokens | ~7.5 GB |
| LLaMA 3 70B | 8K tokens | ~3.5 GB |
| LLaMA 3 70B | 128K tokens | ~56 GB |
| LLaMA 3 405B | 128K tokens | ~320 GB |
How Much VRAM Does Training Require?
Training requires significantly more VRAM than inference due to optimizer states, gradients, and activations. Full fine-tuning with mixed precision and the AdamW optimizer requires approximately 18 bytes per parameter.
Full Fine-Tuning Memory Requirements (AdamW)
| Precision | Memory Multiplier | Formula | LLaMA 3 8B Example |
|---|---|---|---|
| FP32 | 16× params | 16 bytes/param | ~128 GB (2× A100 80GB) |
| Mixed BF16+FP32 | 18× params | 18 bytes/param | ~144 GB (2× A100 80GB) |
| Pure BF16 | 12× params | 12 bytes/param | ~96 GB (2× A100 80GB) |
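The three multipliers above decompose into per-parameter byte costs, sketched here (the breakdown in the comments follows the standard AdamW accounting; activation memory is workload-dependent and excluded):

```python
# Bytes per parameter for full fine-tuning with AdamW, excluding activations
BYTES_PER_PARAM = {
    "fp32":       16,  # weights 4 + gradients 4 + Adam m,v states 8
    "mixed_bf16": 18,  # bf16 weights 2 + fp32 master copy 4 + fp32 grads 4 + Adam m,v 8
    "pure_bf16":  12,  # bf16 weights 2 + bf16 gradients 2 + fp32 Adam m,v 8
}

def train_vram_gb(params_b: float, regime: str) -> float:
    """Optimizer/gradient/weight VRAM in GB; params_b is billions of parameters."""
    return params_b * BYTES_PER_PARAM[regime]

# LLaMA 3 8B, matching the table rows above
print(train_vram_gb(8, "mixed_bf16"))
```

Activation memory (reduced by gradient checkpointing) and fragmentation come on top of these figures, which is why the table's GPU configurations leave headroom.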
Parameter-Efficient Fine-Tuning (PEFT)
QLoRA dramatically reduces VRAM for fine-tuning: LLaMA 3 70B with QLoRA requires only ~37 GB — fitting on a single H100 80GB. This is the most practical approach for fine-tuning large models.
| Method | Memory Overhead | Notes | LLaMA 3 70B VRAM |
|---|---|---|---|
| LoRA (r=16) | +~5% base model | Trains only low-rank matrices | ~147 GB (FP16 base) |
| QLoRA (4-bit + LoRA) | ~0.5× params | 4-bit base + FP16 adapters | ~37 GB (1× H100 80GB) |
| IA3 | +~1% base model | Fewer trainable params than LoRA | ~141 GB (FP16 base) |
| Prefix Tuning | +~2% base model | Trainable prefix tokens | ~143 GB (FP16 base) |
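The QLoRA figure in the table can be reproduced with a back-of-envelope estimate (the ~2 GB adapter allowance is a rough assumption covering LoRA weights, their gradients, and optimizer states; the real number depends on rank and which modules are targeted, and activations are excluded):

```python
def qlora_vram_gb(params_b: float, adapter_gb: float = 2.0) -> float:
    """Rough QLoRA fine-tuning VRAM: 4-bit base weights plus FP16 adapter states."""
    base_4bit_gb = params_b * 0.5  # 4-bit (NF4) quantized base model
    return base_4bit_gb + adapter_gb

# LLaMA 3 70B, matching the ~37 GB figure above
print(qlora_vram_gb(70))
```

The same estimate suggests a 405B base in 4-bit needs roughly 205 GB plus activations, which is why QLoRA on the largest models still requires a multi-GPU node.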
GPU Configurations for Popular Models
LLaMA 3 70B — GPU Configuration Options
| Use Case | GPU Config | Throughput |
|---|---|---|
| Inference (FP16) | 2× H100 80GB | ~2,100 tok/s |
| Inference (FP16) | 2× A100 80GB | ~1,200 tok/s |
| Inference (INT4) | 1× H100 80GB | ~1,800 tok/s |
| Inference (FP16) | 1× H200 141GB | ~2,800 tok/s |
| Fine-tune (QLoRA) | 2× A100 80GB | ~400 tok/s training |
| Fine-tune (Full FP16) | 8× A100 80GB | ~1,100 tok/s training |
LLaMA 3 405B — GPU Configuration Options
| Use Case | GPU Config | Throughput |
|---|---|---|
| Inference (FP16) | 12× H100 80GB | ~580 tok/s |
| Inference (INT4) | 4× H100 80GB | ~640 tok/s |
| Fine-tune (Full BF16) | 96–128× H100 80GB | ~800 tok/s training |
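The GPU counts in these configuration tables follow from dividing total VRAM by per-GPU capacity, sketched below (the utilization factor is our addition, for leaving headroom below 100% of card memory):

```python
import math

def gpus_needed(total_vram_gb: float, gpu_capacity_gb: float = 80,
                utilization: float = 1.0) -> int:
    """Minimum GPU count to hold total_vram_gb, at a given usable fraction per card."""
    return math.ceil(total_vram_gb / (gpu_capacity_gb * utilization))

# LLaMA 3 405B FP16 weights (~810 GB) on H100 80GB cards
print(gpus_needed(810))
# LLaMA 3 70B FP16 weights (~140 GB)
print(gpus_needed(140))
```

Setting `utilization` to ~0.85 instead of 1.0 reproduces the "recommended" columns, since KV cache and runtime buffers need headroom on every card.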
Frequently Asked Questions
How much VRAM does LLaMA 3 70B require?
LLaMA 3 70B requires approximately 140 GB of VRAM for FP16 weights (70B × 2 bytes); with ~20% overhead for KV cache and activations, budget ~168 GB. Minimum: 2× H100 80GB (160 GB) or 1× H200 141GB, both of which require keeping context length and batch size moderate. With INT4 quantization (AWQ), VRAM drops to ~35 GB, fitting on a single H100 80GB.
Can LLaMA 3 70B run on a single H100?
Yes, with quantization. LLaMA 3 70B runs on a single H100 80GB with INT4 quantization (GPTQ or AWQ), using approximately 35 GB VRAM — well within the 80 GB capacity. Throughput is approximately 1,800 tokens/second vs 2,100 tokens/second for FP16 on 2× H100, with 1–2% quality degradation. FP16 inference requires 2× H100 80GB (160 GB total).
What is QLoRA and how much VRAM does it use for 70B fine-tuning?
QLoRA combines 4-bit quantization of the base model with FP16 LoRA adapters. For LLaMA 3 70B: base model in 4-bit = ~35 GB, plus LoRA adapters and their optimizer states = ~2 GB, for ~37 GB total, fitting on a single H100 80GB. This compares to ~1,120 GB for full FP32 fine-tuning (16 bytes/param) or ~1,260 GB for mixed-precision BF16+FP32 (18 bytes/param). QLoRA is the standard approach for fine-tuning 70B+ models on accessible hardware.
How many GPUs does LLaMA 3 405B need?
LLaMA 3 405B requires approximately 810 GB VRAM for FP16 inference. Minimum: 11× H100 80GB (880 GB). Recommended: 12× H100 80GB for headroom. With INT4 quantization, weights drop to ~202 GB (~243 GB with overhead), fitting on 4× H100 80GB. Full fine-tuning at 18 bytes per parameter needs ~7,290 GB for weights, gradients, and optimizer states alone, i.e. at least 92× H100 80GB; plan on 96–128 GPUs in practice.
What is the difference between tensor parallelism and pipeline parallelism?
Tensor Parallelism (TP) splits individual layers across GPUs — requires high-bandwidth NVLink, scales linearly up to 8 GPUs, best for inference. Pipeline Parallelism (PP) splits model layers across GPU groups — lower bandwidth requirements, works across nodes, higher latency but enables very large models (4–16 pipeline stages). For production serving, use TP within a node (up to 8× H100) and PP across nodes.