Quick Reference: GPU Requirements by Model (FP16 Inference)

| Model | Parameters | VRAM Required (FP16) | Minimum GPU Config | Recommended Config |
|---|---|---|---|---|
| LLaMA 3 8B | 8B | ~16 GB | 1× RTX 4090 (24GB) | 1× A100 40GB |
| LLaMA 3 70B | 70B | ~140 GB | 2× H100 80GB | 2× H100 80GB or 1× H200 |
| LLaMA 3 405B | 405B | ~810 GB | 11× H100 80GB | 12× H100 80GB |
| Mistral 7B | 7B | ~14 GB | 1× RTX 4090 (24GB) | 1× A100 40GB |
| Mixtral 8×7B | 47B total (~13B active) | ~94 GB | 2× A100 80GB | 2× H100 80GB |
| Mixtral 8×22B | 141B total (~39B active) | ~282 GB | 4× H100 80GB | 4× H100 80GB |
| DeepSeek V3 | 671B | ~1,340 GB | 17× H100 80GB | 24× H100 80GB |
| Yi 34B | 34B | ~68 GB | 1× H100 80GB | 1× H100 80GB |
| Falcon 180B | 180B | ~360 GB | 5× H100 80GB | 8× H100 80GB |
| GPT-4 (est.) | ~1.8T | ~3,600 GB | 48× H100 80GB | 64× H100 80GB |
| Claude 3 Opus (est.) | ~2T | ~4,000 GB | 52× H100 80GB | 64× H100 80GB |
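The minimum GPU counts in the table follow from dividing total VRAM by per-GPU capacity and rounding up. A minimal sketch of that calculation (weights only; real deployments need headroom for KV cache and activations):

```python
import math

def min_gpus(model_vram_gb: float, gpu_vram_gb: float) -> int:
    """Smallest GPU count whose combined VRAM covers the model weights."""
    return math.ceil(model_vram_gb / gpu_vram_gb)

print(min_gpus(810, 80))  # LLaMA 3 405B FP16 on H100 80GB -> 11
print(min_gpus(140, 80))  # LLaMA 3 70B FP16 -> 2
```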

How Do You Calculate GPU Memory Requirements for an LLM?

VRAM (GB) = (Parameters × bytes_per_param) / 1e9

Precision bytes:
  FP32  = 4 bytes per parameter
  BF16  = 2 bytes per parameter
  FP16  = 2 bytes per parameter
  INT8  = 1 byte per parameter
  INT4  = 0.5 bytes per parameter
  NF4   = 0.5 bytes per parameter

Add ~20% overhead for KV cache, activations, runtime

Examples:
  LLaMA 3 70B in FP16: (70e9 × 2) / 1e9 = 140 GB + 20% = ~168 GB
  LLaMA 3 70B in INT4: (70e9 × 0.5) / 1e9 = 35 GB + 20% = ~42 GB
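The formula and examples above translate directly into a small estimator. A sketch, using this guide's ~20% rule-of-thumb overhead (actual overhead varies with batch size and context length):

```python
# Bytes per parameter at each precision, as listed above.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0,
                   "int8": 1.0, "int4": 0.5, "nf4": 0.5}

def inference_vram_gb(params: float, precision: str,
                      overhead: float = 0.20) -> float:
    """Estimate inference VRAM in GB for a dense model."""
    weights_gb = params * BYTES_PER_PARAM[precision] / 1e9
    return weights_gb * (1 + overhead)

print(round(inference_vram_gb(70e9, "fp16")))  # LLaMA 3 70B FP16 -> 168
print(round(inference_vram_gb(70e9, "int4")))  # LLaMA 3 70B INT4 -> 42
```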

Context Length Impact on VRAM

KV cache memory scales with sequence length. Long contexts add significant VRAM beyond model weights:

| Model | Context Length | KV Cache (batch=1) |
|---|---|---|
| LLaMA 3 8B | 8K tokens | ~0.5 GB |
| LLaMA 3 8B | 128K tokens | ~7.5 GB |
| LLaMA 3 70B | 8K tokens | ~3.5 GB |
| LLaMA 3 70B | 128K tokens | ~56 GB |
| LLaMA 3 405B | 128K tokens | ~320 GB |
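The underlying formula stores a key and a value vector per layer, KV head, and token. A sketch using LLaMA 3 8B's published architecture (32 layers, grouped-query attention with 8 KV heads of dimension 128); note that a plain FP16 cache comes out larger than the rounded estimates in the table, since deployed stacks often cache KV in lower precision (e.g. FP8) or apply other compression:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, dtype_bytes: int = 2) -> float:
    """KV cache size: one key + one value per layer, KV head, and token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token_bytes * seq_len * batch / 1e9

# LLaMA 3 8B at 128K tokens, FP16 cache:
print(kv_cache_gb(32, 8, 128, 128_000))  # ~16.8 GB uncompressed
```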

How Much VRAM Does Training Require?

Training requires significantly more VRAM than inference due to optimizer states, gradients, and activations. Full fine-tuning requires approximately 18× the parameter count in bytes (mixed precision with AdamW optimizer).

Full Fine-Tuning Memory Requirements (AdamW)

| Precision | Memory Multiplier | Formula | LLaMA 3 8B Example |
|---|---|---|---|
| FP32 | 16× params | 16 bytes/param | ~128 GB (2× A100 80GB) |
| Mixed BF16+FP32 | 18× params | 18 bytes/param | ~144 GB (2× A100 80GB) |
| Pure BF16 | 12× params | 12 bytes/param | ~96 GB (2× A100 80GB) |
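The multipliers above bundle weights, gradients, and AdamW's two optimizer moments (plus an FP32 master copy of the weights in mixed precision). A minimal sketch of the table's arithmetic; activation memory is extra and not included:

```python
# Bytes per parameter for full fine-tuning with AdamW, per the table:
#   FP32:       4 (weights) + 4 (grads) + 8 (two FP32 moments)           = 16
#   Mixed:      2 (BF16 weights) + 4 (FP32 master) + 4 (grads) + 8      = 18
#   Pure BF16:  2 (weights) + 2 (grads) + 8 (FP32 moments)              = 12
TRAIN_BYTES = {"fp32": 16, "mixed_bf16": 18, "pure_bf16": 12}

def full_finetune_gb(params: float, precision: str) -> float:
    """Weight + gradient + optimizer state for full fine-tuning, in GB."""
    return params * TRAIN_BYTES[precision] / 1e9

print(full_finetune_gb(8e9, "mixed_bf16"))  # LLaMA 3 8B -> 144.0
```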

Parameter-Efficient Fine-Tuning (PEFT)

QLoRA dramatically reduces VRAM for fine-tuning: LLaMA 3 70B with QLoRA requires only ~37 GB — fitting on a single H100 80GB. This is the most practical approach for fine-tuning large models.

| Method | Memory Overhead | Notes | LLaMA 3 70B VRAM |
|---|---|---|---|
| LoRA (r=16) | +~5% of base model | Trains only low-rank matrices | ~147 GB (FP16 base) |
| QLoRA (4-bit + LoRA) | ~0.5× params | 4-bit base + FP16 adapters | ~37 GB (1× H100 80GB) |
| IA3 | +~1% of base model | Fewer trainable params than LoRA | ~141 GB (FP16 base) |
| Prefix Tuning | +~2% of base model | Trainable prefix tokens | ~143 GB (FP16 base) |
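The QLoRA row reduces to a two-term sum: 4-bit base weights plus a small adapter budget. A sketch of that estimate; the ~2 GB adapter figure is the rough allowance used in this guide, and the true size depends on LoRA rank and which modules are targeted:

```python
def qlora_vram_gb(params: float, adapter_gb: float = 2.0) -> float:
    """4-bit (NF4, 0.5 bytes/param) base model plus FP16 LoRA adapters."""
    base_gb = params * 0.5 / 1e9
    return base_gb + adapter_gb

print(qlora_vram_gb(70e9))  # LLaMA 3 70B -> 37.0 GB, matching the table
```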

GPU Configurations for Popular Models

LLaMA 3 70B — GPU Configuration Options

| Use Case | GPU Config | Throughput |
|---|---|---|
| Inference (FP16) | 2× H100 80GB | ~2,100 tok/s |
| Inference (FP16) | 2× A100 80GB | ~1,200 tok/s |
| Inference (INT4) | 1× H100 80GB | ~1,800 tok/s |
| Inference (FP16) | 1× H200 141GB | ~2,800 tok/s |
| Fine-tune (QLoRA) | 2× A100 80GB | ~400 tok/s training |
| Fine-tune (Full FP16) | 8× A100 80GB | ~1,100 tok/s training |

LLaMA 3 405B — GPU Configuration Options

| Use Case | GPU Config | Throughput |
|---|---|---|
| Inference (FP16) | 8× H100 80GB | ~580 tok/s |
| Inference (INT4) | 4× H100 80GB | ~640 tok/s |
| Fine-tune (Full BF16) | 64× H100 80GB | ~800 tok/s training |

Frequently Asked Questions

How much VRAM does LLaMA 3 70B require?

LLaMA 3 70B requires approximately 140 GB VRAM for FP16 inference (70B × 2 bytes = 140 GB), plus ~20% overhead for KV cache = ~168 GB total. Minimum: 2× H100 80GB (160 GB) or 1× H200 141GB. With INT4 quantization (AWQ), VRAM drops to ~35 GB — fitting on a single H100 80GB.

Can LLaMA 3 70B run on a single H100?

Yes, with quantization. LLaMA 3 70B runs on a single H100 80GB with INT4 quantization (GPTQ or AWQ), using approximately 35 GB VRAM — well within the 80 GB capacity. Throughput is approximately 1,800 tokens/second vs 2,100 tokens/second for FP16 on 2× H100, with 1–2% quality degradation. FP16 inference requires 2× H100 80GB (160 GB total).

What is QLoRA and how much VRAM does it use for 70B fine-tuning?

QLoRA combines 4-bit quantization of the base model with FP16 LoRA adapters. For LLaMA 3 70B: base model in 4-bit = ~35 GB + LoRA adapters ~2 GB = ~37 GB total — fitting on a single H100 80GB. This compares to ~1,260 GB for full mixed-precision fine-tuning (18 bytes/param). QLoRA is the standard approach for fine-tuning 70B+ models on accessible hardware.

How many GPUs does LLaMA 3 405B need?

LLaMA 3 405B requires approximately 810 GB VRAM for FP16 inference. Minimum: 11× H100 80GB (880 GB). Recommended: 12× H100 80GB for headroom. With INT4 quantization, VRAM drops to ~202 GB (3× H100 80GB). Full mixed-precision fine-tuning needs ~18 bytes per parameter, roughly 7,290 GB of weight, gradient, and optimizer state — more than 64× H100 80GB (5,120 GB) can hold, so 64-GPU training runs depend on sharding the states across ranks and offloading (e.g. ZeRO/FSDP with CPU offload).

What is the difference between tensor parallelism and pipeline parallelism?

Tensor Parallelism (TP) splits individual layers across GPUs — requires high-bandwidth NVLink, scales linearly up to 8 GPUs, best for inference. Pipeline Parallelism (PP) splits model layers across GPU groups — lower bandwidth requirements, works across nodes, higher latency but enables very large models (4–16 pipeline stages). For production serving, use TP within a node (up to 8× H100) and PP across nodes.
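As a back-of-envelope check when choosing parallelism degrees, per-GPU weight memory divides across the product of TP and PP degrees. A sketch under an idealized assumption of a perfectly even split (real shards are slightly uneven, and KV cache, activations, and communication buffers are extra):

```python
def per_gpu_weight_gb(params: float, bytes_per_param: float,
                      tp: int = 1, pp: int = 1) -> float:
    """Weight memory per GPU with tp-way tensor parallelism
    and pp pipeline stages (ideal even split)."""
    return params * bytes_per_param / 1e9 / (tp * pp)

# LLaMA 3 405B in FP16, 8-way TP within a node x 2 pipeline stages:
print(per_gpu_weight_gb(405e9, 2, tp=8, pp=2))  # ~50.6 GB of weights per GPU
```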