Model GPU Sizing Guide
How many GPUs does your model need? This guide covers GPU memory and compute requirements for training and deploying major AI models across all major GPU types, including LLaMA 3 (8B/70B/405B), Mixtral, GPT-4, and DeepSeek V3 at FP16, INT8, and INT4 precision.
Quick Reference: GPU Requirements by Model (FP16 Inference)
| Model | Parameters | VRAM Required (FP16) | Minimum GPU Config | Recommended Config |
|---|---|---|---|---|
| LLaMA 3 8B | 8B | ~16 GB | 1× RTX 4090 (24GB) | 1× A100 40GB |
| LLaMA 3 70B | 70B | ~140 GB | 2× H100 80GB | 2× H100 80GB or 1× H200 |
| LLaMA 3 405B | 405B | ~810 GB | 11× H100 80GB | 12× H100 80GB |
| Mistral 7B | 7B | ~14 GB | 1× RTX 4090 (24GB) | 1× A100 40GB |
| Mixtral 8×7B | 47B total (~13B active) | ~94 GB | 2× A100 80GB | 2× H100 80GB |
| Mixtral 8×22B | 141B total (~39B active) | ~282 GB | 4× H100 80GB | 4× H100 80GB |
| DeepSeek V3 | 671B | ~1,340 GB | 17× H100 80GB | 24× H100 80GB |
| Yi 34B | 34B | ~68 GB | 1× H100 80GB | 1× H100 80GB |
| Falcon 180B | 180B | ~360 GB | 5× H100 80GB | 8× H100 80GB |
| GPT-4 (est.) | ~1.8T | ~3,600 GB | 48× H100 80GB | 64× H100 80GB |
| Claude 3 Opus (est.) | ~2T | ~4,000 GB | 52× H100 80GB | 64× H100 80GB |
How Do You Calculate GPU Memory Requirements for an LLM?
VRAM (GB) = (Parameters × bytes_per_param) / 1e9
Precision bytes:
FP32 = 4 bytes per parameter
BF16 = 2 bytes per parameter
FP16 = 2 bytes per parameter
INT8 = 1 byte per parameter
INT4 = 0.5 bytes per parameter
NF4 = 0.5 bytes per parameter
Add ~20% overhead for KV cache, activations, and runtime buffers
Examples:
LLaMA 3 70B in FP16: (70e9 × 2) / 1e9 = 140 GB + 20% = ~168 GB
LLaMA 3 70B in INT4: (70e9 × 0.5) / 1e9 = 35 GB + 20% = ~42 GB
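The formula and examples above can be sketched as a small Python helper (the function name and the 20% default overhead are ours, chosen to match the examples; actual runtime overhead varies by serving framework):

```python
def vram_gb(params_b: float, bytes_per_param: float, overhead: float = 0.20) -> float:
    """Estimate inference VRAM in GB for a dense model.

    params_b: parameter count in billions
    bytes_per_param: 4 (FP32), 2 (FP16/BF16), 1 (INT8), 0.5 (INT4/NF4)
    overhead: fractional headroom for KV cache, activations, and runtime
    """
    weights_gb = params_b * bytes_per_param  # (params × bytes) / 1e9, params in billions
    return weights_gb * (1 + overhead)

# LLaMA 3 70B, matching the worked examples above
print(vram_gb(70, 2.0))   # FP16
print(vram_gb(70, 0.5))   # INT4
```

The same call reproduces the quick-reference table: `vram_gb(8, 2.0)` gives ~19 GB for LLaMA 3 8B in FP16, which is why a 24 GB RTX 4090 is listed as the minimum rather than a 16 GB card.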
Context Length Impact on VRAM
KV cache memory scales with sequence length. Long contexts add significant VRAM beyond model weights:
| Model | Context Length | KV Cache (batch=1) |
|---|---|---|
| LLaMA 3 8B | 8K tokens | ~0.5 GB |
| LLaMA 3 8B | 128K tokens | ~7.5 GB |
| LLaMA 3 70B | 8K tokens | ~3.5 GB |
| LLaMA 3 70B | 128K tokens | ~56 GB |
| LLaMA 3 405B | 128K tokens | ~320 GB |
How Much VRAM Does Training Require?
Training requires significantly more VRAM than inference due to optimizer states, gradients, and activations. Full fine-tuning with mixed precision and the AdamW optimizer requires approximately 18 bytes per parameter.
Full Fine-Tuning Memory Requirements (AdamW)
| Precision | Memory Multiplier | Formula | LLaMA 3 8B Example |
|---|---|---|---|
| FP32 | 16× params | 16 bytes/param | ~128 GB (2× A100 80GB) |
| Mixed BF16+FP32 | 18× params | 18 bytes/param | ~144 GB (2× A100 80GB) |
| Pure BF16 | 12× params | 12 bytes/param | ~96 GB (2× A100 80GB) |
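The three multipliers above decompose into per-parameter byte costs, sketched here (the breakdown in the comments follows the standard AdamW accounting; activation memory is workload-dependent and excluded):

```python
# Bytes per parameter for full fine-tuning with AdamW, excluding activations
BYTES_PER_PARAM = {
    "fp32":       16,  # weights 4 + gradients 4 + Adam m,v states 8
    "mixed_bf16": 18,  # bf16 weights 2 + fp32 master copy 4 + fp32 grads 4 + Adam m,v 8
    "pure_bf16":  12,  # bf16 weights 2 + bf16 gradients 2 + fp32 Adam m,v 8
}

def train_vram_gb(params_b: float, regime: str) -> float:
    """Optimizer/gradient/weight VRAM in GB; params_b is billions of parameters."""
    return params_b * BYTES_PER_PARAM[regime]

# LLaMA 3 8B, matching the table rows above
print(train_vram_gb(8, "mixed_bf16"))
```

Activation memory (reduced by gradient checkpointing) and fragmentation come on top of these figures, which is why the table's GPU configurations leave headroom.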
Parameter-Efficient Fine-Tuning (PEFT)
QLoRA dramatically reduces VRAM for fine-tuning: LLaMA 3 70B with QLoRA requires only ~37 GB — fitting on a single H100 80GB. This is the most practical approach for fine-tuning large models.
| Method | Memory Overhead | Notes | LLaMA 3 70B VRAM |
|---|---|---|---|
| LoRA (r=16) | +~5% base model | Trains only low-rank matrices | ~147 GB (FP16 base) |
| QLoRA (4-bit + LoRA) | ~0.5× params | 4-bit base + FP16 adapters | ~37 GB (1× H100 80GB) |
| IA3 | +~1% base model | Fewer trainable params than LoRA | ~141 GB (FP16 base) |
| Prefix Tuning | +~2% base model | Trainable prefix tokens | ~143 GB (FP16 base) |
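The QLoRA figure in the table can be reproduced with a back-of-envelope estimate (the ~2 GB adapter allowance is a rough assumption covering LoRA weights, their gradients, and optimizer states; the real number depends on rank and which modules are targeted, and activations are excluded):

```python
def qlora_vram_gb(params_b: float, adapter_gb: float = 2.0) -> float:
    """Rough QLoRA fine-tuning VRAM: 4-bit base weights plus FP16 adapter states."""
    base_4bit_gb = params_b * 0.5  # 4-bit (NF4) quantized base model
    return base_4bit_gb + adapter_gb

# LLaMA 3 70B, matching the ~37 GB figure above
print(qlora_vram_gb(70))
```

The same estimate suggests a 405B base in 4-bit needs roughly 205 GB plus activations, which is why QLoRA on the largest models still requires a multi-GPU node.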
GPU Configurations for Popular Models
LLaMA 3 70B — GPU Configuration Options
| Use Case | GPU Config | Throughput |
|---|---|---|
| Inference (FP16) | 2× H100 80GB | ~2,100 tok/s |
| Inference (FP16) | 2× A100 80GB | ~1,200 tok/s |
| Inference (INT4) | 1× H100 80GB | ~1,800 tok/s |
| Inference (FP16) | 1× H200 141GB | ~2,800 tok/s |
| Fine-tune (QLoRA) | 2× A100 80GB | ~400 tok/s training |
| Fine-tune (Full FP16) | 8× A100 80GB | ~1,100 tok/s training |
LLaMA 3 405B — GPU Configuration Options
| Use Case | GPU Config | Throughput |
|---|---|---|
| Inference (FP16) | 12× H100 80GB | ~580 tok/s |
| Inference (INT4) | 4× H100 80GB | ~640 tok/s |
| Fine-tune (Full BF16) | 96–128× H100 80GB | ~800 tok/s training |
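The GPU counts in these configuration tables follow from dividing total VRAM by per-GPU capacity, sketched below (the utilization factor is our addition, for leaving headroom below 100% of card memory):

```python
import math

def gpus_needed(total_vram_gb: float, gpu_capacity_gb: float = 80,
                utilization: float = 1.0) -> int:
    """Minimum GPU count to hold total_vram_gb, at a given usable fraction per card."""
    return math.ceil(total_vram_gb / (gpu_capacity_gb * utilization))

# LLaMA 3 405B FP16 weights (~810 GB) on H100 80GB cards
print(gpus_needed(810))
# LLaMA 3 70B FP16 weights (~140 GB)
print(gpus_needed(140))
```

Setting `utilization` to ~0.85 instead of 1.0 reproduces the "recommended" columns, since KV cache and runtime buffers need headroom on every card.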
Frequently Asked Questions
How much VRAM does LLaMA 3 70B require?
LLaMA 3 70B requires approximately 140 GB of VRAM for FP16 weights (70B × 2 bytes); with ~20% overhead for KV cache and activations, budget ~168 GB. Minimum: 2× H100 80GB (160 GB) or 1× H200 141GB, both of which require keeping context length and batch size moderate. With INT4 quantization (AWQ), VRAM drops to ~35 GB, fitting on a single H100 80GB.
Can LLaMA 3 70B run on a single H100?
Yes, with quantization. LLaMA 3 70B runs on a single H100 80GB with INT4 quantization (GPTQ or AWQ), using approximately 35 GB VRAM — well within the 80 GB capacity. Throughput is approximately 1,800 tokens/second vs 2,100 tokens/second for FP16 on 2× H100, with 1–2% quality degradation. FP16 inference requires 2× H100 80GB (160 GB total).
What is QLoRA and how much VRAM does it use for 70B fine-tuning?
QLoRA combines 4-bit quantization of the base model with FP16 LoRA adapters. For LLaMA 3 70B: base model in 4-bit = ~35 GB, plus LoRA adapters and their optimizer states = ~2 GB, for ~37 GB total, fitting on a single H100 80GB. This compares to ~1,120 GB for full FP32 fine-tuning (16 bytes/param) or ~1,260 GB for mixed-precision BF16+FP32 (18 bytes/param). QLoRA is the standard approach for fine-tuning 70B+ models on accessible hardware.
How many GPUs does LLaMA 3 405B need?
LLaMA 3 405B requires approximately 810 GB VRAM for FP16 inference. Minimum: 11× H100 80GB (880 GB). Recommended: 12× H100 80GB for headroom. With INT4 quantization, weights drop to ~202 GB (~243 GB with overhead), fitting on 4× H100 80GB. Full fine-tuning at 18 bytes per parameter needs ~7,290 GB for weights, gradients, and optimizer states alone, i.e. at least 92× H100 80GB; plan on 96–128 GPUs in practice.
What is the difference between tensor parallelism and pipeline parallelism?
Tensor Parallelism (TP) splits individual layers across GPUs — requires high-bandwidth NVLink, scales linearly up to 8 GPUs, best for inference. Pipeline Parallelism (PP) splits model layers across GPU groups — lower bandwidth requirements, works across nodes, higher latency but enables very large models (4–16 pipeline stages). For production serving, use TP within a node (up to 8× H100) and PP across nodes.