How Much Did It Cost to Train Major AI Models?

DeepSeek V3 (~$5.5M) cost roughly a tenth as much as GPT-4 (~$63M), a reduction driven by its mixture-of-experts architecture and an optimized training pipeline. LLaMA 3 70B cost ~$7.7M on 16,000 H100s. Gemini Ultra (~$200M) remains the most expensive training run reported to date.
| Model | Org | Year | GPUs | GPU-Hours | Compute Cost (Est.) |
|---|---|---|---|---|---|
| GPT-3 175B | OpenAI | 2020 | 10K V100s | ~3.1M | ~$4.6M |
| LLaMA 2 70B | Meta | 2023 | 2K A100s | ~1.7M | ~$2.1M |
| Mistral 7B | Mistral | 2023 | 512 H100s | ~200K | ~$240K |
| Mixtral 8×7B | Mistral | 2023 | 512 H100s | ~600K | ~$720K |
| Falcon 180B | TII | 2023 | 4K A100s | ~7.5M | ~$9M |
| GPT-4 (est.) | OpenAI | 2023 | ~25K A100s | ~50M | ~$63M |
| LLaMA 3 70B | Meta | 2024 | 16K H100s | ~6.4M | ~$7.7M |
| LLaMA 3 405B | Meta | 2024 | 16K H100s | ~30.8M | ~$37M |
| Claude 3 Opus (est.) | Anthropic | 2024 | ~30K H100s | ~60M | ~$72M |
| DeepSeek V3 | DeepSeek | 2024 | 2K H800s | ~2.7M | ~$5.5M |
| DeepSeek R1 | DeepSeek | 2025 | 2K H800s | ~2.7M | ~$5.5M |
| Gemini Ultra (est.) | Google | 2023 | ~32K TPUv4 | ~80M | ~$200M |

Note: Costs are estimates based on public disclosures and compute estimates. Actual costs vary based on negotiated contracts, energy, and cluster efficiency.

How Do You Calculate AI Training Cost?

Use the standard 6 × parameters × tokens compute approximation (popularized by the Chinchilla scaling-law work) to estimate training compute, then convert to GPU-hours and cost:

Total Cost = GPU_Hours × Hourly_GPU_Rate

GPU_Hours = (6 × Parameters × Tokens) / (GPU_FLOPS × MFU × 3600)

Where:
  Parameters = model parameter count
  Tokens     = training tokens
  GPU_FLOPS  = GPU peak FLOPS (BF16)
  MFU        = Model FLOP Utilization (typically 0.35–0.55)
  3600       = seconds per hour

Example: 7B model on 1T tokens with H100s (989 TFLOPS, MFU=0.45):
  GPU_Hours = (6 × 7e9 × 1e12) / (989e12 × 0.45 × 3600) ≈ 26,214
  With 256 H100s: 26,214 / 256 ≈ 102 hours (~4.3 days)
  Cost at $98.32/hr per 8-GPU AWS P5 instance (256 H100s = 32 instances): 32 × 102 × $98.32 ≈ $322K
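The formula and the worked example above can be wrapped in a short Python helper. The function name is illustrative, and the per-GPU-hour rate is an assumption: $12.29, i.e. the per-GPU share of an 8-GPU AWS p5.48xlarge instance at $98.32/hr.

```python
def training_cost(params, tokens, gpu_flops, mfu, num_gpus, rate_per_gpu_hour):
    """Estimate pretraining GPU-hours, wall-clock time, and compute cost
    using the 6 * N * D FLOPs approximation from the formula above."""
    total_flops = 6 * params * tokens                    # forward + backward
    gpu_hours = total_flops / (gpu_flops * mfu * 3600)   # seconds -> hours
    wall_clock_hours = gpu_hours / num_gpus
    cost = gpu_hours * rate_per_gpu_hour
    return gpu_hours, wall_clock_hours, cost

# Worked example: 7B model, 1T tokens, H100s at 989 TFLOPS BF16 peak,
# MFU 0.45, 256 GPUs, $98.32 per 8-GPU instance-hour (assumed rate).
gpu_hours, hours, cost = training_cost(
    params=7e9, tokens=1e12,
    gpu_flops=989e12, mfu=0.45,
    num_gpus=256, rate_per_gpu_hour=98.32 / 8,
)
print(f"{gpu_hours:,.0f} GPU-hours, {hours:.0f} h wall clock, ${cost:,.0f}")
# -> 26,214 GPU-hours, 102 h wall clock, $322,175
```

Swapping in a different GPU is just a matter of changing `gpu_flops` and `mfu` using the reference table below.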

Reference GPU FLOPS (BF16) and Typical MFU

| GPU | BF16 TFLOPS | Typical MFU |
|---|---|---|
| NVIDIA H100 SXM | 989 | 0.45–0.55 |
| NVIDIA H200 SXM | 989 | 0.45–0.55 |
| NVIDIA B200 SXM | 2,250 | 0.50–0.60 |
| NVIDIA A100 80GB SXM | 312 | 0.35–0.45 |
| AMD MI300X | 1,307 | 0.40–0.50 |
| Intel Gaudi 3 | 1,835 | 0.35–0.45 |
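The reference table can be used to compare GPU-hour requirements across accelerators for the same run. A minimal sketch, taking peak FLOPS from the table and assuming a mid-range MFU for each device:

```python
# BF16 peak FLOPS and an assumed mid-range MFU, per the table above.
GPUS = {
    "H100 SXM":      (989e12, 0.50),
    "B200 SXM":      (2250e12, 0.55),
    "A100 80GB SXM": (312e12, 0.40),
    "MI300X":        (1307e12, 0.45),
    "Gaudi 3":       (1835e12, 0.40),
}

def gpu_hours(params, tokens, peak_flops, mfu):
    """GPU-hours for a 6 * N * D training run at the given peak and MFU."""
    return 6 * params * tokens / (peak_flops * mfu * 3600)

# The same 7B-model, 1T-token run as the worked example, per accelerator.
for name, (peak, mfu) in GPUS.items():
    print(f"{name:>14}: {gpu_hours(7e9, 1e12, peak, mfu):>9,.0f} GPU-hours")
```

As expected, the A100 needs roughly 4× the GPU-hours of an H100 for the same run, and the B200 cuts the H100 figure by more than half.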

How Much Does Fine-Tuning an LLM Cost?

Fine-tuning costs vary enormously by method. QLoRA on a 70B model can cost as little as $196 for a small dataset, while full PPO RLHF training costs $990K+.

Instruction Fine-Tuning Costs (AWS P5 H100 rates)

| Base Model | Dataset | Method | GPUs | Duration | Cost |
|---|---|---|---|---|---|
| LLaMA 3 8B | 50K examples | Full FT | 8× H100 | ~3 hrs | ~$300 |
| LLaMA 3 8B | 1M examples | Full FT | 8× H100 | ~2 days | ~$4.7K |
| LLaMA 3 70B | 50K examples | Full FT | 64× H100 | ~4 hrs | ~$3.1K |
| LLaMA 3 70B | 1M examples | Full FT | 64× H100 | ~3 days | ~$57K |
| LLaMA 3 405B | 50K examples | Full FT | 256× H100 | ~8 hrs | ~$25K |

(Costs use the per-GPU share of the AWS P5 rate, ~$12.29/GPU-hour, i.e. $98.32/hr per 8-GPU p5.48xlarge instance.)

QLoRA (Parameter-Efficient) Fine-Tuning Costs

| Base Model | Dataset | Method | GPUs | Duration | Cost |
|---|---|---|---|---|---|
| LLaMA 3 8B | 50K examples | QLoRA | 1× A100 40GB | ~6 hrs | ~$9.70 |
| LLaMA 3 8B | 1M examples | QLoRA | 2× A100 80GB | ~18 hrs | ~$58 |
| LLaMA 3 70B | 50K examples | QLoRA | 2× H100 80GB | ~8 hrs | ~$196 |
| LLaMA 3 70B | 1M examples | QLoRA | 4× H100 80GB | ~3 days | ~$3.5K |
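Unlike pretraining, fine-tuning cost is usually quoted directly as GPU count × wall-clock hours × hourly rate. A sketch reproducing the 70B QLoRA rows, assuming the ~$12.29/GPU-hour H100 rate (the per-GPU share of an 8-GPU AWS p5.48xlarge at $98.32/hr):

```python
def finetune_cost(num_gpus, hours, rate_per_gpu_hour):
    """Fine-tuning cost: GPUs x wall-clock hours x hourly per-GPU rate."""
    return num_gpus * hours * rate_per_gpu_hour

# 70B QLoRA rows from the table above at the assumed $12.29/GPU-hour rate.
print(f"70B / 50K examples: ${finetune_cost(2, 8, 12.29):,.0f}")   # -> $197
print(f"70B / 1M examples:  ${finetune_cost(4, 72, 12.29):,.0f}")  # -> $3,540
```

The same function covers full fine-tuning; only the GPU count and duration change.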

Frequently Asked Questions

How much did it cost to train LLaMA 3 70B?

Meta trained LLaMA 3 70B using approximately 16,000 H100 GPUs for ~6.4 million GPU-hours, with an estimated compute cost of ~$7.7 million. This was trained on a 15T token dataset. A fine-tuning run using QLoRA (4× H100, 3 days, 1M examples) costs approximately $3,500.

How much did DeepSeek V3 cost to train compared to GPT-4?

DeepSeek V3 training is estimated at ~$5.5 million (2,000 H800 GPUs, ~2.7M GPU-hours). GPT-4 is estimated at ~$63 million (~25,000 A100 GPUs, ~50M GPU-hours). DeepSeek achieved ~10x cost reduction through mixture-of-experts architecture (671B total params, ~37B active), load-balanced routing, and FP8 mixed-precision training.

What percentage of AI training cost is GPU compute?

GPU compute is the largest cost driver at 60–75% of total training cost. Other costs: data collection and cleaning 10–20% (often underestimated), engineering labor 5–15%, storage and networking 2–5%, evaluation and iteration 5–10%, and energy (on-prem) 5–10% (~$0.06–$0.12/kWh).
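The share breakdown above can be used to gross a compute-only estimate up to an all-in project budget. A minimal sketch, where the function name and the default 67.5% share (midpoint of the 60–75% range) are assumptions:

```python
def total_project_cost(compute_cost, compute_share=0.675):
    """Gross up GPU compute cost to an all-in estimate, assuming compute
    is a given share (default: midpoint of the 60-75% range) of total."""
    return compute_cost / compute_share

# If GPU compute came to $5.5M and is ~67.5% of the budget:
print(f"${total_project_cost(5.5e6):,.0f}")  # -> roughly $8.1M all-in
```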

How have AI training costs changed over time?

H100 training cost per token has roughly halved from Q1 2023 to Q1 2026: from ~$5,200 per billion tokens at peak scarcity pricing (~$40+/hr per GPU) to ~$2,700 per billion tokens at Q1 2026 rates (~$22–25/hr effective). B200/GB200 deployments are projected to cut training cost per token by a further 3–5× thanks to higher FLOPS density and larger memory capacity.