# GPU Cost Optimization Playbook
How to reduce GPU cloud costs 30–90% using right-sizing, quantization, spot instances, reserved pricing, continuous batching, and multi-cloud arbitrage. Based on 2025–2026 pricing data across 12 cloud providers. Includes specific savings percentages and implementation guidance.
## What Are the Biggest Levers for Reducing GPU Cloud Costs?
| Technique | Potential Savings | Effort | Notes |
|---|---|---|---|
| Right-sizing GPU tier | 60–70% | Low | H100 → L4 for 7B models |
| FP8 quantization | ~50% | Low | 2× H100 → 1× H100 for 70B |
| INT4 quantization (AWQ) | ~75% | Medium | 75% VRAM reduction; 1–2% quality loss |
| Spot vs on-demand | 35–60% | Medium | H100: $1.87 vs $3.50/hr |
| Reserved vs on-demand | 40–46% | Low (commit) | AWS H100: $1.90–2.10/hr 1yr |
| Continuous batching | ~50% per-token | Medium | GPU util: 40% → 90%+ |
| Provider arbitrage | 36–52% | Medium | AWS $3.90 → RunPod/Vast $1.87 |
| MIG partitioning | Up to 7× density | Medium | 1 GPU → 7 workloads |
| Auto-shutdown idle | 33% monthly | Low | Scale to zero when idle |
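These levers compound multiplicatively rather than additively: each one reduces the cost that remains after the previous ones. A minimal sketch of that arithmetic (an illustrative model, not provider-verified pricing):

```python
def combined_savings(savings_fractions):
    """Estimate total savings when several cost levers stack.

    Savings compound multiplicatively: each lever cuts the cost
    remaining after the previous ones, so fractions cannot simply
    be summed.
    """
    remaining = 1.0
    for s in savings_fractions:
        remaining *= (1.0 - s)
    return 1.0 - remaining

# Example: FP8 quantization (~50%) + spot pricing (~40%) + continuous
# batching (~50% per token) leave 0.5 * 0.6 * 0.5 = 15% of the bill.
print(f"{combined_savings([0.50, 0.40, 0.50]):.0%}")  # → 85%
```

This is why stacking three "50%-class" levers yields roughly 85% savings, not 150%.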
## Which GPU Should I Actually Use for My Workload?
The most common and expensive mistake is defaulting to the most powerful GPU available. A team running a 7B model on H100s spends 4–6× more than necessary. Select a GPU based on model size, memory requirements, and whether the workload is memory-bound or compute-bound.
| Model Size | FP16 VRAM Needed | Recommended GPU | On-Demand Cost/hr | Use Case |
|---|---|---|---|---|
| ≤3B params | ≤6 GB | L4 (24 GB) | $0.50–0.80 | Cheapest inference |
| 7B–13B params | 14–28 GB | L4 or L40S (48 GB) | $0.50–1.20 | L4 wins on cost/token |
| 13B–30B params | 28–60 GB | A100 40GB or L40S | $1.20–2.00 | L40S for inference |
| 30B–70B params | 60–140 GB | A100 80GB or H100 | $1.90–3.90 | Consider INT4 to fit 1 card |
| 70B–130B params | 140–260 GB | H200 141GB or 2× H100 | $3.90–8.00 | H200 often cheaper overall |
| 130B+ params | >260 GB | Multi-H100/H200 cluster | $8.00+ | Evaluate quantization first |
**Key insight:** H100 SXM delivers 86% lower training cost per token vs A100 PCIe ($0.88 vs $6.32 per 10M tokens) despite higher hourly rate — because MFU is significantly higher on SXM configurations.
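The sizing rule in the table above can be sketched as a small helper. The ~20% overhead factor for activations and KV cache is a rule of thumb, not a measured constant:

```python
def fp16_vram_gb(params_billion, overhead=1.2):
    """Rough FP16 footprint: 2 bytes/param plus ~20% for activations
    and KV cache (overhead factor is an assumed rule of thumb)."""
    return params_billion * 2 * overhead

def recommend_gpu(params_billion):
    """Map model size to the cheapest viable tier from the table above."""
    need = fp16_vram_gb(params_billion)
    for vram, gpu in [(24, "L4"), (48, "L40S"),
                      (80, "A100 80GB / H100"), (141, "H200")]:
        if need <= vram:
            return gpu
    return "multi-GPU cluster (or quantize first)"

print(recommend_gpu(7))   # 7B → ~16.8 GB → fits an L4
print(recommend_gpu(70))  # 70B → ~168 GB → multi-GPU, or quantize
```

Running this for a 7B model points at the ~$0.50–0.80/hr L4 tier, not an H100 — exactly the right-sizing win described above.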
## How Much Can Quantization Save, and What Do I Lose?
Quantization is the highest-ROI optimization available. FP8 is production-ready with near-zero quality loss. INT4 AWQ reduces VRAM by 75% with only 1–2% perplexity increase.
| Method | Baseline | Optimized | Savings | Quality Loss | Effort |
|---|---|---|---|---|---|
| FP8 (vs FP16) | 2× H100 for 70B | 1× H100 (50% fewer GPUs) | ~50% | <1% perplexity | Low |
| INT8 W8A8 (vs FP16) | 2× H100 for 70B | 1× H100 | ~50% | ~1% perplexity | Low |
| INT4 GPTQ (vs FP16) | 2× A100 80GB | 1× A100 40GB | ~75% | 2–3% perplexity | Medium |
| INT4 AWQ (vs FP16) | 2× A100 80GB | 1× A100 40GB | ~75% | 1–2% perplexity | Medium |
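The memory arithmetic behind the table rows is simple: bits per parameter, divided by 8, times parameter count. A sketch, with an assumed ~10% overhead for quantization scales and runtime buffers:

```python
def quantized_vram_gb(params_billion, bits, overhead=1.1):
    """Weight memory at a given precision (bits per parameter), plus a
    small allowance for scales/zero-points and runtime buffers.
    The overhead factor is an assumption, not a measured value."""
    return params_billion * (bits / 8) * overhead

for bits, name in [(16, "FP16"), (8, "FP8/INT8"), (4, "INT4 AWQ/GPTQ")]:
    print(f"70B @ {name}: ~{quantized_vram_gb(70, bits):.0f} GB")
```

For a 70B model this reproduces the table: FP16 needs ~154 GB (2× H100), FP8/INT8 fits in one 80 GB H100, and INT4 drops under 40 GB.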
## How Do I Use Spot Instances Without Losing Training Progress?
Spot/preemptible GPU instances save 35–60% vs on-demand but can be interrupted with 2-minute notice. The key is checkpointing every 30–60 minutes so training can resume from the last checkpoint.
- Always checkpoint to persistent storage (S3, GCS, NFS) — not local disk
- Set checkpoint frequency: every 1,000–5,000 steps or every 30–60 minutes
- Use fault-tolerant training frameworks: PyTorch Elastic, TorchX, Ray Train
- Do not use spot for inference — user-facing latency is unpredictable with interruptions
- Monitor spot price trends before starting long runs; avoid peak demand windows
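The checkpoint-and-resume pattern above can be sketched framework-agnostically. This toy version uses a local temp file and a stand-in train step (in practice the checkpoint path would be S3/GCS/NFS and the state would hold model/optimizer weights); the atomic-rename trick is the one detail worth copying:

```python
import pickle
import tempfile
from pathlib import Path

CKPT = Path(tempfile.gettempdir()) / "train_ckpt.pkl"  # use S3/GCS/NFS in practice
CHECKPOINT_EVERY = 1000  # steps; tune toward every 30-60 min of wall time
CKPT.unlink(missing_ok=True)  # start fresh for this demo

def save_ckpt(state):
    tmp = CKPT.with_suffix(".tmp")
    tmp.write_bytes(pickle.dumps(state))
    tmp.replace(CKPT)  # atomic rename: a mid-write preemption can't corrupt it

def load_ckpt():
    if CKPT.exists():
        return pickle.loads(CKPT.read_bytes())
    return {"step": 0, "loss": None}

def train(max_steps, interrupt_at=None):
    state = load_ckpt()  # resume from the last checkpoint if one exists
    while state["step"] < max_steps:
        if state["step"] == interrupt_at:
            return state  # simulate a spot preemption mid-run
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real train step
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_ckpt(state)
    return state

train(5000, interrupt_at=3500)  # "preempted" at step 3500
state = train(5000)             # resumes from step 3000, not step 0
print(state["step"])            # → 5000
```

With a real framework the same shape applies: PyTorch Elastic and Ray Train handle the restart side, but the periodic save to persistent storage is still your job.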
| GPU | On-Demand Avg | Spot Avg | Typical Savings | Provider |
|---|---|---|---|---|
| H100 SXM 80GB | $3.50 | $2.25 | 35–40% | AWS/GCP |
| H100 SXM 80GB | $2.49 | $1.89 | 24% | RunPod |
| A100 SXM 80GB | $2.50 | $1.40 | 40–45% | AWS/GCP |
| L40S 48GB | $0.79 | $0.40 | ~49% | RunPod |
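Interruptions make spot slightly more expensive than the sticker price, because each preemption wastes the progress since the last checkpoint plus restart time. A rough model of the effective rate (the interruption rate and restart overhead are illustrative assumptions, not provider statistics):

```python
def effective_spot_cost(spot_price, interruptions_per_day=2,
                        checkpoint_interval_min=45):
    """Effective $/useful-hour on spot: each interruption wastes, on
    average, half a checkpoint interval of progress plus restart time.
    All overhead parameters here are assumptions for illustration."""
    restart_min = 10  # reprovision + reload checkpoint (assumed)
    wasted_min = interruptions_per_day * (checkpoint_interval_min / 2 + restart_min)
    useful_fraction = 1 - wasted_min / (24 * 60)
    return spot_price / useful_fraction

# H100 spot at $2.25/hr with 2 interruptions/day still beats $3.50 on-demand:
print(round(effective_spot_cost(2.25), 2))  # → 2.36
```

Even with the overhead priced in, spot stays well below on-demand — which is why the real requirement is checkpoint hygiene, not interruption avoidance.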
## Frequently Asked Questions

### What is the single best way to reduce GPU cloud costs?
The single best lever is GPU right-sizing: running a 7B model on an L4 ($0.50–0.80/hr) instead of H100 ($3.50/hr) saves 60–70% with no quality loss. Most teams overprovision by 2–4× because they default to the most powerful hardware available. Always match GPU to model size and workload type first.
### When does a reserved GPU instance pay off?
Reserved instances pay off when you need the same GPU for a predictable workload over 12+ months at 50%+ average utilization. 1-year commits typically save 25–50% depending on provider; 3-year commits save 40–60%. Example: AWS H100 with a 1-year reservation runs ~$1.90–2.10/hr vs $3.93/hr on-demand, a 47–52% saving. Break-even vs on-demand typically arrives after 3–4 months of continuous usage.
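The "50%+ utilization" rule falls straight out of the price ratio: a commit billed around the clock only wins if you actually use the hours. A sketch (assuming the reserved rate is billed continuously and on-demand only for hours used):

```python
def breakeven_utilization(on_demand_hr, reserved_hr):
    """Minimum average utilization at which a reserved commitment
    (billed 24/7) is cheaper than paying on-demand per hour used."""
    return reserved_hr / on_demand_hr

# AWS H100 example from the text: ~$2.00/hr reserved vs $3.93/hr on-demand
u = breakeven_utilization(3.93, 2.00)
print(f"{u:.0%}")  # → 51%
```

Below ~51% average utilization, on-demand (or spot) is cheaper despite the worse hourly rate.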
### How does continuous batching reduce per-token inference cost?
Continuous batching (aka in-flight batching, supported by vLLM and TensorRT-LLM) processes multiple requests simultaneously, improving GPU utilization from ~40% to 90%+ — reducing cost per token by approximately 50%. Without batching, a GPU sits idle between tokens; continuous batching fills those gaps with other requests. Batch size of 32 reduces per-token cost by ~85% vs single-request inference.
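The per-token cost effect is just hourly price divided by sustained throughput. A sketch using assumed throughput numbers (illustrative, not benchmarks):

```python
def cost_per_million_tokens(gpu_price_hr, tokens_per_sec):
    """Serving cost per 1M generated tokens at a sustained throughput."""
    return gpu_price_hr / (tokens_per_sec * 3600) * 1_000_000

# Illustrative H100 at $3.50/hr: continuous batching roughly doubles
# sustained throughput (~40% -> 90%+ utilization; numbers assumed).
single = cost_per_million_tokens(3.50, 400)
batched = cost_per_million_tokens(3.50, 800)
print(round(single, 2), round(batched, 2), f"{1 - batched / single:.0%} cheaper")
```

Because the GPU's hourly price is fixed, any throughput gain translates one-for-one into lower cost per token — doubling throughput halves the cost.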
### Is it worth shopping across multiple GPU cloud providers?
Yes — there is a more than 3× price spread for identical H100 hardware across providers in 2026: $1.87/hr (Vast.ai market) to ~$6.15/hr (CoreWeave). For non-SLA-critical workloads (training, batch inference), switching from AWS P5 ($3.93/hr) to Lambda Labs ($2.99/hr) or RunPod ($2.49/hr) saves 24–37% with minimal engineering effort. The main tradeoff is fewer enterprise features and potentially less reliable SLAs.
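The arbitrage decision reduces to a filtered price lookup. A sketch using the H100 rates quoted above; the SLA flags are assumptions for illustration, not statements about these providers' actual contract terms:

```python
H100_PRICING = {  # $/hr, on-demand rates quoted in the text
    "AWS P5":      {"price": 3.93, "enterprise_sla": True},
    "CoreWeave":   {"price": 6.15, "enterprise_sla": True},
    "Lambda Labs": {"price": 2.99, "enterprise_sla": False},
    "RunPod":      {"price": 2.49, "enterprise_sla": False},
    "Vast.ai":     {"price": 1.87, "enterprise_sla": False},
}

def cheapest_provider(require_sla=False):
    """Pick the lowest-priced provider, optionally filtering on SLA needs.
    (SLA flags here are illustrative assumptions.)"""
    eligible = {name: info for name, info in H100_PRICING.items()
                if info["enterprise_sla"] or not require_sla}
    return min(eligible, key=lambda name: eligible[name]["price"])

print(cheapest_provider())                  # → Vast.ai (batch/training work)
print(cheapest_provider(require_sla=True))  # → AWS P5 (SLA-critical serving)
```

The same structure extends naturally to per-GPU-type tables and to weighting price against interruption risk or egress fees.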