What Are the Biggest Levers for Reducing GPU Cloud Costs?

These techniques compound. Right-sizing, quantization, and continuous batching together can cut cost per token by 80–90% compared with a naive deployment running FP16 on an oversized GPU at 40% utilization. Start with the highest-savings, lowest-effort techniques.
| Technique | Potential Savings | Effort | Notes |
|---|---|---|---|
| Right-sizing GPU tier | 60–70% | Low | H100 → L4 for 7B models |
| FP8 quantization | ~50% | Low | 2× H100 → 1× H100 for 70B |
| INT4 quantization (AWQ) | ~75% | Medium | 75% VRAM reduction; 1–2% quality loss |
| Spot vs on-demand | 45–60% | Medium | H100: $1.87 vs $3.50/hr |
| Reserved vs on-demand | 40–46% | Low (commit) | AWS H100: $1.90–2.10/hr, 1-yr term |
| Continuous batching | ~50% per token | Medium | GPU utilization: 40% → 90%+ |
| Provider arbitrage | 36–52% | Medium | AWS $3.90 → RunPod/Vast.ai $1.87 |
| MIG partitioning | Up to 7× density | Medium | 1 GPU → 7 isolated workloads |
| Auto-shutdown when idle | 33% monthly | Low | Scale to zero when idle |
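As a sketch of how these levers compound: treating each saving as an independent multiplier on remaining cost (an approximation — in practice the levers interact, which is why the text quotes a range rather than a point figure):

```python
def compound_savings(savings_fractions):
    """Combine independent cost reductions multiplicatively.

    Each entry is the fraction saved by one lever (e.g. 0.60 for 60%).
    """
    remaining = 1.0
    for s in savings_fractions:
        remaining *= 1.0 - s
    return 1.0 - remaining

# Right-sizing (60%) + quantization (50%) + continuous batching (50%):
total = compound_savings([0.60, 0.50, 0.50])
print(f"{total:.0%}")  # → 90%
```

Three levers at 50–60% each land at the top of the 80–90% range quoted above.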

Which GPU Should I Actually Use for My Workload?

The most common and most expensive mistake is defaulting to the most powerful GPU available. A team running a 7B model on H100s is spending 4–6× more than necessary. Select a GPU based on model size, memory requirements, and whether the workload is memory-bound or compute-bound.

| Model Size | FP16 VRAM Needed | Recommended GPU | On-Demand Cost/hr | Use Case |
|---|---|---|---|---|
| ≤3B params | ≤6 GB | L4 (24 GB) | $0.50–0.80 | Cheapest inference |
| 7B–13B params | 14–28 GB | L4 or L40S (48 GB) | $0.50–1.20 | L4 wins on cost/token |
| 13B–30B params | 28–60 GB | A100 40 GB or L40S | $1.20–2.00 | L40S for inference |
| 30B–70B params | 60–140 GB | A100 80 GB or H100 | $1.90–3.90 | Consider INT4 to fit 1 card |
| 70B–130B params | 140–260 GB | H200 141 GB or 2× H100 | $3.90–8.00 | H200 often cheaper overall |
| 130B+ params | >260 GB | Multi-H100/H200 cluster | $8.00+ | Evaluate quantization first |
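The VRAM column follows directly from the weight footprint: parameters × bits / 8 bytes (FP16 is 2 bytes per parameter; KV cache and activations add overhead on top, which the ranges above absorb). A minimal sketch:

```python
def weight_vram_gb(params_billions, bits=16):
    """VRAM needed for model weights alone, in GB (using 1 GB = 1e9 bytes)."""
    return params_billions * bits / 8

print(weight_vram_gb(7))           # 7B in FP16 → 14.0 GB (fits an L4 24 GB)
print(weight_vram_gb(70))          # 70B in FP16 → 140.0 GB (needs 2× 80 GB cards)
print(weight_vram_gb(70, bits=4))  # 70B in INT4 → 35.0 GB (fits one A100 40 GB)
```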

Key insight: H100 SXM delivers 86% lower training cost per token vs A100 PCIe ($0.88 vs $6.32 per 10M tokens) despite higher hourly rate — because MFU is significantly higher on SXM configurations.
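This works because cost per token divides the hourly rate by realized throughput, so higher MFU can more than offset a higher sticker price. Checking the arithmetic behind the figures quoted above:

```python
def cost_per_10m_tokens(hourly_rate, tokens_per_sec):
    """Training cost per 10M tokens given realized throughput."""
    return hourly_rate / (tokens_per_sec * 3600) * 10_000_000

# Savings implied by the quoted per-10M-token figures ($0.88 vs $6.32):
savings = 1 - 0.88 / 6.32
print(f"{savings:.0%}")  # → 86%
```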

How Much Can Quantization Save, and What Do I Lose?

Quantization is the highest-ROI optimization available. FP8 is production-ready with near-zero quality loss. INT4 AWQ reduces VRAM by 75% with only 1–2% perplexity increase.

| Method | Baseline | Optimized | Savings | Quality Loss | Effort |
|---|---|---|---|---|---|
| FP8 (vs FP16) | 2× H100 for 70B | 1× H100 (50% fewer GPUs) | ~50% | <1% perplexity | Low |
| INT8 W8A8 (vs FP16) | 2× H100 for 70B | 1× H100 | ~50% | ~1% perplexity | Low |
| INT4 GPTQ (vs FP16) | 2× A100 80 GB | 1× A100 40 GB | ~75% | 2–3% perplexity | Medium |
| INT4 AWQ (vs FP16) | 2× A100 80 GB | 1× A100 40 GB | ~75% | 1–2% perplexity | Medium |
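The GPU-count reductions in the table fall out of the per-parameter byte widths (FP16 = 2 bytes, FP8/INT8 = 1, INT4 = 0.5). A sketch assuming 80 GB cards and counting weights only:

```python
import math

BITS = {"fp16": 16, "fp8": 8, "int8": 8, "int4": 4}

def gpus_needed(params_billions, precision, gpu_vram_gb=80):
    """Minimum GPUs to hold the weights (KV cache/activations not counted)."""
    weight_gb = params_billions * BITS[precision] / 8
    return math.ceil(weight_gb / gpu_vram_gb)

print(gpus_needed(70, "fp16"))  # → 2  (140 GB of weights)
print(gpus_needed(70, "fp8"))   # → 1  (70 GB of weights)
```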

How Do I Use Spot Instances Without Losing Training Progress?

Spot/preemptible GPU instances save 35–60% vs on-demand but can be interrupted with 2-minute notice. The key is checkpointing every 30–60 minutes so training can resume from the last checkpoint.

  • Always checkpoint to persistent storage (S3, GCS, NFS) — not local disk
  • Set checkpoint frequency: every 1,000–5,000 steps or every 30–60 minutes
  • Use fault-tolerant training frameworks: PyTorch Elastic, TorchX, Ray Train
  • Do not use spot for inference — user-facing latency is unpredictable with interruptions
  • Monitor spot price trends before starting long runs; avoid peak demand windows
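The checkpointing pattern above can be sketched framework-agnostically (the step interval is a placeholder; in a real run the state dict would hold model/optimizer state and the directory would be a mounted S3/GCS/NFS path, not local disk):

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 1_000  # steps; tune toward every 30–60 minutes of wall time

def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> None:
    """Write via temp file + rename so a preemption mid-write never leaves
    a torn checkpoint that resume logic could pick up."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    tmp = ckpt_dir / f"step_{step:09d}.json.tmp"
    tmp.write_text(json.dumps({"step": step, "state": state}))
    tmp.rename(ckpt_dir / f"step_{step:09d}.json")

def load_latest(ckpt_dir: Path):
    """Resume point: the highest-numbered complete checkpoint, or None."""
    ckpts = sorted(ckpt_dir.glob("step_*.json"))
    return json.loads(ckpts[-1].read_text()) if ckpts else None
```

Zero-padded step numbers keep lexicographic sort equal to numeric order; after an interruption the replacement worker calls `load_latest` and continues from that step.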

| GPU | On-Demand Avg | Spot Avg | Typical Savings | Provider |
|---|---|---|---|---|
| H100 SXM 80 GB | $3.50 | $2.25 | 35–40% | AWS/GCP |
| H100 SXM 80 GB | $2.49 | $1.89 | 24% | RunPod |
| A100 SXM 80 GB | $2.50 | $1.40 | 40–45% | AWS/GCP |
| L40S 48 GB | $0.79 | $0.40 | ~49% | RunPod |

Frequently Asked Questions

What is the single best way to reduce GPU cloud costs?

The single best lever is GPU right-sizing: running a 7B model on an L4 ($0.50–0.80/hr) instead of H100 ($3.50/hr) saves 60–70% with no quality loss. Most teams overprovision by 2–4× because they default to the most powerful hardware available. Always match GPU to model size and workload type first.

When does a reserved GPU instance pay off?

Reserved instances pay off when you need the same GPU for a predictable workload for 12+ months with 50%+ average utilization. 1-year commits save 25–40%; 3-year commits save 40–60%. AWS H100 with 1-year reservation: ~$1.90–2.10/hr vs $3.93/hr on-demand (47–53% savings). Break-even vs on-demand: typically 3–4 months of continuous usage.
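Another way to frame the break-even: a reservation bills every hour whether or not you use the GPU, so it pays off once expected utilization exceeds the ratio of the reserved rate to the on-demand rate. Using the figures above:

```python
def breakeven_utilization(reserved_rate, on_demand_rate):
    """Fraction of hours you must actually use the GPU for the commit to win."""
    return reserved_rate / on_demand_rate

u = breakeven_utilization(1.90, 3.93)  # AWS H100 1-yr vs on-demand
print(f"{u:.0%}")  # → 48%
```

That ~48% threshold is where the "50%+ average utilization" rule of thumb comes from.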

How does continuous batching reduce per-token inference cost?

Continuous batching (aka in-flight batching, supported by vLLM and TensorRT-LLM) processes multiple requests simultaneously, improving GPU utilization from ~40% to 90%+ — reducing cost per token by approximately 50%. Without batching, a GPU sits idle between tokens; continuous batching fills those gaps with other requests. Batch size of 32 reduces per-token cost by ~85% vs single-request inference.
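The utilization → cost relationship can be sketched directly. The 1,000 tok/s peak throughput below is an assumed illustrative number, not a benchmark; with it, going from 40% to 90% utilization cuts per-token cost by a factor of 2.25, in the ballpark of the ~50% figure above:

```python
def cost_per_million_tokens(hourly_rate, peak_tokens_per_sec, utilization):
    """$ per 1M generated tokens at a given fraction of peak throughput."""
    realized = peak_tokens_per_sec * utilization
    return hourly_rate / (realized * 3600) * 1_000_000

naive = cost_per_million_tokens(3.50, 1000, 0.40)    # ≈ $2.43 / M tokens
batched = cost_per_million_tokens(3.50, 1000, 0.90)  # ≈ $1.08 / M tokens
print(f"{1 - batched / naive:.0%} cheaper per token")
```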

Is it worth shopping across multiple GPU cloud providers?

Yes — there is a nearly 5× price spread for identical H100 hardware across providers in 2026: $1.87/hr (Vast.ai market) to ~$6.15/hr (CoreWeave). For non-SLA-critical workloads (training, batch inference), switching from AWS P5 ($3.93/hr) to Lambda Labs ($2.99/hr) or RunPod ($2.49/hr) saves 24–37% with minimal engineering effort. The main tradeoff is fewer enterprise features and potentially less reliable SLAs.