What Are the Biggest Levers for Reducing GPU Cloud Costs?

These techniques compound. Right-sizing, quantization, and continuous batching together can cut cost per token by 80–90% compared with a naive deployment running FP16 on an oversized GPU at 40% utilization. Start with the highest-savings, lowest-effort techniques.
| Technique | Potential Savings | Effort | Notes |
|---|---|---|---|
| Right-sizing GPU tier | 60–70% | Low | H100 → L4 for 7B models |
| FP8 quantization | ~50% | Low | 2× H100 → 1× H100 for 70B |
| INT4 quantization (AWQ) | ~75% | Medium | 75% VRAM reduction; 1–2% quality loss |
| Spot vs on-demand | 45–60% | Medium | H100: $1.87 vs $3.50/hr |
| Reserved vs on-demand | 40–46% | Low (commit) | AWS H100: $1.90–2.10/hr, 1-yr term |
| Continuous batching | ~50% per token | Medium | GPU utilization: 40% → 90%+ |
| Provider arbitrage | 36–52% | Medium | AWS $3.90 → RunPod/Vast.ai $1.87 |
| MIG partitioning | Up to 7× density | Medium | 1 GPU → 7 isolated workloads |
| Auto-shutdown when idle | 33% monthly | Low | Scale to zero when idle |
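As a sketch of how these levers compound: treating each saving as an independent multiplier on remaining cost (an approximation — in practice the levers interact, which is why the text quotes a range rather than a point figure):

```python
def compound_savings(savings_fractions):
    """Combine independent cost reductions multiplicatively.

    Each entry is the fraction saved by one lever (e.g. 0.60 for 60%).
    """
    remaining = 1.0
    for s in savings_fractions:
        remaining *= 1.0 - s
    return 1.0 - remaining

# Right-sizing (60%) + quantization (50%) + continuous batching (50%):
total = compound_savings([0.60, 0.50, 0.50])
print(f"{total:.0%}")  # → 90%
```

Three levers at 50–60% each land at the top of the 80–90% range quoted above.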

Which GPU Should I Actually Use for My Workload?

The most common and most expensive mistake is defaulting to the most powerful GPU available. A team running a 7B model on H100s is spending 4–6× more than necessary. Select a GPU based on model size, memory requirements, and whether the workload is memory-bound or compute-bound.

| Model Size | FP16 VRAM Needed | Recommended GPU | On-Demand Cost/hr | Use Case |
|---|---|---|---|---|
| ≤3B params | ≤6 GB | L4 (24 GB) | $0.50–0.80 | Cheapest inference |
| 7B–13B params | 14–28 GB | L4 or L40S (48 GB) | $0.50–1.20 | L4 wins on cost/token |
| 13B–30B params | 28–60 GB | A100 40 GB or L40S | $1.20–2.00 | L40S for inference |
| 30B–70B params | 60–140 GB | A100 80 GB or H100 | $1.90–3.90 | Consider INT4 to fit 1 card |
| 70B–130B params | 140–260 GB | H200 141 GB or 2× H100 | $3.90–8.00 | H200 often cheaper overall |
| 130B+ params | >260 GB | Multi-H100/H200 cluster | $8.00+ | Evaluate quantization first |
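The VRAM column follows directly from the weight footprint: parameters × bits / 8 bytes (FP16 is 2 bytes per parameter; KV cache and activations add overhead on top, which the ranges above absorb). A minimal sketch:

```python
def weight_vram_gb(params_billions, bits=16):
    """VRAM needed for model weights alone, in GB (using 1 GB = 1e9 bytes)."""
    return params_billions * bits / 8

print(weight_vram_gb(7))           # 7B in FP16 → 14.0 GB (fits an L4 24 GB)
print(weight_vram_gb(70))          # 70B in FP16 → 140.0 GB (needs 2× 80 GB cards)
print(weight_vram_gb(70, bits=4))  # 70B in INT4 → 35.0 GB (fits one A100 40 GB)
```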

Key insight: H100 SXM delivers 86% lower training cost per token vs A100 PCIe ($0.88 vs $6.32 per 10M tokens) despite higher hourly rate — because MFU is significantly higher on SXM configurations.
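This works because cost per token divides the hourly rate by realized throughput, so higher MFU can more than offset a higher sticker price. Checking the arithmetic behind the figures quoted above:

```python
def cost_per_10m_tokens(hourly_rate, tokens_per_sec):
    """Training cost per 10M tokens given realized throughput."""
    return hourly_rate / (tokens_per_sec * 3600) * 10_000_000

# Savings implied by the quoted per-10M-token figures ($0.88 vs $6.32):
savings = 1 - 0.88 / 6.32
print(f"{savings:.0%}")  # → 86%
```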

How Much Can Quantization Save, and What Do I Lose?

Quantization is the highest-ROI optimization available. FP8 is production-ready with near-zero quality loss. INT4 AWQ reduces VRAM by 75% with only 1–2% perplexity increase.

| Method | Baseline | Optimized | Savings | Quality Loss | Effort |
|---|---|---|---|---|---|
| FP8 (vs FP16) | 2× H100 for 70B | 1× H100 (50% fewer GPUs) | ~50% | <1% perplexity | Low |
| INT8 W8A8 (vs FP16) | 2× H100 for 70B | 1× H100 | ~50% | ~1% perplexity | Low |
| INT4 GPTQ (vs FP16) | 2× A100 80 GB | 1× A100 40 GB | ~75% | 2–3% perplexity | Medium |
| INT4 AWQ (vs FP16) | 2× A100 80 GB | 1× A100 40 GB | ~75% | 1–2% perplexity | Medium |
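The GPU-count reductions in the table fall out of the per-parameter byte widths (FP16 = 2 bytes, FP8/INT8 = 1, INT4 = 0.5). A sketch assuming 80 GB cards and counting weights only:

```python
import math

BITS = {"fp16": 16, "fp8": 8, "int8": 8, "int4": 4}

def gpus_needed(params_billions, precision, gpu_vram_gb=80):
    """Minimum GPUs to hold the weights (KV cache/activations not counted)."""
    weight_gb = params_billions * BITS[precision] / 8
    return math.ceil(weight_gb / gpu_vram_gb)

print(gpus_needed(70, "fp16"))  # → 2  (140 GB of weights)
print(gpus_needed(70, "fp8"))   # → 1  (70 GB of weights)
```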

How Do I Use Spot Instances Without Losing Training Progress?

Spot/preemptible GPU instances save 35–60% vs on-demand but can be interrupted with 2-minute notice. The key is checkpointing every 30–60 minutes so training can resume from the last checkpoint.

  • Always checkpoint to persistent storage (S3, GCS, NFS) — not local disk
  • Set checkpoint frequency: every 1,000–5,000 steps or every 30–60 minutes
  • Use fault-tolerant training frameworks: PyTorch Elastic, TorchX, Ray Train
  • Do not use spot for inference — user-facing latency is unpredictable with interruptions
  • Monitor spot price trends before starting long runs; avoid peak demand windows
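The checkpointing pattern above can be sketched framework-agnostically (the step interval is a placeholder; in a real run the state dict would hold model/optimizer state and the directory would be a mounted S3/GCS/NFS path, not local disk):

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 1_000  # steps; tune toward every 30–60 minutes of wall time

def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> None:
    """Write via temp file + rename so a preemption mid-write never leaves
    a torn checkpoint that resume logic could pick up."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    tmp = ckpt_dir / f"step_{step:09d}.json.tmp"
    tmp.write_text(json.dumps({"step": step, "state": state}))
    tmp.rename(ckpt_dir / f"step_{step:09d}.json")

def load_latest(ckpt_dir: Path):
    """Resume point: the highest-numbered complete checkpoint, or None."""
    ckpts = sorted(ckpt_dir.glob("step_*.json"))
    return json.loads(ckpts[-1].read_text()) if ckpts else None
```

Zero-padded step numbers keep lexicographic sort equal to numeric order; after an interruption the replacement worker calls `load_latest` and continues from that step.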

| GPU | On-Demand Avg | Spot Avg | Typical Savings | Provider |
|---|---|---|---|---|
| H100 SXM 80 GB | $3.50 | $2.25 | 35–40% | AWS/GCP |
| H100 SXM 80 GB | $2.49 | $1.89 | 24% | RunPod |
| A100 SXM 80 GB | $2.50 | $1.40 | 40–45% | AWS/GCP |
| L40S 48 GB | $0.79 | $0.40 | ~49% | RunPod |

Frequently Asked Questions

What is the single best way to reduce GPU cloud costs?

The single best lever is GPU right-sizing: running a 7B model on an L4 ($0.50–0.80/hr) instead of H100 ($3.50/hr) saves 60–70% with no quality loss. Most teams overprovision by 2–4× because they default to the most powerful hardware available. Always match GPU to model size and workload type first.

When does a reserved GPU instance pay off?

Reserved instances pay off when you need the same GPU for a predictable workload for 12+ months with 50%+ average utilization. 1-year commits save 25–40%; 3-year commits save 40–60%. AWS H100 with 1-year reservation: ~$1.90–2.10/hr vs $3.93/hr on-demand (47–53% savings). Break-even vs on-demand: typically 3–4 months of continuous usage.
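Another way to frame the break-even: a reservation bills every hour whether or not you use the GPU, so it pays off once expected utilization exceeds the ratio of the reserved rate to the on-demand rate. Using the figures above:

```python
def breakeven_utilization(reserved_rate, on_demand_rate):
    """Fraction of hours you must actually use the GPU for the commit to win."""
    return reserved_rate / on_demand_rate

u = breakeven_utilization(1.90, 3.93)  # AWS H100 1-yr vs on-demand
print(f"{u:.0%}")  # → 48%
```

That ~48% threshold is where the "50%+ average utilization" rule of thumb comes from.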

How does continuous batching reduce per-token inference cost?

Continuous batching (aka in-flight batching, supported by vLLM and TensorRT-LLM) processes multiple requests simultaneously, improving GPU utilization from ~40% to 90%+ — reducing cost per token by approximately 50%. Without batching, a GPU sits idle between tokens; continuous batching fills those gaps with other requests. Batch size of 32 reduces per-token cost by ~85% vs single-request inference.
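The utilization → cost relationship can be sketched directly. The 1,000 tok/s peak throughput below is an assumed illustrative number, not a benchmark; with it, going from 40% to 90% utilization cuts per-token cost by a factor of 2.25, in the ballpark of the ~50% figure above:

```python
def cost_per_million_tokens(hourly_rate, peak_tokens_per_sec, utilization):
    """$ per 1M generated tokens at a given fraction of peak throughput."""
    realized = peak_tokens_per_sec * utilization
    return hourly_rate / (realized * 3600) * 1_000_000

naive = cost_per_million_tokens(3.50, 1000, 0.40)    # ≈ $2.43 / M tokens
batched = cost_per_million_tokens(3.50, 1000, 0.90)  # ≈ $1.08 / M tokens
print(f"{1 - batched / naive:.0%} cheaper per token")
```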

Is it worth shopping across multiple GPU cloud providers?

Yes — there is a nearly 5× price spread for identical H100 hardware across providers in 2026: $1.87/hr (Vast.ai market) to ~$6.15/hr (CoreWeave). For non-SLA-critical workloads (training, batch inference), switching from AWS P5 ($3.93/hr) to Lambda Labs ($2.99/hr) or RunPod ($2.49/hr) saves 24–37% with minimal engineering effort. The main tradeoff is fewer enterprise features and potentially less reliable SLAs.