
GCP Spot VMs: 60-91% off if you design for preemption

Spot VMs cost 60-91% less than on-demand for the same machine type, but they can be preempted at any time. Here's what belongs on Spot, how Spot differs from preemptible VMs, how Spot GKE node pools work, and how to design for preemption.

The C3X Team · 7 min read

Quick answer

GCP Spot VMs run 60-91% below on-demand for the same machine type, in exchange for being preemptible at any time with 30 seconds' notice. They're ideal for fault-tolerant, stateless, or checkpointable work — batch jobs, CI runners, pipelines, and Spot GKE node pools. Design for preemption (checkpoint, retry, spread across zones) and keep a small on-demand baseline for the parts that can't be interrupted.

Spot VMs are GCP's deepest compute discount and, for the right workload, close to free money: identical hardware, 60-91% off, with one condition — Google can take the VM back when it needs the capacity. The whole game is matching Spot to workloads that don't care if a node disappears.

The discount and the catch

Spot pricing is 60-91% below the on-demand rate for the same machine type (the exact figure varies by family and region). In return, Compute Engine can preempt the VM at any time with a 30-second warning. Spot VMs are the successor to preemptible VMs: same discount and same preemption behavior, but without the old 24-hour maximum runtime, so a Spot VM runs until it is preempted or you stop it.
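
In Terraform, the Spot choice comes down to the scheduling block on the instance. The following is a minimal sketch, assuming the hashicorp/google provider; the resource name, instance name, zone, machine type, and image are placeholders rather than values from a real project.

```hcl
# Sketch only: "batch_worker", the zone, machine type, and image are
# hypothetical placeholders.
resource "google_compute_instance" "batch_worker" {
  name         = "batch-worker-1"
  machine_type = "e2-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }

  scheduling {
    provisioning_model          = "SPOT" # Spot pricing instead of on-demand
    preemptible                 = true   # expected to be true when the model is SPOT
    automatic_restart           = false  # expected to be false when the model is SPOT
    instance_termination_action = "STOP" # what happens on preemption; "DELETE" is the alternative
  }
}
```

With "STOP", the disk survives a preemption and the instance can be started again later; "DELETE" suits workers that are fully disposable.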

What belongs on Spot

  • Batch and data pipelines: checkpoint progress so that after a preemption the job resumes from the last checkpoint.
  • CI/CD runners: a preempted build retries on a fresh node.
  • Rendering, ML training with checkpoints, ETL: long, interruptible, idempotent work.
  • Stateless web tiers behind a load balancer with an on-demand fallback group.

What doesn't belong: stateful singletons, databases, and latency-critical services without redundancy.

Spot on GKE

A GKE node pool with spot = true runs Spot VMs at the same discount, which makes it the standard pattern for fault-tolerant pods. Keep system and stateful pods on a regular pool, taint the Spot pool, and schedule interruptible workloads onto it with tolerations. This is the GCP analog of the AWS approach in "when spot instances save money" and of the Spot node pool in "AKS cost optimization".
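
As a rough sketch of that pattern in Terraform, a tainted Spot pool might look like the following. It assumes a cluster resource named google_container_cluster.primary defined elsewhere; the pool name, region, machine type, autoscaling limits, and taint key and value are all illustrative.

```hcl
# Sketch only: assumes google_container_cluster.primary exists elsewhere.
# Pool name, region, sizes, and the taint key/value are hypothetical.
resource "google_container_node_pool" "spot_pool" {
  name     = "spot-pool"
  cluster  = google_container_cluster.primary.id
  location = "us-central1"

  autoscaling {
    min_node_count = 0  # lets the pool shrink to nothing when idle
    max_node_count = 10 # cap on nodes per zone
  }

  node_config {
    machine_type = "e2-standard-4"
    spot         = true # nodes in this pool are Spot VMs

    # Keeps pods off these nodes unless they carry a matching toleration.
    taint {
      key    = "workload-type"
      value  = "interruptible"
      effect = "NO_SCHEDULE"
    }
  }
}
```

Interruptible workloads then need a toleration for the workload-type=interruptible taint. GKE also labels Spot nodes with cloud.google.com/gke-spot=true, which a nodeSelector can target.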

Design for preemption

  1. Checkpoint long-running jobs so a preemption costs minutes, not the whole run.
  2. Make workers idempotent and retryable.
  3. Spread across zones so a single-zone capacity crunch doesn't take everything.
  4. Use managed instance groups (MIGs) or GKE to auto-recreate preempted instances.
  5. Keep a small on-demand baseline for anything that must stay up; burst onto Spot.
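
On plain Compute Engine, steps 3 and 4 can be sketched as a regional managed instance group built from a Spot instance template. This is an illustration under assumptions, not a reference setup: the names, region, zones, machine type, image, and target size are placeholders.

```hcl
# Sketch only: names, region, zones, image, and target_size are hypothetical.
resource "google_compute_instance_template" "spot_worker" {
  name_prefix  = "spot-worker-"
  machine_type = "e2-standard-4"

  disk {
    source_image = "debian-cloud/debian-12"
    boot         = true
    auto_delete  = true
  }

  network_interface {
    network = "default"
  }

  scheduling {
    provisioning_model = "SPOT"
    preemptible        = true
    automatic_restart  = false
  }

  # Lets Terraform create a replacement template before removing the old one.
  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_region_instance_group_manager" "spot_burst" {
  name               = "spot-burst"
  region             = "us-central1"
  base_instance_name = "spot-worker"
  target_size        = 6 # the group works to keep this many instances running

  # Spread instances across zones so one zone's capacity crunch
  # does not take out the whole group.
  distribution_policy_zones = ["us-central1-a", "us-central1-b", "us-central1-c"]

  version {
    instance_template = google_compute_instance_template.spot_worker.id
  }
}
```

A second group built from a standard (non-Spot) template would serve as the on-demand baseline in step 5, with this group as the burst capacity.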

FAQ

How much do GCP Spot VMs save?

Spot VMs run 60-91% below on-demand pricing for the same machine type, with the exact discount varying by machine family and region. The tradeoff: Compute Engine can preempt (reclaim) them at any time with a 30-second warning when it needs the capacity back.

What's the difference between Spot VMs and preemptible VMs?

Spot VMs are the newer model and have no maximum runtime, while the older preemptible VMs were capped at 24 hours. Both can be preempted at any time and both carry the same deep discount. New workloads should use Spot; preemptible is effectively the legacy name.

What workloads are right for Spot VMs?

Fault-tolerant, stateless, or checkpointable work: batch processing, CI/CD runners, data pipelines, rendering, and stateless web tiers behind a load balancer with on-demand fallback. Anything that can lose a node and continue (or retry) is a fit. Avoid Spot for stateful singletons and latency-critical services without redundancy.

Can I use Spot VMs with GKE?

Yes. A GKE node pool with spot = true runs Spot VMs at the same 60-91% discount, ideal for fault-tolerant pods. Keep system and stateful pods on a standard pool and schedule interruptible workloads onto the Spot pool with taints/tolerations.

How do I handle preemption?

Design for it: checkpoint long jobs, make workers idempotent and retryable, spread across zones, and use managed instance groups or GKE to automatically recreate preempted nodes. A mix of a small on-demand baseline plus a Spot burst pool gives both reliability and savings.

How does C3X estimate Spot savings?

Spot prices float, so C3X prices a google_compute_instance at the on-demand rate as a conservative ceiling and flags Spot usage — your real Spot cost will be 60-91% lower. That keeps estimates from understating cost based on a discount that can change.

What to do next

Because Spot prices float, model the conservative case and treat the discount as upside. C3X prices google_compute_instance and google_container_node_pool resources at the on-demand ceiling and flags Spot usage, so estimates never understate cost because of a discount that can change. The quickstart runs it on your Terraform in minutes.
