AWS Spot instances: when do they actually save money?

Spot instances offer 50-90% discounts but can be reclaimed with 2 minutes' notice. Here's when they're worth it, how to handle interruptions gracefully, and how to combine Spot with Savings Plans for the lowest total cost.

The C3X Team · 10 min read

Quick answer

Spot instances are 50-90% cheaper than on-demand but can be interrupted with 2 minutes' notice. They're right for stateless, interruption-tolerant workloads: CI runners, batch jobs, web services behind a load balancer, container workloads. They're wrong for stateful single-instance workloads like databases. Modern AWS patterns (an Auto Scaling Group with a mixed_instances_policy, EKS with Karpenter, ECS capacity providers) handle Spot diversification automatically. Mix Spot with on-demand or Savings Plan-covered instances for baseline reliability.

Spot instances offer the deepest discounts in AWS, up to 90% off on-demand, but the discount is conditional. AWS can reclaim the capacity with 2 minutes' notice, typically when demand for on-demand instances rises. For workloads that can tolerate interruption, Spot is a massive cost saver. For workloads that can't, the savings aren't worth the risk. This post walks through where Spot makes sense, where it doesn't, and how to implement it correctly.

How Spot pricing actually works

Spot instances are spare EC2 capacity AWS sells at a discount. Pricing is set by AWS (not a real-time auction since 2017) and changes gradually based on supply and demand. Each instance type in each AZ has its own Spot price, typically 50-90% below on-demand.

When AWS needs capacity back for a given instance type in an Availability Zone, usually because on-demand demand there spikes, it reclaims Spot capacity from that pool. You get a 2-minute warning, then the instance is terminated. The warning is delivered through:

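  • EC2 instance metadata service (the canonical source), at the spot/instance-action path
  • Amazon EventBridge (formerly CloudWatch Events), as an "EC2 Spot Instance Interruption Warning" event
  • Whatever you route that EventBridge event to, such as SNS, SQS, or Lambda (CloudWatch has no built-in alarm for the notice)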

Two minutes is short but workable for most graceful shutdown patterns: drain from load balancer, save state, exit. The trick is having that automation in place before you adopt Spot.
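As a minimal sketch of the first delivery channel, the snippet below polls the instance metadata service (IMDSv2) for the spot/instance-action document. It assumes it runs on the Spot instance itself. The helper names, the 5-second polling interval, and the on_notice callback are illustrative choices, not anything AWS prescribes.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(status, body):
    """Return the interruption notice as a dict, or None if there is none.

    IMDS answers 404 while no interruption is scheduled, and 200 with a small
    JSON document (keys "action" and "time") once the notice is issued.
    """
    if status != 200:
        return None
    return json.loads(body)


def fetch_instance_action(timeout=2):
    """Query IMDSv2 for spot/instance-action (only works on an EC2 instance)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # 404 is the normal "nothing scheduled" answer.
        return parse_instance_action(err.code, b"")


def wait_for_interruption(on_notice, fetch=fetch_instance_action,
                          poll_seconds=5, sleep=time.sleep):
    """Poll until a notice appears, then run the shutdown callback once."""
    while True:
        notice = fetch()
        if notice is not None:
            on_notice(notice)  # e.g. deregister from the load balancer, flush state
            return notice
        sleep(poll_seconds)
```

The fetch and sleep parameters are injectable so the loop can be exercised off-instance. In a real service the callback would start the drain, save state, and exit.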

What savings to expect

Approximate Spot vs on-demand savings for popular instance families:

  • m6i.xlarge in us-east-1: ~$0.06/hour Spot vs $0.192/hour on-demand (~69% off)
  • c6i.4xlarge in us-east-1: ~$0.27/hour Spot vs $0.68/hour on-demand (~60% off)
  • r7g.2xlarge in us-east-1: ~$0.15/hour Spot vs $0.42/hour on-demand (~64% off)
  • g4dn.xlarge in us-east-1 (GPU): ~$0.16/hour Spot vs $0.526/hour on-demand (~70% off)

Discounts are typically deeper for older or less-popular instance types, and shallower for newer or constrained types. GPU instances can have very high savings but also high interruption rates.
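The percentages in the list above are simply one minus the ratio of Spot price to on-demand price. A small sketch, using the approximate prices quoted in this post (real Spot prices move, so treat the numbers as a snapshot):

```python
def spot_discount(spot_price, on_demand_price):
    """Fraction saved by paying the Spot price instead of on-demand."""
    return 1 - spot_price / on_demand_price


# (Spot, on-demand) in USD per hour, us-east-1, copied from the list above.
PRICES = {
    "m6i.xlarge": (0.06, 0.192),
    "c6i.4xlarge": (0.27, 0.68),
    "r7g.2xlarge": (0.15, 0.42),
    "g4dn.xlarge": (0.16, 0.526),
}

for name, (spot, on_demand) in PRICES.items():
    print(f"{name}: {spot_discount(spot, on_demand):.0%} off")
```

The same function works on prices taken from the Spot price history for your own region and instance types.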

The Spot Instance Advisor (aws.amazon.com/ec2/spot/instance-advisor) shows savings over on-demand and interruption frequency per instance type, based on recent history. Check it before committing to a specific Spot configuration.

Workloads where Spot is right

Stateless web services behind a load balancer

A web tier with 10 instances behind an ALB can run on Spot. If 1-2 instances are reclaimed, the load balancer routes around them; new instances spin up from the Auto Scaling Group within minutes. End users see no impact.

Pattern: ASG with mixed_instances_policy that includes 70-90% Spot capacity. ALB health checks drain reclaimed instances. New Spot instances launch in different instance types to diversify against family-specific interruption events.

CI/CD runners

Build jobs and test runners are stateless and rerunnable. If a Spot instance gets reclaimed mid-build, retry the build on a fresh instance. The savings on CI compute (which can be 5-10% of an engineering organization's AWS bill) are substantial.
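The "retry on a fresh instance" idea can be sketched as a plain retry loop. Most CI systems have their own retry setting, so this is only an illustration of the logic. SpotInterrupted is a hypothetical exception standing in for however your runner reports a reclaimed instance.

```python
class SpotInterrupted(Exception):
    """Hypothetical signal that the runner was reclaimed mid-job."""


def run_with_retries(job, max_attempts=3):
    """Run `job` until it finishes, restarting from scratch after each interruption.

    `job` is called with the attempt number (1-based). Any other exception is a
    real build failure and is not retried.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job(attempt)
        except SpotInterrupted:
            if attempt == max_attempts:
                raise  # give up; repeated interruptions need a human look
```

Only interruptions are retried here. Retrying genuine test failures would hide flaky or broken builds.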

GitHub Actions self-hosted runners on EC2 Spot, GitLab Runner with Docker Machine on Spot, and Buildkite agents on Spot are all common production patterns.

Batch and data processing

Spark jobs, Glue jobs (with the Flex execution class), ML training, video encoding, ETL pipelines. Most batch workloads tolerate restart with checkpointing. Google Cloud's Dataflow offers a comparable mode, FlexRS.

EMR clusters specifically support Spot for task nodes; core nodes usually stay on-demand to preserve the cluster's HDFS state.
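Restart with checkpointing, as mentioned above, can be as simple as recording how far the job got. The sketch below is one possible shape, not a framework API. It uses a local JSON file as a stand-in for durable storage; on Spot the checkpoint would need to live somewhere that survives the instance, such as S3 or EFS. Function names and the checkpoint format are made up for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item (0 if no checkpoint exists)."""
    try:
        with open(path) as fh:
            return json.load(fh)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"next_index": next_index}, fh)
    os.replace(tmp, path)


def process_all(items, handle, checkpoint_path, every=100):
    """Process items in order, resuming from the last checkpoint."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        handle(items[i])
        if (i + 1) % every == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(items))
```

Items processed after the last checkpoint are handled again on resume, so handle should be idempotent. A smaller every value trades more checkpoint writes for less repeated work.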

Container workloads on EKS or ECS

Karpenter on EKS favors Spot capacity when a NodePool allows both spot and on-demand. ECS capacity providers can mix Fargate Spot with Fargate, or Spot EC2 with on-demand EC2. Kubernetes' inherent scheduling flexibility makes it well-suited to Spot diversification.

Development and testing

Dev environments, test infrastructure, ephemeral workloads. The downside of interruption is minimal; the cost savings are huge.

Workloads where Spot is wrong

Stateful single-instance services

Databases, single-instance caches without clustering, message brokers without HA configuration. A 2-minute warning isn't enough to safely migrate Postgres or shut down Redis without data loss in all configurations.

Exception: clustered or replicated stateful systems with explicit recovery (e.g., Postgres replicas, Kafka brokers with replication factor 3). The cluster can tolerate node loss; the cost savings apply.

Latency-sensitive on-demand traffic

If a Spot interruption would cause user-visible latency spikes (request drops, increased connection setup time), Spot might not be appropriate even with graceful shutdown. The 2-minute drain window assumes you can move connections without disruption.

Workloads with long startup time

If your service takes 10 minutes to start (large JVM, big model load, complex initialization), Spot is risky. The original instance is gone 2 minutes after the notice, so by the time a replacement is ready you may have run at reduced capacity for several minutes.
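A rough way to size that risk: the capacity gap is the replacement's time-to-ready minus the 2-minute notice. The sketch assumes a replacement launch is requested the moment the notice arrives; launch_delay_seconds is an illustrative knob for any extra delay before that happens.

```python
NOTICE_SECONDS = 120  # the 2-minute Spot interruption notice


def capacity_gap_seconds(startup_seconds, launch_delay_seconds=0):
    """Seconds the fleet runs one instance short after an interruption.

    Zero means the replacement is serving before the old instance disappears.
    """
    return max(0, launch_delay_seconds + startup_seconds - NOTICE_SECONDS)


print(capacity_gap_seconds(600))  # 10-minute startup
print(capacity_gap_seconds(90))   # 90-second startup
```

For the 10-minute startup in the text this comes to 480 seconds, or 8 minutes short per interruption. A service that starts in under 2 minutes has no gap under these assumptions.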

Implementation patterns

Auto Scaling Group with mixed instances

Modern recommended approach. Single ASG with a mix of on-demand and Spot, across multiple instance types and AZs for diversification:

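resource "aws_autoscaling_group" "web" {
  desired_capacity = 10
  max_size         = 30
  min_size         = 5

  # Assumption: var.private_subnet_ids lists subnets in several AZs.
  vpc_zone_identifier = var.private_subnet_ids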

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 30
      spot_allocation_strategy                 = "price-capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web.id
      }

      override {
        instance_type = "m6i.xlarge"
      }
      override {
        instance_type = "m6a.xlarge"
      }
      override {
        instance_type = "m5.xlarge"
      }
    }
  }
}

With the desired capacity of 10, this setup keeps 2 instances always on-demand, then splits the remaining 8 roughly 30% on-demand and 70% Spot. It diversifies across three instance types so an interruption event on m6i doesn't take out the whole fleet.
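To see what the base and percentage settings mean at different fleet sizes, here is a small sketch of the arithmetic. It returns fractional targets. The Auto Scaling Group resolves these to whole instances, and that rounding is deliberately not modelled here.

```python
def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand, spot) capacity targets for a mixed instances policy.

    The base is filled with on-demand first; the percentage applies only to
    capacity above the base.
    """
    above_base = max(0, desired - on_demand_base)
    on_demand = min(desired, on_demand_base) + above_base * on_demand_pct_above_base / 100
    return on_demand, desired - on_demand


for size in (5, 10, 30):  # min_size, desired_capacity, max_size from the example
    on_demand, spot = split_capacity(size, on_demand_base=2, on_demand_pct_above_base=30)
    print(f"{size} instances: {on_demand:.1f} on-demand, {spot:.1f} Spot")
```

Because of the base, the effective on-demand share is higher than 30% in a small fleet and approaches 30% as the fleet scales out.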

EKS with Karpenter

Karpenter provisions Kubernetes worker nodes based on pending pod requirements. With a NodePool requirement that allows karpenter.sh/capacity-type values of both "spot" and "on-demand", it prefers Spot and falls back to on-demand when Spot capacity isn't available. Diversification across instance types is built in.

ECS with capacity providers

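An ECS capacity_provider_strategy assigns a weight (and optionally a base) to each capacity provider: FARGATE and FARGATE_SPOT, or Auto Scaling group providers backed by Spot and on-demand EC2. A single strategy can't mix Fargate and EC2 providers. Tasks distribute across providers based on the weights.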
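The effect of base and weight can be approximated in a few lines. This is a simplified model, not the ECS scheduler's exact algorithm: it fills any base first, then splits the rest in proportion to the weights and hands leftover tasks to the largest fractional shares. The dictionary layout is an illustrative assumption.

```python
def distribute_tasks(desired, strategy):
    """Approximate how many tasks land on each capacity provider.

    `strategy` is a list of dicts with "provider", "weight", and optional "base".
    """
    counts = {s["provider"]: 0 for s in strategy}
    remaining = desired
    for s in strategy:  # the base is satisfied before weights apply
        base = min(s.get("base", 0), remaining)
        counts[s["provider"]] += base
        remaining -= base
    total_weight = sum(s["weight"] for s in strategy)
    if remaining and total_weight:
        shares = [(remaining * s["weight"] / total_weight, s["provider"]) for s in strategy]
        for share, provider in shares:
            counts[provider] += int(share)
        leftover = desired - sum(counts.values())
        by_fraction = sorted(shares, key=lambda sp: sp[0] - int(sp[0]), reverse=True)
        for _, provider in by_fraction[:leftover]:
            counts[provider] += 1
    return counts


strategy = [
    {"provider": "FARGATE", "base": 2, "weight": 1},
    {"provider": "FARGATE_SPOT", "weight": 3},
]
print(distribute_tasks(10, strategy))
```

With a base of 2 on FARGATE and a 1:3 weight ratio, a 10-task service ends up with 4 tasks on FARGATE and 6 on FARGATE_SPOT in this model.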

Fargate Spot

For Fargate workloads, Fargate Spot offers up to 70% off Fargate's on-demand price, with the same 2-minute warning. It's right for stateless tasks that handle SIGTERM and exit cleanly.

A realistic cost comparison

Take a typical web tier needing 10 m6i.large instances (20 vCPUs) of compute, running 24/7:

All on-demand

  • 10 × m6i.large × 730 hours × $0.096/hour = ~$700/month

30% on-demand, 70% Spot

  • 3 on-demand × 730 × $0.096 = $210
  • 7 Spot × 730 × $0.032 (typical Spot rate) = $164
  • Total: $374/month (47% savings)

70% Compute SP commitment + 30% Spot for burst

  • 7 Compute SP-covered × 730 × $0.070 (~27% off): $358
  • 3 Spot × 730 × $0.032: $70
  • Total: $428/month (39% savings)

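In this example the Spot-heavy mix is the cheapest, but it leaves 70% of the fleet interruptible. The SP-plus-Spot mix costs about $54/month more and caps the interruptible share at 30%. Mix-and-match strategies (SP for baseline, Spot for everything above baseline) often produce the lowest total cost for a given level of interruption risk.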
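The comparison above can be reproduced, and adapted to your own mix, with a few lines of arithmetic. The hourly rates are the ones used in this section. The last scenario is an extra illustration that does not appear in the text.

```python
HOURS_PER_MONTH = 730
ON_DEMAND = 0.096     # m6i.large on-demand, USD/hour
SPOT = 0.032          # "typical" Spot rate used in the text
SAVINGS_PLAN = 0.070  # Compute Savings Plan rate, about 27% off


def monthly_cost(mix):
    """Sum the monthly cost of a list of (instance_count, hourly_rate) pairs."""
    return sum(count * rate * HOURS_PER_MONTH for count, rate in mix)


baseline = monthly_cost([(10, ON_DEMAND)])
scenarios = {
    "all on-demand": [(10, ON_DEMAND)],
    "30% on-demand, 70% Spot": [(3, ON_DEMAND), (7, SPOT)],
    "70% Savings Plan, 30% Spot": [(7, SAVINGS_PLAN), (3, SPOT)],
    # Not in the text: a smaller commitment with more Spot.
    "30% Savings Plan, 70% Spot": [(3, SAVINGS_PLAN), (7, SPOT)],
}

for name, mix in scenarios.items():
    cost = monthly_cost(mix)
    print(f"{name}: ${cost:,.0f}/month ({1 - cost / baseline:.0%} savings)")
```

Swapping in rates for your own instance types shows how the cheapest mix shifts with the size of the Savings Plan commitment and the share of the fleet you are willing to have interrupted.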

How c3x models Spot in estimates

Set purchaseOption: "spot" on the resource in c3x-usage.yml:

# c3x-usage.yml
resource_usage:
  aws_instance.worker:
    purchaseOption: "spot"

c3x applies the historical Spot price advisor's typical rate for the instance type. Real Spot prices fluctuate; the estimate is a reasonable mid-point. For more precision in c3x estimates including mixed strategies, see how to estimate AWS costs from Terraform.

FAQ

How much do Spot instances actually save?

Typically 50-90% off on-demand prices, varying by instance type and region. Per-instance-type savings shift with supply and demand; the Spot Instance Advisor shows current figures. For most general-purpose instance families in major regions, expect 60-70% savings on average. The discount is real but comes with the trade-off of potential interruption with 2 minutes' notice.

What workloads are right for Spot?

Stateless services that can tolerate restart: web servers behind a load balancer, batch jobs, CI/CD runners, data processing pipelines, training workloads, container workloads with proper orchestration. Wrong for stateful single-instance services (databases, in-memory caches with no clustering, dev workstations).

How often do Spot interruptions actually happen?

Highly variable. AWS publishes the Spot Instance Advisor which shows historical interruption rates per instance type. Major instance families in popular regions often have <5% monthly interruption rates. Niche instance types or capacity-constrained regions can see 20%+. Diversification across instance types and AZs reduces effective interruption risk.

Can I use Spot with Reserved Instances?

Yes, but they don't combine directly. RIs and Savings Plans discount on-demand usage. Spot is a separate purchase model, so no RI or SP discount is layered on top of the Spot price. Most teams use both: RIs/SPs cover the steady baseline, Spot covers burst capacity.

How do I handle Spot interruptions gracefully?

Subscribe to the EC2 Spot Instance interruption notice (a 2-minute warning) via the instance metadata service or EventBridge (formerly CloudWatch Events). On notice: drain the instance from your load balancer or workload manager, save state, exit cleanly. For Kubernetes workloads, the AWS Node Termination Handler does this automatically.
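For the event-driven route, a handler only needs to recognise the interruption event and pull out the instance ID. The sketch below is written in the shape of a Lambda handler. The drain callback is a placeholder for whatever deregistration or draining your setup needs, and its name is an assumption.

```python
def interrupted_instance(event):
    """Return the instance ID from a Spot interruption event, else None."""
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    return event.get("detail", {}).get("instance-id")


def handler(event, context=None, drain=print):
    """Lambda-style entry point: drain the instance named in the event."""
    instance_id = interrupted_instance(event)
    if instance_id is not None:
        drain(instance_id)  # placeholder: deregister from the target group, etc.
    return instance_id


sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-1234567890abcdef0", "instance-action": "terminate"},
}
handler(sample_event)
```

Events that are not interruption warnings are ignored, so the same function can sit behind a broader EventBridge rule without side effects.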

Spot Fleet vs Auto Scaling Group with Spot?

Auto Scaling Groups with a mixed instances policy are the modern AWS-recommended approach. Spot Fleets are still supported but more complex. For EKS, Karpenter handles Spot diversification automatically. For ECS, capacity providers do the same. Most modern setups don't need Spot Fleet's complexity.

Recommendations

For most production AWS workloads:

  1. Identify stateless workloads that tolerate restart. Web tiers, CI runners, batch jobs are usually the first candidates.
  2. Set up graceful shutdown via the EC2 Spot Interruption Notice or AWS Node Termination Handler for Kubernetes.
  3. Start with 30-50% Spot in an Auto Scaling Group with diversification across instance types. Watch interruption rates for two weeks.
  4. Increase Spot percentage if interruptions are manageable. Many web tiers run 70-90% Spot in production at major companies.
  5. Combine with Compute Savings Plans for the on-demand portion. See Reserved Instances vs Savings Plans.

For per-resource pricing including Spot configurations, see the aws_instance catalog and the aws_eks_cluster catalog.
