
Right-sizing EC2 instances: the 20/50 rule and full process

Right-sizing cuts steady-state EC2 spend 20-40% in one pass. The 20/50 rule (average under 20%, peak under 50%) finds candidates. Here's the full pipeline: gather metrics, decide size, validate against periodic peaks, deploy in waves with monitoring.

The C3X Team · 11 min read

Quick answer

Right-sizing is the highest-leverage EC2 optimization. A typical right-sizing pass on a 14-day observation window cuts steady-state compute spend 20-40%. The process: gather CloudWatch CPU/memory/network metrics, identify instances with average under 20% and p99 under 50%, recommend one step down in size or class, validate against periodic peaks, deploy with monitoring. Do this before committing Savings Plans.

Every EC2 fleet has waste. Engineers pick instance sizes based on guesses, deploy at the start of the project, and rarely revisit. Over months, traffic patterns change, code optimizations land, peak workloads shift. The instance size doesn't follow.

Right-sizing closes that gap. This post covers the actual process: what to measure, how to decide, how to validate, and the pitfalls that cause right-sizing to make things worse instead of better.

The 20/50 rule

The simplest right-sizing heuristic: an instance is a candidate for downsizing if both:

  • Average CPU utilization over 14 days is below 20%
  • Peak (p99) CPU utilization is below 50%

Why both? Average alone misses peak workloads. A bursty service with 8% average and 90% spikes is correctly sized: downsizing would cause throttling during peaks. Peak alone is too permissive. An instance running steadily at 45% passes a peak-under-50% test, but halving it would push it to roughly 90% with no headroom left.

The 20/50 rule is a starting filter. Once an instance passes it, look closer.
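As a minimal sketch, the filter can be written as a single predicate. The function name, the fleet dictionary, and the utilization figures below are illustrative, not taken from any real fleet.

```python
def is_downsize_candidate(avg_cpu: float, p99_cpu: float,
                          avg_limit: float = 20.0,
                          peak_limit: float = 50.0) -> bool:
    """Return True only when BOTH conditions of the 20/50 rule hold."""
    return avg_cpu < avg_limit and p99_cpu < peak_limit


# Made-up 14-day figures in percent: (average CPU, p99 CPU).
fleet = {
    "steady-idle": (12.0, 35.0),  # low average, low peak: candidate
    "bursty": (8.0, 90.0),        # low average, high peak: leave alone
    "busy": (30.0, 35.0),         # average above 20%: not flagged
}

candidates = [name for name, (avg, p99) in fleet.items()
              if is_downsize_candidate(avg, p99)]
print(candidates)  # ['steady-idle']
```

Anything the predicate flags still needs the closer look described in the pipeline; it only narrows the list.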

The right-sizing pipeline

Step 1: Gather metrics

For each EC2 instance, pull from CloudWatch for the last 14 days:

  • CPUUtilization: average, p95, p99, max
  • NetworkIn/NetworkOut: average, peak
  • DiskReadOps/DiskWriteOps: average, peak (for instance-store)
  • EBS metrics: VolumeQueueLength for all volumes, BurstBalance for gp2 (gp3 has no burst-credit bucket)

Memory metrics require the CloudWatch agent — install it on anything you plan to right-size. Without memory data you'll right-size CPU-bound workloads correctly and break memory-bound ones silently.
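One way to pull the CPU figures is with boto3's `get_metric_statistics`. The sketch below is a starting point, not a complete collector: the helper names and the instance ID are hypothetical, the one-hour period is an assumption, and the fetch function needs boto3 plus AWS credentials, so it is defined but not called. CloudWatch does not accept `Statistics` and `ExtendedStatistics` in the same call, hence two requests.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cpu_metric_requests(instance_id: str, days: int = 14,
                        period_s: int = 3600,
                        now: Optional[datetime] = None) -> list:
    """Build keyword arguments for two GetMetricStatistics calls."""
    end = now or datetime.now(timezone.utc)
    base = {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": period_s,  # 14 days of hourly points = 336 datapoints
    }
    # Basic and percentile statistics must go in separate requests.
    return [
        {**base, "Statistics": ["Average", "Maximum"]},
        {**base, "ExtendedStatistics": ["p95", "p99"]},
    ]


def fetch_cpu_datapoints(instance_id: str) -> list:
    """Run both requests. Requires boto3 and credentials; not called here."""
    import boto3  # imported lazily so the builder above runs without it

    cloudwatch = boto3.client("cloudwatch")
    points = []
    for kwargs in cpu_metric_requests(instance_id):
        points.extend(cloudwatch.get_metric_statistics(**kwargs)["Datapoints"])
    return points


reqs = cpu_metric_requests("i-0123456789abcdef0")  # placeholder instance ID
print(reqs[0]["Statistics"], reqs[1]["ExtendedStatistics"])
```

The response contains one datapoint per hour, so a caller still has to roll those up into window-level figures, for example by taking the highest hourly p99 as a conservative peak.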

Step 2: Apply 20/50 to filter

From the fleet, identify instances that meet the threshold. For a typical fleet of 100 instances, expect 30-50% to qualify.

Step 3: Determine the right new size

Several heuristics depending on the instance type:

Same-class downsize: m6i.4xlarge to m6i.2xlarge. Half the vCPU/memory at half the price. Right if both CPU and memory utilization show headroom.

Cross-class downsize: m6i.large to t3.large. Same nominal size, but burstable. Right for idle-most-of-the-time workloads.

Cross-family change: m6i.2xlarge (8 vCPU, 32 GiB) to r6i.xlarge (4 vCPU, 32 GiB). Half the vCPU with the same memory, because r-class carries twice the memory per vCPU. Right when the workload is memory-bound, not CPU-bound.

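Generation upgrade: m5.4xlarge to m6i.4xlarge. Same nominal size on a newer generation. Newer Intel generations are usually priced at or near the older one (m5 and m6i list the same on-demand rate in us-east-1) while delivering more performance per vCPU, so the gain shows up as better price/performance and room for a later downsize. Usually positive: migrate legacy-generation instances once the workload has been tested on the new type.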
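The first three heuristics can be sketched as a small decision helper. Everything here is an illustrative assumption: the size ladder covers only the sizes that double at each step (real families also offer sizes such as 12xlarge and 24xlarge), and the strategy labels are just strings for a report.

```python
from typing import Optional

# Sizes where vCPU and memory double at each step in many families.
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge", "16xlarge"]


def same_class_downsize(instance_type: str) -> Optional[str]:
    """Return the next size down in the same family, or None if unknown."""
    family, _, size = instance_type.partition(".")
    if size not in SIZE_LADDER:
        return None  # off-ladder size: decide by hand
    index = SIZE_LADDER.index(size)
    if index == 0:
        return None  # already at the bottom of this ladder
    return f"{family}.{SIZE_LADDER[index - 1]}"


def choose_strategy(cpu_has_headroom: bool, memory_has_headroom: bool) -> str:
    """Pick a direction from the two headroom signals."""
    if cpu_has_headroom and memory_has_headroom:
        return "same-class downsize"
    if cpu_has_headroom:
        return "cross-family change (memory-optimized)"
    return "keep current size"


print(same_class_downsize("m6i.4xlarge"))   # m6i.2xlarge
print(choose_strategy(True, False))         # cross-family change (memory-optimized)
```

The memory signal only exists if the CloudWatch agent is installed, so without it the helper should not be trusted to return "same-class downsize".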

Step 4: Validate against periodic peaks

Right-sizing based on 14 days misses periodic workloads. Before deploying:

  • Check if the workload has month-end, quarter-end, or year-end peaks
  • Verify the recommended size handles the historical peak with headroom
  • For batch workloads, identify the largest job and verify the new size handles it
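A rough way to run that check is to project the historical peak onto the smaller instance. The sketch assumes CPU demand scales linearly with vCPU count, which is a simplification, and uses an 80% ceiling as an assumed headroom target. The figures are made up.

```python
def projected_peak(historical_peak_pct: float, old_vcpus: int,
                   new_vcpus: int) -> float:
    """Estimate peak CPU % on the new size, assuming linear scaling."""
    return historical_peak_pct * old_vcpus / new_vcpus


def survives_peak(historical_peak_pct: float, old_vcpus: int,
                  new_vcpus: int, ceiling_pct: float = 80.0) -> bool:
    """True if the projected peak stays at or under the ceiling."""
    return projected_peak(historical_peak_pct, old_vcpus, new_vcpus) <= ceiling_pct


# The 14-day p99 was 35%, but last month-end close reached 55%.
print(projected_peak(35.0, 8, 4))  # 70.0
print(survives_peak(35.0, 8, 4))   # True
print(survives_peak(55.0, 8, 4))   # False
```

In this example the 14-day window alone would approve the downsize, while the month-end peak rules it out.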

Step 5: Deploy with monitoring

Don't bulk-deploy. Right-size in waves:

  1. Pilot: 1-2 instances first, monitor for 1 week
  2. Expansion: 10-20 instances if pilot is clean
  3. Full rollout: remaining instances if expansion is clean

Set up CloudWatch alarms on the right-sized instances: CPUUtilization above 80% sustained for 5 minutes, memory above 85%, queue depths growing. These alarms catch a too-aggressive downsize before users notice.
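The CPU alarm could be created with boto3's `put_metric_alarm` along these lines. The alarm name and helper names are hypothetical, and the sketch uses one 5-minute period to approximate "sustained for 5 minutes"; it omits `AlarmActions`, so a real setup would add an SNS topic or similar. The create function needs boto3 and credentials and is not called here.

```python
def cpu_alarm_params(instance_id: str, threshold_pct: float = 80.0) -> dict:
    """Build keyword arguments for a high-CPU alarm on one instance."""
    return {
        "AlarmName": f"rightsize-cpu-high-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # one 5-minute window
        "EvaluationPeriods": 1,   # alarm after a single breaching window
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_cpu_alarm(instance_id: str) -> None:
    """Create the alarm. Requires boto3 and credentials; not called here."""
    import boto3  # imported lazily so the builder above runs without it

    boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_params(instance_id))


params = cpu_alarm_params("i-0123456789abcdef0")  # placeholder instance ID
print(params["AlarmName"], params["Threshold"])
```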

Real-world right-sizing math

A fleet with 100 m5.2xlarge instances ($0.384/hour each on-demand) costs:

100 × $0.384 × 730 hours = $28,032/month on-demand.

After right-sizing analysis: 40 instances pass 20/50 and can move to m5.xlarge ($0.192/hour). 20 instances are idle most of the time with occasional bursts and can move to t3.2xlarge ($0.333/hour) for burstable savings. 40 instances stay at m5.2xlarge.

New fleet:

  • 40 × $0.192 × 730 = $5,606
  • 20 × $0.333 × 730 = $4,862
  • 40 × $0.384 × 730 = $11,213
  • Total: $21,681

Savings: $6,351/month = $76,212/year. 23% reduction. And we haven't touched Savings Plans yet.
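The same arithmetic as a short script, which makes it easy to swap in your own counts and rates. The rates are the ones quoted above; the helper name is illustrative.

```python
HOURS_PER_MONTH = 730


def monthly_cost(count: int, hourly_rate: float) -> float:
    """On-demand cost for `count` instances running all month."""
    return count * hourly_rate * HOURS_PER_MONTH


before = monthly_cost(100, 0.384)  # 100 x m5.2xlarge

new_fleet = [
    (40, 0.192),  # m5.xlarge
    (20, 0.333),  # t3.2xlarge
    (40, 0.384),  # m5.2xlarge, unchanged
]
after = sum(monthly_cost(count, rate) for count, rate in new_fleet)
savings = before - after

print(f"before ${before:,.0f}, after ${after:,.0f}, "
      f"saves ${savings:,.0f}/month ({savings / before:.0%})")
# before $28,032, after $21,681, saves $6,351/month (23%)
```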

AWS Compute Optimizer integration

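Compute Optimizer analyzes 14 days of CloudWatch data by default and produces per-instance recommendations. The default analysis is free; the paid enhanced infrastructure metrics option extends the lookback to roughly three months. Enable it once at the organization level.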

Output is a recommendation per instance with confidence rating and estimated monthly savings. Use the recommendations as a starting point. Cross-check against periodic peaks before deploying.

Compute Optimizer doesn't see memory by default — install the CloudWatch agent on candidates first so it can incorporate memory metrics. The "memory" pillar of the recommendation only engages with agent-supplied memory data.

Right-sizing gotchas

Right-sizing under Savings Plans

If you have a Compute Savings Plan, downsizing reduces hourly usage but the commitment stays. If you downsize so far that eligible usage falls below the committed hourly amount, you still pay the full commitment and the unused portion is wasted. Either downsize within the commit envelope or wait for the commit to expire.
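A back-of-the-envelope check for the commit envelope, as a sketch. The dollar figures are hypothetical, and "eligible spend" here means usage priced at Savings Plan rates, which is a simplification of how AWS applies the plan.

```python
HOURS_PER_MONTH = 730


def unused_commitment_per_hour(commit_per_hour: float,
                               eligible_spend_per_hour: float) -> float:
    """Commitment that goes unused once usage drops below the commit."""
    return max(0.0, commit_per_hour - eligible_spend_per_hour)


# Hypothetical: $10/hour commitment; right-sizing cuts eligible spend
# from $11/hour to $7/hour.
print(unused_commitment_per_hour(10.0, 11.0))                    # 0.0
print(unused_commitment_per_hour(10.0, 7.0) * HOURS_PER_MONTH)   # 2190.0
```

In the second case the downsize still lowers the bill compared with before, but $2,190 a month of commitment buys nothing until the plan expires.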

For more on Savings Plan strategy with right-sizing, see Reserved Instances vs Savings Plans.

Network performance bracket changes

m5.large has "up to 10 Gbps" network. m5.4xlarge has "up to 10 Gbps" but with a much higher sustained baseline. Workloads that push network heavily can perform worse on a smaller size even if CPU has headroom.

Per-vCPU licensing

SQL Server, Oracle, and Windows Server licensing is often priced per vCPU or per core. Downsizing that reduces vCPU count reduces license cost too. Sometimes the license savings exceed the instance savings, so factor them in.

Burst balance on EBS

gp2 volumes have burst credits that accumulate during low IO and burn during peaks. Smaller instances often have smaller EBS-optimized bandwidth. Right-sizing CPU without checking EBS BurstBalance can cause IO throttling.

Continuous right-sizing

Right-sizing isn't a one-time exercise. Workloads change. Build a quarterly review:

  • Run Compute Optimizer report
  • Filter for "high confidence" recommendations
  • Validate against periodic patterns
  • Deploy in waves with monitoring

For green-field instances, set them up with reasonable starting sizes and revisit in 60 days when you have real telemetry.

Right-sizing with c3x

c3x estimates the cost of your proposed right-sizing changes from Terraform. If you're about to change instance_type from m5.2xlarge to m5.xlarge, c3x shows the dollar delta directly in your CI:

c3x estimate --diff
# Output:
# aws_instance.api: $280.32 -> $140.16 (-$140.16, -50%)
# aws_instance.worker: $280.32 -> $280.32 (no change)
# Total monthly delta: -$140.16

For the full instance pricing reference, see the aws_instance catalog page.

FAQ

What CPU utilization threshold means an instance is oversized?

Sustained average below 20% over 14+ days, with peak (p99) below 50%, is a strong oversized signal. Meeting both halves of the 20/50 rule means there's headroom to downsize one size or class. Be careful with bursty workloads: a 10% average with regular 90% spikes is correctly sized.

How long should I observe before right-sizing?

Two weeks minimum for steady-state workloads; one full business cycle for periodic workloads (monthly close, quarterly batch jobs, year-end runs). The risk of right-sizing on a sample that excludes a periodic peak is downsizing and then OOM'ing during the next peak.

Should I use AWS Compute Optimizer recommendations?

Yes, as a starting point. Compute Optimizer analyzes 14 days of CloudWatch data and produces sized recommendations with confidence ratings. The accuracy is usually good for steady-state workloads. Validate the recommendation against your own knowledge of the workload (Compute Optimizer doesn't know about your peak patterns).

Is downsizing to a burstable (t-class) instance ever right?

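For workloads that are idle most of the time with occasional bursts (development servers, low-traffic services, batch coordinators), yes. T-class instances earn CPU credits during idle periods and spend them during bursts. They're roughly 10-15% cheaper than the comparable m-class size (t3.2xlarge at $0.333/hour versus m5.2xlarge at $0.384/hour in the example above). Avoid them for steady-state CPU-bound workloads: you'll exhaust credits and then either throttle (standard mode) or pay surplus-credit charges (unlimited mode, the T3 default).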
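The credit balance can be estimated with simple arithmetic: one CPU credit is one vCPU at 100% for one minute. The sketch below uses the t3.large figures from the AWS documentation as I understand them (2 vCPUs, 36 credits earned per hour); treat them as assumptions and check the current credit table before relying on them.

```python
def net_credits_per_hour(avg_cpu_pct: float, vcpus: int,
                         credits_earned_per_hour: float) -> float:
    """Credits gained (positive) or burned (negative) per hour."""
    # One credit = one vCPU at 100% for one minute.
    credits_spent = avg_cpu_pct * vcpus * 60 / 100
    return credits_earned_per_hour - credits_spent


# Assumed t3.large: 2 vCPUs, 36 credits earned per hour.
print(net_credits_per_hour(10.0, 2, 36))  # 24.0, balance grows
print(net_credits_per_hour(30.0, 2, 36))  # 0.0, break-even baseline
print(net_credits_per_hour(50.0, 2, 36))  # -24.0, balance drains
```

A workload whose average sits above the break-even point is a poor fit for a t-class instance, however low its price looks.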

What about right-sizing memory-bound workloads?

Memory utilization isn't reported by default in CloudWatch, so install the CloudWatch agent (also called the unified agent). Once you have memory metrics, the same 20/50 rule applies. Memory-bound workloads often benefit from moving from m-class (general purpose) to r-class (memory-optimized), which has 2x the memory per vCPU at a moderate price premium.

Should I right-size before or after buying Savings Plans?

Right-size first, then commit. Buying Savings Plans on oversized instances locks you into the wrong shape for 1-3 years. Order: (1) right-size, (2) observe stability for two weeks, (3) commit Savings Plans on the right-sized baseline.

Summary

Right-sizing is the single highest-leverage EC2 optimization. The 20/50 rule (average under 20%, peak under 50%) finds candidates. A 14-day observation window is the minimum; one full periodic cycle is better. Deploy in waves with monitoring. Right-size before committing Savings Plans, not after.

Once instances are right-sized, the next two optimizations are Savings Plans on the new baseline and aggressive scheduling (turning off dev/staging at night). For commitment strategy, see Reserved Instances vs Savings Plans. For the broader EC2 cost picture, see aws_instance pricing details.
