`aws_emr_cluster` cost estimation

Managed Hadoop, Spark, Hive, Presto, and HBase. Bills EC2 instance hours plus an EMR service fee (typically $0.063-$0.27/instance-hour on top of EC2). EMR Serverless eliminates instance management; EMR on EKS allows reusing EKS clusters.

Amazon EMR is the managed big data platform. The aws_emr_cluster resource creates a Spark/Hadoop/Presto cluster on EC2 instances. Pricing has two components: the underlying EC2 instances and an EMR service fee.

EMR service fee varies by instance class:
- m5.large: $0.063/hour on top of EC2
- m5.4xlarge: $0.27/hour on top of EC2
- r5.4xlarge: $0.27/hour on top of EC2
- Instance store classes (i3, etc.): higher EMR fees

A typical EMR cluster, 1 master m5.xlarge + 5 core m5.4xlarge running for 6 hours of processing:
- EC2 instance hours: $0.192 + 5 × $0.768 = $4.03 per hour, or $24.19 for 6 hours
- EMR service fees: $0.063 + 5 × $0.27 = $1.41 per hour, or $8.48 for 6 hours (the master is charged at the low end of the fee range)
- Total: ~$33 per processing run
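The per-run arithmetic can be sketched with Terraform locals, which makes the formula explicit: hourly cost is the sum over node roles of count × (EC2 rate + EMR fee), multiplied by run hours. The rates below are hard-coded assumptions (the m5.xlarge rate and EMR fees quoted on this page, plus an assumed $0.768/hour for m5.4xlarge), and the local and output names are hypothetical. This is not how C3X computes its estimate.

```hcl
# Illustrative only: rates are hard-coded assumptions, not looked up from AWS.
locals {
  run_hours = 6

  # On-demand EC2 rates in USD per instance-hour (assumed).
  ec2_rate = {
    master = 0.192 # m5.xlarge
    core   = 0.768 # m5.4xlarge
  }

  # EMR service fee in USD per instance-hour (figures quoted on this page).
  emr_fee = {
    master = 0.063
    core   = 0.27
  }

  node_count = {
    master = 1
    core   = 5
  }

  # Hourly cost = sum over node roles of count * (EC2 rate + EMR fee).
  hourly_cost = sum([
    for role, n in local.node_count :
    n * (local.ec2_rate[role] + local.emr_fee[role])
  ])

  run_cost = local.hourly_cost * local.run_hours
}

output "emr_run_cost_usd" {
  value = local.run_cost
}
```

With these inputs the hourly cost is 5.445 and the 6-hour run comes to about 32.67 USD. Changing run_hours or node_count shows how quickly idle hours add up.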

EMR alternatives within AWS:
- EMR Serverless: pay per vCPU-second of actual usage. No idle cost. Right for sporadic jobs.
- EMR on EKS: run Spark on existing EKS clusters. Eliminates dedicated EMR instances; reuses EKS capacity.
- AWS Glue: managed Spark for ETL specifically. Different pricing model ($0.44/DPU-hour).

EMR on Spot is common: Spot for task nodes can save 60-90%. Master and core nodes typically run on-demand for stability.

C3X estimates EMR cost from the cluster's instance group blocks (master, core, task) and their instance_type, and applies the EMR service fee that matches each instance class.

Terraform example

A minimal but realistic configuration that C3X can estimate.

resource "aws_emr_cluster" "spark" {
  name          = "spark-etl-cluster"
  release_label = "emr-7.0.0"
  service_role  = aws_iam_role.emr_service.arn # required; role assumed to be defined elsewhere
  applications  = ["Spark"]

  ec2_attributes {
    subnet_id        = aws_subnet.private.id
    instance_profile = aws_iam_instance_profile.emr_ec2.arn
  }

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type  = "m5.4xlarge"
    instance_count = 3
  }

  configurations_json = jsonencode([
    {
      Classification = "spark-defaults"
      Properties = {
        "spark.dynamicAllocation.enabled" = "true"
      }
    }
  ])

  auto_termination_policy {
    idle_timeout = 3600
  }
}

Pricing dimensions

What you actually pay for when you provision aws_emr_cluster.

Dimension	Unit	What's being charged	Example rate
EC2 instance hours	per instance-hour	Standard EC2 pricing for each instance in the cluster.	$0.192/hour for m5.xlarge
EMR service fee	per instance-hour	Additional EMR-specific fee on top of EC2. Varies by instance class.	$0.063-$0.27/instance-hour
EBS volumes	per GB-month	EBS volumes attached to cluster nodes. Standard EBS pricing.	$0.08/GB-month for gp3
EMR Serverless	per vCPU-second + per GB-second	Alternative billing model for sporadic workloads. No instance management.	$0.052624/vCPU-hour + $0.0057785/GB-hour

Optimization tips

Common ways to reduce aws_emr_cluster cost without changing the workload.

Use Spot for task nodes

60-80% on task capacity

Task nodes hold no persistent state (they store no HDFS data) and can tolerate interruption. Spot pricing for m5.4xlarge can be 60-80% lower than on-demand. Keep master and core nodes on-demand for stability, and run task nodes on Spot for extra capacity.
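One way to express this in Terraform is a separate aws_emr_instance_group attached to the cluster, where setting bid_price turns the group into a Spot request. The sketch below assumes the aws_emr_cluster.spark resource from the example on this page. The group name, node count, and bid value are illustrative, and whether C3X prices this resource is not stated here.

```hcl
# Task nodes as a separate instance group running on Spot.
resource "aws_emr_instance_group" "task_spot" {
  name           = "task-spot" # hypothetical name
  cluster_id     = aws_emr_cluster.spark.id
  instance_type  = "m5.4xlarge"
  instance_count = 4 # illustrative

  # Setting bid_price declares the group as Spot Instances.
  # "0.77" is an assumed ceiling in USD per instance-hour, not a recommendation.
  bid_price = "0.77"
}
```

Leaving bid_price unset makes the same group run on-demand, which is a quick way to compare the two in an estimate.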

Set auto-termination

Variable — avoids forgotten cluster costs

Idle EMR clusters bill EC2 + EMR fees continuously. An auto_termination_policy with idle_timeout = 3600 (seconds, i.e. 1 hour) terminates the cluster once it has been idle that long, which prevents forgotten clusters. Critical for dev/test usage.

Consider EMR Serverless for sporadic jobs

70-90% on sporadic workloads

Jobs that run only occasionally (a few hours a week) still pay for idle time on a traditional EMR cluster. EMR Serverless bills only for the vCPU-seconds and GB-seconds actually used. Right for nightly ETL, ad-hoc analytics, on-demand queries.
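A minimal sketch of the Serverless equivalent uses the aws_emrserverless_application resource. The application name, capacity ceiling, and idle timeout are assumptions chosen for illustration, and whether C3X estimates this resource is not stated on this page.

```hcl
resource "aws_emrserverless_application" "spark" {
  name          = "spark-etl-serverless" # hypothetical name
  release_label = "emr-7.0.0"
  type          = "spark"

  # Upper bound on what the application can scale to (assumed values).
  # This caps the worst-case bill for a runaway job.
  maximum_capacity {
    cpu    = "64 vCPU"
    memory = "256 GB"
  }

  # Start workers when a job is submitted.
  auto_start_configuration {
    enabled = true
  }

  # Release workers after 15 idle minutes so nothing bills between jobs.
  auto_stop_configuration {
    enabled              = true
    idle_timeout_minutes = 15
  }
}
```

The maximum_capacity block plays a role similar to instance_count on a cluster: it bounds spend, but you are billed only for what jobs actually consume.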

Run Spark on EKS to reuse capacity

Variable based on cluster utilization

EMR on EKS runs Spark inside existing EKS clusters. The EMR service fee still applies ($0.05/vCPU-hour) but you avoid dedicated EMR EC2 instances. Right for teams already heavy on EKS.
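In Terraform, EMR on EKS is registered as a virtual cluster that maps to a namespace in an existing EKS cluster. The sketch below assumes an aws_eks_cluster.main resource and a spark-jobs namespace already exist. Both names are hypothetical.

```hcl
resource "aws_emrcontainers_virtual_cluster" "spark" {
  name = "spark-on-eks" # hypothetical name

  container_provider {
    # Name of the existing EKS cluster whose capacity is reused (assumed resource).
    id   = aws_eks_cluster.main.name
    type = "EKS"

    info {
      eks_info {
        # Kubernetes namespace where Spark pods run (assumed to exist).
        namespace = "spark-jobs"
      }
    }
  }
}
```

The virtual cluster itself provisions no EC2 instances. The compute cost sits with the EKS node groups, which is why savings depend on how much spare capacity those nodes have.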

Use instance fleets for diversity

Instance fleets let EMR choose from a list of compatible types. Better Spot availability, automatic failover to on-demand. Reduces Spot interruption impact.
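A fleet-based variant of the cluster might look like the sketch below. It keeps master and core capacity on-demand and adds a Spot task fleet with two candidate instance types and a fallback to on-demand. The IAM, subnet, and cluster names are assumed to exist elsewhere, capacities are illustrative, and this page does not say whether C3X estimates fleet blocks.

```hcl
resource "aws_emr_cluster" "fleet" {
  name          = "spark-etl-fleet" # hypothetical name
  release_label = "emr-7.0.0"
  applications  = ["Spark"]
  service_role  = aws_iam_role.emr_service.arn # assumed to exist

  ec2_attributes {
    subnet_id        = aws_subnet.private.id
    instance_profile = aws_iam_instance_profile.emr_ec2.arn
  }

  # Fleets replace instance groups; the two cannot be mixed in one cluster.
  master_instance_fleet {
    target_on_demand_capacity = 1

    instance_type_configs {
      instance_type = "m5.xlarge"
    }
  }

  core_instance_fleet {
    target_on_demand_capacity = 3

    instance_type_configs {
      instance_type     = "m5.4xlarge"
      weighted_capacity = 1
    }
  }
}

resource "aws_emr_instance_fleet" "task" {
  cluster_id           = aws_emr_cluster.fleet.id
  name                 = "task-fleet"
  target_spot_capacity = 4 # illustrative

  # Several compatible types give EMR more Spot pools to draw from.
  instance_type_configs {
    instance_type                              = "m5.4xlarge"
    weighted_capacity                          = 1
    bid_price_as_percentage_of_on_demand_price = 100
  }

  instance_type_configs {
    instance_type                              = "r5.4xlarge"
    weighted_capacity                          = 1
    bid_price_as_percentage_of_on_demand_price = 100
  }

  launch_specifications {
    spot_specification {
      allocation_strategy      = "capacity-optimized"
      timeout_action           = "SWITCH_TO_ON_DEMAND" # fall back if Spot is unavailable
      timeout_duration_minutes = 10
    }
  }
}
```

For estimation, a fleet's cost is a range, not a single figure, because the instance type and the Spot/on-demand mix are decided at launch.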

FAQ

Is EMR Serverless cheaper than EMR?

For sporadic workloads, yes. For sustained workloads, no. EMR Serverless's per-second billing is great when jobs run minutes per day. Traditional EMR is cheaper when clusters are utilized 50%+ of the time. Rule of thumb: under 4 hours/day of actual job time → Serverless. Above → traditional.

How does AWS Glue compare to EMR?

AWS Glue is also managed Spark but ETL-specific. Glue bills $0.44/DPU-hour (1 DPU = 4 vCPU + 16 GB). Glue is more expensive per-resource but eliminates more operational overhead. For simple ETL jobs, Glue. For complex/long-running Spark, EMR.

What's the difference between EMR and Athena?

Athena is serverless SQL on S3 ($5/TB scanned). EMR is full Spark/Hive/Presto with state, complex workflows, and large-scale processing. Athena for ad-hoc queries; EMR for production data pipelines. Both can query the same S3 data.

Are Spot interruptions disruptive to EMR jobs?

Less than they used to be. EMR handles task node Spot interruptions gracefully: failed tasks are rescheduled on the remaining nodes. Master and core nodes shouldn't be on Spot (an interruption there can kill the cluster). With Spot only on task nodes, jobs typically complete despite interruptions, just slightly slower.