`aws_glue_job` cost estimation

A managed ETL job using Spark or Python shell. Priced per DPU-hour with a 1-minute billing minimum, plus development endpoint and Data Catalog costs.

An aws_glue_job is a serverless ETL job that runs Spark, Python shell, or Ray workloads. The job itself has no fixed cost; you pay only when it runs.

Pricing depends on the worker type:

Spark jobs (G.1X, G.2X, G.4X, G.8X, and G.025X for streaming): billed per DPU-hour. A DPU (Data Processing Unit) is 4 vCPUs and 16 GB of memory. G.025X = 0.25 DPU; G.1X = 1 DPU; G.2X = 2 DPUs; G.4X = 4 DPUs; G.8X = 8 DPUs. The rate is $0.44/DPU-hour, billed per second with a 1-minute minimum.

Python shell jobs: billed at the same $0.44/DPU-hour rate, but each job uses only 0.0625 or 1 DPU, so a run costs far less. A 1-minute Python shell job at 0.0625 DPU costs about $0.0005, which is negligible.

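Ray jobs (Z.2X): billed per M-DPU-hour. An M-DPU is a memory-optimized DPU (4 vCPUs and 32 GB of memory), and a Z.2X worker is 2 M-DPUs. Suited to distributed Python workloads. G.025X is a Spark streaming worker type, not a Ray one.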

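A typical Spark job with 5 G.1X workers running for 30 minutes costs: 5 DPUs × 0.5 hours × $0.44/DPU-hour = $1.10 per run. Hourly schedules add up: one 30-minute run every hour is 720 runs in a 30-day month, or $792/month. The same 5 workers running continuously would cost twice that ($1,584/month), and a daily batch far less (about $33/month).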
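The arithmetic above can be sketched in a few lines of Python. This is an illustrative calculator, not part of c3x: the function name and constants are hypothetical, the $0.44 rate is the one quoted on this page, and the 0.25 DPU figure for G.025X is an assumption added for completeness.

```python
# DPUs per worker for the Spark worker types named above.
# G.025X = 0.25 DPU is an assumption added here for streaming jobs.
DPUS_PER_WORKER = {"G.025X": 0.25, "G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}
RATE_PER_DPU_HOUR = 0.44  # USD, the rate quoted on this page
MIN_BILLED_SECONDS = 60   # 1-minute minimum per run


def spark_run_cost(worker_type: str, number_of_workers: int, run_seconds: float) -> float:
    """Estimate the USD cost of a single Spark job run."""
    # Runs shorter than the minimum are billed as one full minute.
    billed_seconds = max(run_seconds, MIN_BILLED_SECONDS)
    dpus = DPUS_PER_WORKER[worker_type] * number_of_workers
    return dpus * (billed_seconds / 3600) * RATE_PER_DPU_HOUR


per_run = spark_run_cost("G.1X", 5, 30 * 60)
print(f"per run: ${per_run:.2f}")
print(f"one run per hour, 30-day month: ${per_run * 24 * 30:.2f}")
```

Changing the schedule multiplier (24 × 30 here) is usually the quickest way to see how run frequency drives the monthly bill.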

Other Glue components have their own costs:

Glue Data Catalog: first 1M stored objects free, then $1 per 100K objects per month. Requests are free for the first 1M per month, then $1 per million requests.

Glue Crawlers: same DPU-hour rate as jobs ($0.44/DPU-hour), billed per second with a 10-minute minimum per crawler run.

Glue Studio: free as an editor; the jobs it generates bill normally.
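As a rough illustration of the Data Catalog storage tier described above, the sketch below charges only for objects beyond the first million. The function name is hypothetical, and it assumes the overage is prorated rather than rounded up to whole 100K blocks, which may not match the actual bill.

```python
FREE_OBJECTS = 1_000_000   # free tier per account, as stated above
PRICE_PER_100K = 1.00      # USD per 100K objects per month beyond the free tier


def catalog_storage_cost(stored_objects: int) -> float:
    """Estimate the monthly USD storage cost of the Data Catalog."""
    # Assumption: overage is prorated, not rounded up to 100K blocks.
    billable = max(stored_objects - FREE_OBJECTS, 0)
    return billable / 100_000 * PRICE_PER_100K


for count in (800_000, 1_000_000, 2_500_000):
    print(f"{count:>9,} objects: ${catalog_storage_cost(count):.2f}/month")
```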

c3x estimates Glue jobs only with usage data: monthly_run_count and average_run_duration_minutes in c3x-usage.yml.
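To make the role of the usage data concrete, here is a minimal sketch of how a monthly figure could be derived from those two fields. It is not c3x's implementation: the function is hypothetical, the dictionary keys simply mirror the field names mentioned above, and the real schema of c3x-usage.yml should be checked against the tool's own documentation.

```python
from typing import Optional

# DPUs per worker; values for G.1X through G.8X come from the text above.
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}
RATE_PER_DPU_HOUR = 0.44  # USD, the rate quoted on this page


def monthly_job_cost(worker_type: str, number_of_workers: int,
                     usage: Optional[dict]) -> float:
    """Estimate the monthly USD cost of a Spark job from usage data."""
    if not usage:
        return 0.0  # no usage data: nothing to multiply, so the estimate is $0
    runs = usage.get("monthly_run_count", 0)
    # Apply the 1-minute billing minimum to the average run duration.
    minutes = max(usage.get("average_run_duration_minutes", 0), 1)
    dpus = DPUS_PER_WORKER[worker_type] * number_of_workers
    return runs * dpus * (minutes / 60) * RATE_PER_DPU_HOUR


usage = {"monthly_run_count": 30, "average_run_duration_minutes": 30}
print(f"${monthly_job_cost('G.1X', 5, usage):.2f}/month")
print(f"${monthly_job_cost('G.1X', 5, None):.2f}/month without usage data")
```

With 5 G.1X workers and one 30-minute run per day, the estimate comes to 30 runs at $1.10 each.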

Terraform example

A minimal but realistic configuration that c3x can estimate.

resource "aws_glue_job" "etl" {
  name              = "daily-etl"
  role_arn          = aws_iam_role.glue.arn
  number_of_workers = 5
  worker_type       = "G.1X"
  glue_version      = "4.0"

  command {
    script_location = "s3://my-bucket/glue/daily-etl.py"
    python_version  = "3"
  }

  default_arguments = {
    "--job-bookmark-option" = "job-bookmark-enable"
    "--enable-metrics"      = "true"
  }

  timeout     = 120  # minutes
  max_retries = 1
}

Pricing dimensions

What you actually pay for when you provision aws_glue_job.

Dimension	Unit	What's being charged	Rate
Spark job DPU-hours (G.1X, G.2X, G.4X, G.8X)	per DPU-hour	Each DPU is 4 vCPU + 16 GB memory. Billed per second with a 1-minute minimum per run.	$0.44/DPU-hour
Python shell jobs	per DPU-hour	Same rate as Spark but uses smaller DPU sizes (0.0625 or 1 DPU). Cheaper for lightweight ETL.	$0.44/DPU-hour
Ray jobs (Z.2X)	per M-DPU-hour	Memory-optimized workers for Ray-based distributed Python.	$0.44/M-DPU-hour
Glue Crawlers	per DPU-hour	Crawler runs use the same DPU rate as jobs, with a 10-minute minimum per run.	$0.44/DPU-hour
Data Catalog	per 100K objects per month	First 1M stored objects free at the account level. Requests: first 1M per month free, then $1.00 per million.	$1.00/100K objects beyond free tier

Optimization tips

Common ways to reduce aws_glue_job cost without changing the workload.

Right-size DPU count and worker type

Linear with worker count

Many Glue jobs run with 10 G.1X workers when 3 would suffice. Use the Glue job metrics tab to find the actual DPU utilization, then reduce number_of_workers or move to a smaller worker type to match.

Use job bookmarks for incremental processing

Workload-dependent

Job bookmarks track processed data so subsequent runs only process new data. On large datasets where each run adds only a small increment, this can cut run duration by 90% or more.

Use Python shell jobs for small ETL

Significant on small jobs

Spark jobs require a minimum of 2 workers. For lightweight transformations on small data, a Python shell job at 0.0625 DPU is far cheaper than Spark.
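A quick comparison shows the size of the gap. The sketch below uses a hypothetical helper and assumes both job types take the same 5 minutes, which is a simplification because Spark startup overhead and processing speed differ in practice.

```python
RATE_PER_DPU_HOUR = 0.44  # USD, the rate quoted on this page


def run_cost(dpus: float, minutes: float) -> float:
    """Estimate the USD cost of one run, applying the 1-minute minimum."""
    billed_minutes = max(minutes, 1)
    return dpus * (billed_minutes / 60) * RATE_PER_DPU_HOUR


# Smallest Spark job: 2 G.1X workers at 1 DPU each.
spark_minimum = run_cost(2 * 1, 5)
# Smallest Python shell job: 0.0625 DPU.
python_shell = run_cost(0.0625, 5)

print(f"Spark (2 x G.1X): ${spark_minimum:.4f} per run")
print(f"Python shell:     ${python_shell:.4f} per run")
print(f"ratio:            {spark_minimum / python_shell:.0f}x")
```

Under these assumptions the ratio is simply the DPU ratio (2 / 0.0625), so the saving holds at any run length above the billing minimum.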

Schedule jobs at off-peak times if possible

Indirect on downstream resources

Glue itself doesn't have spot pricing or time-of-day discounts, but downstream resources (S3, EMR) might. Coordinating job schedules with off-peak times can reduce the total bill.

Use Glue 4.0+ for performance improvements

10-30% on job runtime

Newer Glue versions run jobs faster than older versions on the same DPU count. Migrating jobs from Glue 3.0 to 4.0 typically cuts duration by 10-30%.

FAQ

Why are my Glue jobs so expensive?

Three common causes: over-provisioning DPU count (using 20 workers when 5 would do), running too frequently (hourly when daily would suffice), and processing the full dataset each run (no bookmarks). Address each separately.

How does c3x estimate Glue job cost?

From the job's worker_type and number_of_workers (DPU count) plus expected monthly_runs and average_run_duration_minutes from c3x-usage.yml. Without usage data, the job shows $0.

What's the difference between Glue and EMR?

Glue is fully managed and serverless, with lower operational overhead but a higher cost per unit of compute than an equivalent EMR cluster. EMR gives more control over Spark configuration and is cheaper for the same compute, but requires cluster management. Pick Glue for ETL where simplicity matters; pick EMR for analytics where performance tuning matters.

Is Glue Data Catalog free?

Mostly. The first 1M stored objects per account are free; beyond that, storage costs $1 per 100K objects per month. Requests (GetTable, GetPartitions, etc.) are free for the first 1M per month, then $1 per million. Most accounts stay within the free tier.