google_dataproc_cluster cost estimation
Managed Spark, Hadoop, Flink, and Presto. Bills underlying Compute Engine VMs + $0.010/vCPU-hour Dataproc fee. Dataproc Serverless eliminates cluster management. Dataproc on GKE runs Spark on existing GKE.
Cloud Dataproc is GCP's managed Spark/Hadoop platform. The google_dataproc_cluster resource creates a cluster on Compute Engine VMs. Pricing has two components: GCE VM costs + a Dataproc fee.
Dataproc fee: $0.010 per vCPU-hour on top of GCE pricing, applied to all VMs in the cluster (master + workers). This is modest compared to AWS EMR's service fee of $0.063-$0.27 per instance-hour.
A typical cluster: 1 master n2-standard-4 (4 vCPU) + 5 workers n2-standard-8 (8 vCPU each = 40 vCPU), running 4 hours:
- GCE: ~$0.20/hour × 1 + $0.40/hour × 5 = $2.20/hour
- Dataproc fee: (4 + 40) = 44 vCPU × $0.010 = $0.44/hour
- Total per hour: $2.64
- 4-hour run: $10.56
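The arithmetic above can be reproduced with a short Python sketch. The hourly VM rates ($0.20 and $0.40) are the approximate figures from the example, not live prices, and the function name `cluster_hourly_cost` is made up for illustration.

```python
DATAPROC_FEE_PER_VCPU_HOUR = 0.010  # Dataproc fee quoted in this article


def cluster_hourly_cost(nodes):
    """Return (gce_cost, dataproc_fee) per hour.

    `nodes` is a list of (count, vcpus_per_vm, vm_rate_per_hour) tuples.
    """
    gce = sum(count * rate for count, _vcpus, rate in nodes)
    vcpus = sum(count * vcpus for count, vcpus, _rate in nodes)
    return gce, vcpus * DATAPROC_FEE_PER_VCPU_HOUR


nodes = [
    (1, 4, 0.20),  # master: n2-standard-4, approximate rate from the example
    (5, 8, 0.40),  # workers: n2-standard-8, approximate rate from the example
]
gce, fee = cluster_hourly_cost(nodes)
hourly_total = gce + fee
run_total = hourly_total * 4  # 4-hour run
print(f"GCE ${gce:.2f}/h + fee ${fee:.2f}/h = ${hourly_total:.2f}/h; 4-hour run ${run_total:.2f}")
```

Changing the tuples in `nodes` is enough to re-run the estimate for a different cluster shape.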
Dataproc Serverless for Spark (newer):
- $0.060/vCPU-hour for compute
- $0.0078/GB-hour for memory
- No cluster management; auto-scales
- Right for sporadic batch jobs
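As a rough illustration of how the two Serverless rates combine, the sketch below prices a single batch job. The rates are the ones listed above; the job size (8 vCPU, 32 GB, 2 hours) and the function name are hypothetical.

```python
SERVERLESS_VCPU_RATE = 0.060  # $/vCPU-hour, from the rates listed above
SERVERLESS_MEM_RATE = 0.0078  # $/GB-hour, from the rates listed above


def serverless_job_cost(vcpus, memory_gb, hours):
    """Cost of one batch job that holds `vcpus` and `memory_gb` for `hours`."""
    hourly = vcpus * SERVERLESS_VCPU_RATE + memory_gb * SERVERLESS_MEM_RATE
    return hourly * hours


# Hypothetical job: 8 vCPU, 32 GB of memory, 2 hours of runtime.
cost = serverless_job_cost(vcpus=8, memory_gb=32, hours=2)
print(f"${cost:.2f}")
```

The sketch assumes constant resource usage for the whole run; a job that auto-scales would need the vCPU-hours and GB-hours summed per interval instead.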
Dataproc on GKE: runs Spark inside existing GKE clusters. Eliminates dedicated Dataproc VMs; reuses GKE capacity. Still has the Dataproc fee ($0.010/vCPU-hour) but no separate cluster lifecycle.
Spot (Preemptible) VMs for workers: 60-91% cheaper than on-demand. Spark handles preemption gracefully. Use for task-tier workers; master should stay on-demand.
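To see what the 60-91% range means for a worker tier, here is a minimal sketch. The $0.40/hour on-demand rate and the count of 5 workers are assumed example values, and `spot_hourly_range` is a hypothetical helper.

```python
def spot_hourly_range(on_demand_rate, count, min_discount=0.60, max_discount=0.91):
    """Return (lowest, highest) hourly VM cost for `count` Spot workers."""
    full = on_demand_rate * count
    return full * (1 - max_discount), full * (1 - min_discount)


# Hypothetical worker tier: 5 VMs at an assumed $0.40/hour on-demand rate.
low, high = spot_hourly_range(on_demand_rate=0.40, count=5)
print(f"on-demand ${0.40 * 5:.2f}/h, Spot ${low:.2f}-${high:.2f}/h")
```

The discount applies to the VM price only; the per-vCPU Dataproc fee described earlier is still charged on every VM in the cluster.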
C3X estimates Dataproc cost from master_config and worker_config (machine_type, num_instances) plus the Dataproc fee.
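As a rough illustration of that kind of estimate (this is not C3X's actual implementation), the sketch below derives the Dataproc fee from a dictionary that mirrors master_config and worker_config. It assumes the vCPU count is the trailing number in the machine type name, which holds for standard families such as n2-standard-8 but not for every machine type (shared-core and custom types, for example). The helper names are hypothetical.

```python
DATAPROC_FEE_PER_VCPU_HOUR = 0.010  # fee quoted in this article


def vcpus_from_machine_type(machine_type):
    """Read the vCPU count from names like 'n2-standard-8' (assumed naming)."""
    return int(machine_type.rsplit("-", 1)[1])


def dataproc_fee_per_hour(cluster_config):
    """Sum vCPUs over every node group and apply the per-vCPU fee."""
    total_vcpus = sum(
        group["num_instances"] * vcpus_from_machine_type(group["machine_type"])
        for group in cluster_config.values()
    )
    return total_vcpus, total_vcpus * DATAPROC_FEE_PER_VCPU_HOUR


config = {
    "master_config": {"num_instances": 1, "machine_type": "n2-standard-4"},
    "worker_config": {"num_instances": 3, "machine_type": "n2-standard-8"},
}
vcpus, fee = dataproc_fee_per_hour(config)
print(f"{vcpus} vCPU -> ${fee:.2f}/hour Dataproc fee")
```

A production estimator would look vCPU counts up in a machine-type catalog rather than parsing the name.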
Terraform example
A minimal but realistic configuration that C3X can estimate.
resource "google_dataproc_cluster" "spark" {
  name   = "spark-cluster"
  region = "us-central1"

  cluster_config {
    master_config {
      num_instances = 1
      machine_type  = "n2-standard-4"
    }

    worker_config {
      num_instances = 3
      machine_type  = "n2-standard-8"
    }

    preemptible_worker_config {
      num_instances  = 5      # Secondary workers
      preemptibility = "SPOT" # Default is PREEMPTIBLE; set SPOT for Spot VMs
    }

    software_config {
      image_version = "2.2-debian12"
      override_properties = {
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
    }

    lifecycle_config {
      idle_delete_ttl = "3600s" # Auto-delete after 1 hour idle
    }
  }
}
Pricing dimensions
What you actually pay for when you provision google_dataproc_cluster.
| Dimension | Unit | What's being charged | Typical rate |
|---|---|---|---|
| GCE VM cost | per VM-hour | Standard Compute Engine pricing for cluster VMs. Sustained Use Discounts apply. | $0.10-$5/hour depending on machine type |
| Dataproc fee | per vCPU-hour | Additional Dataproc service fee on top of GCE. Applies to all VMs in cluster. | $0.010/vCPU-hour |
| Dataproc Serverless compute | per vCPU-hour | Serverless Spark; bills per actual usage. No cluster management. | $0.060/vCPU-hour |
| Dataproc Serverless memory | per GB-hour | Memory portion of Serverless billing. | $0.0078/GB-hour |
| Persistent disk | per GB-month | Disks attached to cluster VMs. SSD-PD by default. | $0.17/GB-month SSD |
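The cluster-based dimensions in the table can be combined into monthly line items, as in the sketch below. The 730-hour month, the $2.20/hour VM total, the 44 vCPU count, and the 600 GB of disk are all assumed example inputs, and Sustained Use Discounts are ignored for simplicity.

```python
HOURS_PER_MONTH = 730  # common billing approximation (assumption)


def monthly_line_items(vm_rate_per_hour, vcpus, disk_gb,
                       fee_per_vcpu_hour=0.010, disk_rate_per_gb_month=0.17):
    """Monthly cost per pricing dimension for one always-on cluster."""
    return {
        "gce_vm": vm_rate_per_hour * HOURS_PER_MONTH,
        "dataproc_fee": vcpus * fee_per_vcpu_hour * HOURS_PER_MONTH,
        "persistent_disk": disk_gb * disk_rate_per_gb_month,
    }


# Hypothetical cluster: $2.20/hour of VMs, 44 vCPU, 600 GB of SSD persistent disk.
items = monthly_line_items(vm_rate_per_hour=2.20, vcpus=44, disk_gb=600)
for name, cost in items.items():
    print(f"{name:16s} ${cost:,.2f}")
total = sum(items.values())
print(f"{'total':16s} ${total:,.2f}")
```

Keeping each dimension as its own entry makes it easy to see which one dominates; with these inputs the VM cost is by far the largest item.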
Optimization tips
Common ways to reduce google_dataproc_cluster cost without changing the workload.
Use Preemptible workers
Estimated savings: 60-91% on worker capacity
Spot/Preemptible workers are 60-91% cheaper than on-demand. Spark tolerates preemption: failed tasks are rescheduled. Configure preemptible_worker_config; keep master on-demand for stability.
Set idle_delete_ttl
Estimated savings: variable (avoids waste)
Idle clusters bill GCE + Dataproc fees continuously. Setting lifecycle_config.idle_delete_ttl = "3600s" auto-deletes the cluster after 1 hour idle. Critical for ephemeral workloads to avoid forgotten clusters.
Use Dataproc Serverless for sporadic jobs
Estimated savings: 70-90% on sporadic workloads
Jobs that run only occasionally pay for cluster idle time in traditional Dataproc. Serverless bills per actual vCPU/memory-second. Right for nightly ETL, ad-hoc queries, on-demand reports.
Use ephemeral clusters
Estimated savings: variable, based on utilization
Start a cluster for a job and delete it when done. Data lives in persistent GCS buckets, so the cluster is purely compute. Dramatically cheaper than maintaining long-lived clusters for episodic workloads.
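A minimal sketch of the comparison, assuming a $2.64/hour cluster, one 4-hour job per day for 30 days, and a 730-hour month. All of these are example inputs, and cluster startup time is ignored.

```python
HOURS_PER_MONTH = 730  # common billing approximation (assumption)


def monthly_cost_always_on(hourly_rate):
    """Cluster that is never deleted bills for every hour of the month."""
    return hourly_rate * HOURS_PER_MONTH


def monthly_cost_ephemeral(hourly_rate, hours_per_run, runs_per_month):
    """Cluster created per job bills only while the job runs."""
    return hourly_rate * hours_per_run * runs_per_month


# Hypothetical workload: $2.64/hour cluster, one 4-hour job per day, 30 days.
always_on = monthly_cost_always_on(2.64)
ephemeral = monthly_cost_ephemeral(2.64, hours_per_run=4, runs_per_month=30)
savings_pct = (1 - ephemeral / always_on) * 100
print(f"always-on ${always_on:,.2f}, ephemeral ${ephemeral:,.2f}, saving {savings_pct:.0f}%")
```

The saving depends only on utilization (here 120 of 730 hours), so it shrinks as the job fills more of the month.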
Use Dataproc on GKE to reuse capacity
If you already run GKE, Dataproc on GKE runs Spark in existing clusters. Eliminates dedicated Dataproc VMs. Better utilization of cluster capacity.
FAQ
How does Dataproc fee compare to EMR fee?
Dataproc is usually cheaper. Its fee is a flat $0.010/vCPU-hour regardless of instance type, while the EMR fee varies from $0.063 to $0.27 per instance-hour (roughly $0.005-$0.034/vCPU-hour for similar instances, and often above Dataproc's rate). For pure managed Spark, Dataproc tends to be more cost-effective than EMR.
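Because the two fees use different units, a per-instance fee has to be divided by the instance's vCPU count before comparing. The sketch below does that for one hypothetical case: an 8-vCPU instance charged the upper-bound $0.27 per instance-hour. The vCPU count is an assumption, not a figure from a price list.

```python
DATAPROC_FEE_PER_VCPU_HOUR = 0.010  # flat Dataproc fee quoted in this article


def fee_per_vcpu_hour(instance_hour_fee, vcpus):
    """Convert a per-instance-hour service fee to a per-vCPU-hour fee."""
    return instance_hour_fee / vcpus


# Hypothetical: 8-vCPU instance at the upper-bound $0.27/instance-hour fee.
emr_equivalent = fee_per_vcpu_hour(0.27, 8)
ratio = emr_equivalent / DATAPROC_FEE_PER_VCPU_HOUR
print(f"${emr_equivalent:.4f}/vCPU-hour, {ratio:.1f}x the Dataproc fee")
```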
Should I use Dataproc Serverless or traditional clusters?
Use Serverless for sporadic or unpredictable workloads and traditional clusters for steady, high-volume processing. The break-even point is roughly 4-6 hours/day of sustained processing: above that, a traditional cluster is cheaper; below it, Serverless saves money because there is no idle cost.
How do Preemptible workers behave under load?
GCP can reclaim Preemptible VMs with 30 seconds' notice. Spark reschedules the tasks that were in flight on a reclaimed node onto the remaining workers. For short jobs (under 24 hours), preemption rate is usually low (under 5%). For long jobs, occasional preemption is normal but rarely fatal.
What about Dataflow vs Dataproc?
Dataflow is GCP's serverless Apache Beam runner — pure serverless, no cluster concept. Dataproc is managed Spark/Hadoop with cluster awareness. For new pipelines, Dataflow is often simpler. For migrating existing Spark code, Dataproc is more compatible.
Estimate this resource in your own Terraform
Free, open source, no API key. C3X parses your Terraform and shows line-item cost for every resource, including google_dataproc_cluster.