google_dataflow_job cost estimation
A managed Apache Beam job for streaming and batch data processing. Priced per worker-hour by machine type and tier, plus data processed.
A google_dataflow_job is a serverless Apache Beam job for streaming or batch data processing. Pricing has several layers.
Worker compute: each worker is a Compute Engine VM, billed per second. The default machine type is n1-standard-1, but workloads typically use n1-standard-4 or higher. A typical 5-worker job at n1-standard-4 costs ~$0.95/hour ($693/month if running continuously).
Streaming Engine: separate billing layer ($0.018/SCU-hour) that offloads streaming compute from the workers to a managed service. Reduces worker count for streaming jobs but adds a per-SCU charge.
Dataflow Prime: enables Streaming Engine, vertical autoscaling, and runtime upgrades. Bills per Data Compute Unit (DCU) instead of separate worker and storage charges. A good fit for modern streaming workloads.
Persistent Disk: streaming jobs typically need PD storage attached to each worker for state. Billed at standard PD rates ($0.04/GB-month for pd-standard, $0.10/GB-month for pd-balanced).
Data shuffle: $0.011/GB shuffled in batch jobs. Shuffle-intensive operations (joins, group-by) can be a significant cost.
Streaming Data Processed: $0.018/GB ingested in streaming jobs using Streaming Engine.
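To see how these layers add up, the arithmetic can be sketched as Terraform locals and evaluated with `terraform console` or `terraform output`. This is a back-of-the-envelope check only, not how C3X computes its estimate. The rates are the ones quoted above; the worker count, disk size per worker, and monthly streaming volume are illustrative assumptions.

```hcl
# Rough monthly estimate for a continuously running streaming job.
# Rates come from the list above; all usage figures are assumptions.
locals {
  workers         = 5    # n1-standard-4 workers running 24/7
  worker_rate_usd = 0.19 # per worker-hour, us-central1
  hours_per_month = 730

  disk_gb_per_worker = 30   # assumed PD size attached to each worker
  pd_rate_usd        = 0.10 # pd-balanced, per GB-month

  streaming_gb       = 1000  # assumed GB ingested per month
  streaming_rate_usd = 0.018 # per GB, with Streaming Engine

  compute_usd   = local.workers * local.worker_rate_usd * local.hours_per_month
  disk_usd      = local.workers * local.disk_gb_per_worker * local.pd_rate_usd
  streaming_usd = local.streaming_gb * local.streaming_rate_usd
}

output "estimated_monthly_usd" {
  value = local.compute_usd + local.disk_usd + local.streaming_usd
}
```

The compute term alone works out to about $693/month, the same 5-worker figure quoted for worker compute. With these assumed usage figures, disk and streaming data add comparatively little.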
Common cost drivers:
24/7 streaming jobs: workers run continuously. Right-size machine type and minimize worker count.
Backfill batch jobs: short-lived but can scale to hundreds of workers. Cost is bounded by total data volume + processing time.
Stateful operations: windowed aggregations, sessions, and deduplication all require PD storage that accumulates over time.
c3x estimates Dataflow jobs from declared parameters such as max_workers and machine_type. Variable costs (data shuffled and streaming data processed) come from c3x-usage.yml.
Terraform example
A minimal but realistic configuration that C3X can estimate.
```hcl
resource "google_dataflow_job" "etl" {
  name              = "daily-etl"
  template_gcs_path = "gs://dataflow-templates/latest/Word_Count"
  temp_gcs_location = "gs://my-bucket/dataflow-temp"

  parameters = {
    inputFile = "gs://my-bucket/input.txt"
    output    = "gs://my-bucket/output"
  }

  max_workers = 10

  labels = {
    environment = "production"
  }
}
```
Pricing dimensions
What you actually pay for when you provision google_dataflow_job.
| Dimension | Unit | What's being charged |
|---|---|---|
| Worker compute (n1-standard-4) | per worker per hour | Compute Engine VMs running the job. Billed per second. $0.19/hour for n1-standard-4 in us-central1 |
| Streaming Engine | per SCU-hour | Optional. Offloads streaming compute from workers. Right for streaming jobs. $0.018/SCU-hour |
| Persistent Disk for workers | per GB-month | PD attached to each worker for state. pd-balanced default. $0.10/GB-month |
| Data shuffle (batch) | per GB | Data shuffled between workers in batch jobs. $0.011/GB |
| Streaming data processed | per GB | Bytes ingested by streaming jobs using Streaming Engine. $0.018/GB |
| Dataflow Prime (Data Compute Units) | per DCU-hour | Modern bundled billing covering compute, shuffle, and PD as one rate. |
Optimization tips
Common ways to reduce google_dataflow_job cost without changing the workload.
Right-size worker machine type
Workload-dependent
The default n1-standard-1 is rarely the right choice. Most ETL workloads run faster on n1-standard-4, which gives each worker more CPU. For memory-bound workloads, use n1-highmem variants.
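As a minimal sketch, the worker size can be pinned with the `machine_type` argument of `google_dataflow_job`. The job below reuses the Word_Count example from this page; the bucket paths are placeholders, and n1-standard-4 is an illustrative starting point rather than a recommendation for every workload.

```hcl
resource "google_dataflow_job" "etl" {
  name              = "daily-etl"
  template_gcs_path = "gs://dataflow-templates/latest/Word_Count"
  temp_gcs_location = "gs://my-bucket/dataflow-temp"

  parameters = {
    inputFile = "gs://my-bucket/input.txt"
    output    = "gs://my-bucket/output"
  }

  # Explicit worker size instead of the n1-standard-1 default.
  # For memory-bound pipelines, try an n1-highmem type here instead.
  machine_type = "n1-standard-4"
  max_workers  = 10
}
```

Declaring `machine_type` explicitly also makes the worker size visible to any tool that reads the Terraform, rather than leaving it to the default.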
Use Streaming Engine for long-running streaming
Workload-dependent
Streaming Engine reduces the number of workers needed by offloading streaming compute. The trade-off is a per-SCU fee, but for 24/7 streaming jobs it is typically cheaper than worker-only billing.
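In Terraform, Streaming Engine is switched on with the `enable_streaming_engine` argument. The sketch below assumes the Google-provided Pub/Sub to BigQuery classic template and its `inputTopic` and `outputTableSpec` parameters. The project, topic, table, and bucket names are hypothetical placeholders, and the template path should be checked against the template documentation before use.

```hcl
# Hypothetical 24/7 streaming job; all resource names are placeholders.
resource "google_dataflow_job" "events" {
  name              = "events-to-bq"
  template_gcs_path = "gs://dataflow-templates/latest/PubSub_to_BigQuery"
  temp_gcs_location = "gs://my-bucket/dataflow-temp"

  parameters = {
    inputTopic      = "projects/my-project/topics/events"
    outputTableSpec = "my-project:analytics.events"
  }

  # Offload streaming state and shuffle from the workers to the managed service.
  enable_streaming_engine = true

  machine_type = "n1-standard-4"
  max_workers  = 5

  # Drain in-flight data instead of cancelling when the resource is destroyed.
  on_delete = "drain"
}
```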
Set max_workers to cap accidental scaling
Avoid runaway scaling
Without max_workers, autoscaling can spin up hundreds of workers during traffic spikes. Setting max_workers = 10 (or whatever is appropriate for the workload) caps the worst-case bill.
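The cap translates directly into a worst-case compute figure. A rough sketch using Terraform locals, assuming n1-standard-4 workers at the $0.19/hour us-central1 rate from the pricing table and every worker running for a full 730-hour month:

```hcl
# Worst-case worker compute if autoscaling stays pinned at max_workers.
# Covers compute only; disk, shuffle, and streaming charges are extra.
locals {
  max_workers     = 10
  worker_rate_usd = 0.19 # n1-standard-4 per worker-hour, us-central1
  hours_per_month = 730

  worst_case_hourly_usd  = local.max_workers * local.worker_rate_usd
  worst_case_monthly_usd = local.worst_case_hourly_usd * local.hours_per_month
}

output "worst_case_monthly_usd" {
  value = local.worst_case_monthly_usd
}
```

With these inputs the ceiling is about $1.90/hour, or roughly $1,387/month. Without a cap, the same arithmetic has no upper bound.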
Use Flexible Resource Scheduling (FlexRS) for batch
~40%
FlexRS lets batch workloads that tolerate delays run on cheaper, preemptible Spot VMs, at roughly 40% less than guaranteed compute.
Skip Dataflow for simple transformations
Minimum job overhead
Every Dataflow job carries a minimum overhead. For one-shot CSV transformations or simple file processing, a Cloud Run job or a Cloud Function is much cheaper than even a brief Dataflow run.
FAQ
Streaming Engine: yes or no?
For streaming jobs running 24/7, yes. Streaming Engine reduces the worker count needed and the total cost is usually lower. For batch jobs, Streaming Engine doesn't apply.
How does c3x estimate Dataflow?
c3x estimates compute cost from the declared worker settings (max_workers, machine_type). Data shuffle and streaming data processed are usage-based; specify them in c3x-usage.yml for accurate estimates.
Dataflow or Spark on Dataproc?
Dataflow is fully serverless, with no cluster to manage. Dataproc gives you a managed Spark/Hadoop cluster that you size manually. Dataflow is simpler operationally but typically costs 1.5-2x as much per unit of compute as an equivalent Dataproc cluster. It suits teams without dedicated data engineering.
What about Dataflow Prime?
Prime bills in Data Compute Units (DCUs) instead of separate compute and storage charges, and it includes Streaming Engine and vertical autoscaling automatically. It often lowers total cost for streaming workloads, at the price of a bill that is harder to reason about.
Estimate this resource in your own Terraform
Free, open source, no API key. C3X parses your Terraform and shows line-item cost for every resource, including google_dataflow_job.