
`google_dataflow_job` cost estimation

A managed Apache Beam job for streaming and batch data processing. Priced per worker-hour by machine type and tier, plus data processed.

Dataflow is serverless, but a google_dataflow_job is billed in several layers.

Worker compute: each worker is a Compute Engine VM, billed per second. The default machine type is n1-standard-1, but workloads typically use n1-standard-4 or higher. At $0.19/hour per n1-standard-4 worker in us-central1, a 5-worker job costs about $0.95/hour (5 × $0.19), or roughly $693/month if it runs continuously (730 hours).

Streaming Engine: separate billing layer ($0.018/SCU-hour) that offloads streaming compute from the workers to a managed service. Reduces worker count for streaming jobs but adds a per-SCU charge.

Dataflow Prime: enables Streaming Engine, vertical autoscaling, and runtime upgrades. Bills per "Data Compute Unit" (DCU) instead of separate worker and storage charges. Suited to modern streaming workloads.

Persistent Disk: streaming jobs typically need PD storage attached to each worker for state. Billed at standard PD rates ($0.04/GB-month for pd-standard, $0.10/GB-month for pd-balanced).

Data shuffle: $0.011/GB shuffled in batch jobs. Shuffle-intensive operations (joins, group-by) can be a significant cost.

Streaming Data Processed: $0.018/GB ingested in streaming jobs using Streaming Engine.

Common cost drivers:

24/7 streaming jobs: workers run continuously. Right-size machine type and minimize worker count.

Backfill batch jobs: short-lived but can scale to hundreds of workers. Cost is bounded by total data volume and processing time.

Stateful operations: windowed aggregations, sessions, and deduplication all require PD storage that accumulates over time.

c3x estimates Dataflow jobs based on declared parameters. Variable cost (data processed) comes from c3x-usage.yml.

Terraform example

A minimal but realistic configuration that c3x can estimate.

resource "google_dataflow_job" "etl" {
  name              = "daily-etl"
  template_gcs_path = "gs://dataflow-templates/latest/Word_Count"
  temp_gcs_location = "gs://my-bucket/dataflow-temp"

  parameters = {
    inputFile = "gs://my-bucket/input.txt"
    output    = "gs://my-bucket/output"
  }

  # Autoscaling ceiling; one of the declared parameters c3x reads for the compute estimate.
  max_workers = 10

  labels = {
    environment = "production"
  }
}

Pricing dimensions

What you actually pay for when you provision google_dataflow_job.

Dimension	Unit	What's being charged
Worker compute (n1-standard-4)	per worker per hour	Compute Engine VMs running the job. Billed per second. $0.19/hour for n1-standard-4 in us-central1
Streaming Engine	per SCU-hour	Optional. Offloads streaming compute from workers. Right for streaming jobs. $0.018/SCU-hour
Persistent Disk for workers	per GB-month	PD attached to each worker for state. pd-balanced default. $0.10/GB-month
Data shuffle (batch)	per GB	Data shuffled between workers in batch jobs. $0.011/GB
Streaming data processed	per GB	Bytes ingested by streaming jobs using Streaming Engine. $0.018/GB
Dataflow Prime (Data Compute Units)	per DCU-hour	Modern bundled billing covering compute, shuffle, and PD as one rate.

Optimization tips

Common ways to reduce google_dataflow_job cost without changing the workload.

Right-size worker machine type

Workload-dependent

Default n1-standard-1 is rarely the right choice. Most ETL workloads run faster on n1-standard-4 (more CPU per worker). For memory-bound workloads, use n1-highmem variants.
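A minimal sketch of pinning the worker machine type on the job. The resource name, job name, and bucket paths are hypothetical placeholders; `machine_type` and `max_workers` are arguments of `google_dataflow_job`.

```hcl
# Sketch only: names and gs:// paths are placeholders, not real resources.
resource "google_dataflow_job" "etl_sized" {
  name              = "daily-etl-sized"
  template_gcs_path = "gs://dataflow-templates/latest/Word_Count"
  temp_gcs_location = "gs://my-bucket/dataflow-temp"

  parameters = {
    inputFile = "gs://my-bucket/input.txt"
    output    = "gs://my-bucket/output"
  }

  # Overrides the n1-standard-1 default. For a memory-bound pipeline,
  # swap in an n1-highmem type (for example n1-highmem-4).
  machine_type = "n1-standard-4"
  max_workers  = 10
}
```

Declaring `machine_type` explicitly also gives c3x a concrete worker size to price, rather than leaving the estimate on the default.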

Use Streaming Engine for long-running streaming

Workload-dependent

Streaming Engine reduces worker count needed by offloading streaming compute. Trade-off: per-SCU fee, but typically cheaper for 24/7 streaming jobs than worker-only billing.
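A hedged sketch of opting a streaming job into Streaming Engine. The template path, parameters, project, topic, and table names are all hypothetical; `enable_streaming_engine` and `on_delete` are arguments of `google_dataflow_job`.

```hcl
# Sketch only: template path, parameters, and resource names are hypothetical.
resource "google_dataflow_job" "events_stream" {
  name              = "events-stream"
  template_gcs_path = "gs://my-bucket/templates/streaming-template"
  temp_gcs_location = "gs://my-bucket/dataflow-temp"

  # Offload streaming compute from the workers to the managed service.
  enable_streaming_engine = true

  # Parameter names depend on the template; these are illustrative.
  parameters = {
    inputTopic      = "projects/my-project/topics/events"
    outputTableSpec = "my-project:analytics.events"
  }

  machine_type = "n1-standard-4"
  max_workers  = 5

  # Streaming jobs run until stopped; drain lets in-flight data finish
  # when the resource is destroyed.
  on_delete = "drain"
}
```

Whether this ends up cheaper depends on how many workers the job can shed, so compare the estimate with and without the flag.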

Set max_workers to cap accidental scaling

Avoid runaway scaling

Without max_workers, autoscaling can spin up hundreds of workers during traffic spikes. Setting max_workers = 10 (or whatever's appropriate) caps the worst-case bill.
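One way to keep the cap from being raised by accident is to route it through a validated input variable. This is a sketch under assumptions: the variable name, resource name, and the ceiling of 20 are hypothetical choices for illustration.

```hcl
# Sketch only: the variable name and the 1-20 range are illustrative.
variable "dataflow_max_workers" {
  description = "Upper bound on Dataflow autoscaling for this job."
  type        = number
  default     = 10

  validation {
    condition     = var.dataflow_max_workers >= 1 && var.dataflow_max_workers <= 20
    error_message = "The dataflow_max_workers value must be between 1 and 20."
  }
}

resource "google_dataflow_job" "etl_capped" {
  name              = "daily-etl-capped"
  template_gcs_path = "gs://dataflow-templates/latest/Word_Count"
  temp_gcs_location = "gs://my-bucket/dataflow-temp"

  parameters = {
    inputFile = "gs://my-bucket/input.txt"
    output    = "gs://my-bucket/output"
  }

  # Worst-case worker count is bounded by the validated variable.
  max_workers = var.dataflow_max_workers
}
```

With this in place, `terraform plan` fails if someone passes a value outside the allowed range, so the worst-case bill cannot change without a reviewed edit to the validation rule.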

Use Flexible Resource Scheduling (FlexRS) for batch

40%

FlexRS runs batch jobs on a mix of preemptible (Spot) and regular VMs with delayed scheduling, so it fits batch workloads that tolerate delays. ~40% cheaper than guaranteed compute.

Skip Dataflow for simple transformations

Minimum job overhead

Every Dataflow job carries a minimum startup overhead. For one-shot CSV transformations or simple file processing, a Cloud Run job or a Cloud Function is much cheaper than even a brief Dataflow run.

FAQ

Streaming Engine: yes or no?

For streaming jobs running 24/7, yes. Streaming Engine reduces the worker count needed and the total cost is usually lower. For batch jobs, Streaming Engine doesn't apply.

How does c3x estimate Dataflow?

c3x estimates compute cost from the declared workers (max_workers, machine_type). Data shuffle and streaming data processed are usage-based; specify them in c3x-usage.yml for accurate estimates.

Dataflow or Spark on Dataproc?

Dataflow is fully serverless (no cluster to manage). Dataproc gives you a managed Spark/Hadoop cluster you size manually. Dataflow is simpler operationally but typically costs 1.5-2x as much as equivalent Dataproc compute. It suits teams without dedicated data engineering.

What about Dataflow Prime?

Prime uses Data Compute Units (DCUs) instead of separate compute + storage billing. Includes Streaming Engine and vertical autoscaling automatically. Often results in lower total cost for streaming workloads, though the bill is harder to reason about.
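A sketch of opting a template-based job into Prime. It assumes the `enable_prime` experiment is the opt-in mechanism for your template, which should be confirmed against current Dataflow documentation. The resource name, template path, and parameters are hypothetical; `additional_experiments` is an argument of `google_dataflow_job`.

```hcl
# Sketch only: template path, parameters, and names are hypothetical, and
# "enable_prime" is assumed to be the opt-in experiment for template jobs.
resource "google_dataflow_job" "events_prime" {
  name              = "events-prime"
  template_gcs_path = "gs://my-bucket/templates/streaming-template"
  temp_gcs_location = "gs://my-bucket/dataflow-temp"

  # Request Dataflow Prime (DCU-based billing) for this job.
  additional_experiments = ["enable_prime"]

  parameters = {
    inputTopic      = "projects/my-project/topics/events"
    outputTableSpec = "my-project:analytics.events"
  }

  max_workers = 5
  on_delete   = "drain"
}
```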