`azurerm_data_factory` cost estimation

Managed ETL/ELT and data integration. You pay for pipeline orchestration ($1 per 1,000 activity runs), data flow execution ($0.193-$0.34/vCore-hour depending on compute type), and Integration Runtime hours. Data movement is billed separately ($0.25/DIU-hour on the Azure-hosted IR).

Azure Data Factory (ADF) is Microsoft's managed ETL/ELT service. The azurerm_data_factory resource creates the factory itself; pipelines, data flows, and triggers are configured via separate resources. Pricing has several distinct components, and the way they combine often surprises teams.

Pipeline orchestration:
- $1 per 1,000 pipeline activity runs (Azure-hosted Integration Runtime)
- $1.50 per 1,000 runs (Self-hosted IR)
- Includes activities like Copy, Lookup, ForEach, Wait, Set Variable

Data movement (Copy activity):
- Azure-hosted IR: $0.25 per DIU-hour (Data Integration Unit), with a 4 DIU minimum per Copy activity
- Self-hosted IR: $0.10 per hour of activity execution

Data flow execution (visual ETL on a Spark backend):
- General Purpose: $0.193/vCore-hour
- Memory Optimized: $0.34/vCore-hour
- Compute Optimized: $0.193/vCore-hour
- Minimum 4 cores; a typical data flow runs on 8-32 cores for several minutes

A typical pipeline running hourly:
- 24 pipeline runs/day × 30 days = 720 runs/month
- Pipeline orchestration: 720 × $0.001 = $0.72/month
- Copy activity: 4 DIU × 720 runs × 0.1 hours (6 min avg) = 288 DIU-hours × $0.25 = $72/month
- Total: ~$73/month for the pipeline
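The arithmetic above can be reproduced as Terraform locals, for example to evaluate in terraform console or to expose as an output. This is only a rough sketch using the rates quoted in this section; the local and output names are illustrative and are not part of C3X or the azurerm provider.

```hcl
# Illustrative names only; rates are the ones quoted in this section.
locals {
  runs_per_month       = 24 * 30 # hourly schedule over a 30-day month
  orchestration_rate   = 1.00    # USD per 1,000 activity runs (Azure-hosted IR)
  diu_hour_rate        = 0.25    # USD per DIU-hour (Azure-hosted IR)
  diu_per_copy         = 4       # DIU used by each Copy activity
  copy_minutes_per_run = 6       # assumed average Copy duration

  orchestration_cost = local.runs_per_month / 1000 * local.orchestration_rate
  copy_diu_hours     = local.diu_per_copy * local.runs_per_month * local.copy_minutes_per_run / 60
  copy_cost          = local.copy_diu_hours * local.diu_hour_rate
}

output "estimated_monthly_cost_usd" {
  # Orchestration (about 0.72) plus Copy data movement (72)
  value = local.orchestration_cost + local.copy_cost
}
```

Changing copy_minutes_per_run or diu_per_copy shows quickly that data movement, not orchestration, dominates this estimate.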

Self-hosted Integration Runtime (for on-prem data) requires running a VM with the IR agent. The VM costs are billed separately from ADF.

C3X estimates Data Factory cost from the declared pipelines and, via c3x-usage.yml, the expected activity counts and Copy activity volume.

Terraform example

A minimal but realistic configuration that C3X can estimate.

resource "azurerm_data_factory" "main" {
  name                = "prod-data-factory"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment = "production"
  }
}

resource "azurerm_data_factory_integration_runtime_self_hosted" "onprem" {
  name            = "onprem-ir"
  data_factory_id = azurerm_data_factory.main.id
}

Pricing dimensions

What you actually pay for when you provision azurerm_data_factory.

Dimension	Unit	What's being charged
Pipeline activity runs	per 1,000 runs	Orchestration fee per pipeline activity execution. $1 per 1,000 runs (Azure IR)
Data movement (DIU)	per DIU-hour	Copy activity uses Data Integration Units. Min 4 DIU per Copy activity. $0.25/DIU-hour Azure-hosted
Data flow execution	per vCore-hour	Mapping data flows run on Spark. Min 4 cores. $0.193/vCore-hour General Purpose
Self-hosted IR	per hour per activity	Charged when activities run via self-hosted Integration Runtime. $0.10/hour per activity
External activity runs	per 1,000 runs	Activities that run on external services (Databricks, HDInsight, etc.). $0.25 per 1,000 runs

Optimization tips

Common ways to reduce azurerm_data_factory cost without changing the workload.

Tune Copy activity DIU count

Estimated savings: 20-50% on Copy cost

DIU defaults to Auto, in which case Azure picks the count based on data size. For small transfers (under 1 GB), Auto may over-provision. Set DIU=4 explicitly for small files to avoid paying for unused capacity. For large files, a higher DIU count gives faster transfers, and the shorter runtime offsets the higher hourly rate.
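As a hedged sketch, pinning the DIU count looks roughly like the pipeline below. It assumes the azurerm_data_factory.main resource from the Terraform example; the pipeline, activity, and dataset names are hypothetical, and the two datasets would need to exist in the factory already. The dataIntegrationUnits property belongs to the Copy activity's typeProperties.

```hcl
# Hypothetical pipeline and dataset names; the referenced datasets must
# already exist in the factory.
resource "azurerm_data_factory_pipeline" "copy_small_files" {
  name            = "copy-small-files"
  data_factory_id = azurerm_data_factory.main.id

  activities_json = jsonencode([
    {
      name    = "CopySmallFiles"
      type    = "Copy"
      inputs  = [{ referenceName = "source_csv", type = "DatasetReference" }]
      outputs = [{ referenceName = "sink_parquet", type = "DatasetReference" }]
      typeProperties = {
        source = { type = "DelimitedTextSource" }
        sink   = { type = "ParquetSink" }
        # Pin the DIU count instead of leaving it on Auto.
        dataIntegrationUnits = 4
      }
    }
  ])
}
```

The source and sink types here are placeholders for a CSV-to-Parquet copy; use the types that match your own datasets.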

Minimize ForEach iterations

Estimated savings: proportional to iteration reduction

Each ForEach iteration counts as at least one billable activity run. A ForEach over 10,000 files is therefore at least 10,000 runs, or $10 at $1 per 1,000 runs. Often you can process files in fewer iterations (for example, by iterating over directories instead of individual files) or replace the loop with a single Copy activity that uses a wildcard pattern.
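A single wildcard Copy activity in place of a per-file ForEach might be sketched as follows. This assumes Blob storage datasets and the azurerm_data_factory.main resource from the Terraform example; the pipeline, dataset, and folder names are hypothetical, and the storeSettings property names should be verified against the connector documentation for your source.

```hcl
# Hypothetical names and paths; one Copy activity run replaces one run per file.
resource "azurerm_data_factory_pipeline" "copy_landing_folder" {
  name            = "copy-landing-folder"
  data_factory_id = azurerm_data_factory.main.id

  activities_json = jsonencode([
    {
      name    = "CopyAllCsv"
      type    = "Copy"
      inputs  = [{ referenceName = "landing_csv", type = "DatasetReference" }]
      outputs = [{ referenceName = "curated_csv", type = "DatasetReference" }]
      typeProperties = {
        source = {
          type = "DelimitedTextSource"
          storeSettings = {
            type               = "AzureBlobStorageReadSettings"
            recursive          = true
            wildcardFolderPath = "landing"
            wildcardFileName   = "*.csv"
          }
        }
        sink = {
          type          = "DelimitedTextSink"
          storeSettings = { type = "AzureBlobStorageWriteSettings" }
          formatSettings = {
            type          = "DelimitedTextWriteSettings"
            fileExtension = ".csv"
          }
        }
      }
    }
  ])
}
```

This reduces the orchestration charge only; the DIU-hours for moving the data still apply.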

Run data flows back to back in a shared session

Estimated savings: variable, based on workload pattern

Data flows need a Spark cluster spin-up (1-5 minutes per session). Running multiple data flows in a single session spreads that spin-up overhead across them, so use the same integration runtime across consecutive data flows.
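One way to share a runtime is a dedicated Azure integration runtime with a time-to-live, which consecutive data flow activities can then reference. The sketch below assumes the factory and resource group from the Terraform example; the runtime name, core count, and TTL value are illustrative choices, not recommendations.

```hcl
# Illustrative values; tune core_count and time_to_live_min to the workload.
resource "azurerm_data_factory_integration_runtime_azure" "dataflow" {
  name            = "dataflow-ir"
  data_factory_id = azurerm_data_factory.main.id
  location        = azurerm_resource_group.main.location

  compute_type     = "General" # General Purpose compute
  core_count       = 8
  time_to_live_min = 10 # keep the cluster available between consecutive data flows
}
```

Check how idle time-to-live minutes are billed before choosing a long value, since a warm cluster is only worthwhile if the next data flow actually starts within the window.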

Use mapping data flows efficiently

Data flows are expensive: at $0.193/vCore-hour you pay for roughly 4 minutes of cluster spin-up on top of the actual run, across every core in the cluster. For simple transformations, Copy activity's built-in mapping is much cheaper. Reserve mapping data flows for complex transformations that genuinely need Spark.
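To make the gap concrete, here is a rough per-run comparison written as Terraform locals. The cluster size and run duration are assumptions chosen for illustration, and all names are hypothetical; only the rates and the 4-minute spin-up come from this section.

```hcl
# Assumed workload: 6 minutes of actual work per run for either option.
locals {
  cmp_vcore_hour_rate = 0.193 # USD per vCore-hour, General Purpose
  cmp_diu_hour_rate   = 0.25  # USD per DIU-hour, Azure-hosted IR

  cmp_dataflow_cores  = 8 # assumed cluster size
  cmp_spinup_minutes  = 4 # spin-up time, billed with the run
  cmp_run_minutes     = 6 # assumed duration of the actual work
  cmp_copy_diu        = 4

  cmp_dataflow_cost_per_run = local.cmp_dataflow_cores * (local.cmp_spinup_minutes + local.cmp_run_minutes) / 60 * local.cmp_vcore_hour_rate
  cmp_copy_cost_per_run     = local.cmp_copy_diu * local.cmp_run_minutes / 60 * local.cmp_diu_hour_rate
}

output "cost_per_run_usd" {
  value = {
    data_flow = local.cmp_dataflow_cost_per_run # roughly 0.26
    copy      = local.cmp_copy_cost_per_run     # roughly 0.10
  }
}
```

Under these assumptions the data flow costs about two and a half times as much per run, and the spin-up minutes account for a large share of that.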

Skip Self-hosted IR when Azure-hosted will work

Self-hosted IR requires running a VM (a separate compute cost) and has a higher orchestration rate ($1.50 vs $1 per 1,000 runs). Use it only when accessing on-prem data sources or networks without public access.

FAQ

Why is Azure Data Factory so complex to estimate?

It has multiple billing dimensions (orchestration, DIU for data movement, vCores for data flows), and each scales differently. Activity counts depend on pipeline structure. Data movement DIU depends on data volume and on how 'Auto' tuning sizes each Copy activity. Base your estimate on pipeline complexity and expected execution frequency.

When should I use Data Factory vs Synapse Pipelines?

Synapse Pipelines is essentially Data Factory inside Azure Synapse Analytics. Same engine, similar pricing. If you're using Synapse for the data warehouse, use Synapse Pipelines. If you're not on Synapse, Data Factory works standalone. Capabilities are nearly identical.

Is Data Factory cheaper than Databricks for ETL?

It depends on complexity. For simple ingest/transform work (CSV to Parquet, basic filtering), Data Factory's Copy activity is cheaper than spinning up Databricks. For complex transformations that need custom Python/Scala code or ML model integration, Databricks is more flexible and often more cost-effective for the actual processing.

How do I reduce DIU costs?

Three levers. (1) Set DIU=4 manually for small data sources instead of Auto. (2) Combine small files into batches before processing. (3) Use compression (Parquet, gzip) to reduce transfer time. DIU-hours are time × DIU count, so reducing either lowers cost.