azurerm_data_factory cost estimation
Managed ETL/ELT and data integration. Costs come from pipeline orchestration ($1 per 1,000 activity runs), data flow execution ($0.193-$0.34/vCore-hour depending on compute type), and Integration Runtime hours. Data movement is billed separately ($0.25/DIU-hour on the Azure-hosted IR).
Azure Data Factory (ADF) is Microsoft's managed ETL/ELT service. The azurerm_data_factory resource creates the factory itself; pipelines, data flows, and triggers are configured through separate resources. Pricing has several distinct components, and the way they combine often surprises teams.
Pipeline orchestration:
- $1 per 1,000 pipeline activity runs (Azure-hosted Integration Runtime)
- $1.50 per 1,000 runs (Self-hosted IR)
- Includes activities like Copy, Lookup, ForEach, Wait, Set Variable
Data movement (Copy activity):
- Azure-hosted IR: $0.25 per DIU-hour (Data Integration Unit), with a 4 DIU minimum per Copy activity
- Self-hosted IR: $0.10 per hour of activity runtime
Data flow execution (visual ETL on a Spark backend):
- General Purpose: $0.193/vCore-hour
- Memory Optimized: $0.34/vCore-hour
- Compute Optimized: $0.193/vCore-hour
- Minimum 4 cores; a typical data flow runs on 8-32 cores for several minutes
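As a rough illustration of how the data flow rates above turn into a per-execution cost, here is a minimal Python sketch. The function name `data_flow_cost` and the dictionary keys are hypothetical, introduced only for this example; the rates and the 4-core minimum are the ones quoted above. It assumes cost scales linearly with cores and billed minutes.

```python
# Rates in USD per vCore-hour, copied from the list above.
DATA_FLOW_RATES = {
    "general_purpose": 0.193,
    "memory_optimized": 0.34,
    "compute_optimized": 0.193,
}
MIN_CORES = 4  # minimum cluster size stated above


def data_flow_cost(cores: int, minutes: float,
                   compute_type: str = "general_purpose") -> float:
    """Return the USD cost of one data flow execution.

    `minutes` is the total billed cluster time; whether that includes
    spin-up is left to the caller.
    """
    if cores < MIN_CORES:
        raise ValueError(f"data flows need at least {MIN_CORES} cores")
    rate = DATA_FLOW_RATES[compute_type]
    return cores * (minutes / 60) * rate


if __name__ == "__main__":
    # Compare the typical 8-32 core range for a 10-minute execution.
    for cores in (8, 16, 32):
        print(f"{cores} cores x 10 min: ${data_flow_cost(cores, 10):.2f}")
```

Because the cost is a straight product of cores, hours, and rate, doubling the core count only pays off if it cuts the runtime by more than half.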
A typical pipeline running hourly, with one Copy activity per run:
- 24 pipeline runs/day × 30 days = 720 runs/month
- Pipeline orchestration: 720 × $0.001 = $0.72/month
- Copy activity: 4 DIU × 720 runs × 0.1 hours (6 min average) = 288 DIU-hours × $0.25 = $72/month
- Total: ~$73/month for the pipeline
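The arithmetic in this example can be reproduced with a short Python sketch, which makes it easy to plug in a different schedule or Copy duration. The function `monthly_copy_pipeline_cost` and its parameter names are hypothetical; the rates are the Azure-hosted IR figures quoted earlier, and the sketch assumes a 30-day month and one activity per pipeline run unless told otherwise.

```python
ORCHESTRATION_PER_RUN = 1.00 / 1000  # $1 per 1,000 activity runs (Azure-hosted IR)
DIU_HOUR_RATE = 0.25                 # USD per DIU-hour (Azure-hosted IR)
MIN_DIU = 4                          # minimum DIU per Copy activity, as stated above


def monthly_copy_pipeline_cost(runs_per_day: int, days: int = 30,
                               diu: int = MIN_DIU, copy_minutes: float = 6.0,
                               activities_per_run: int = 1) -> dict:
    """Estimate the monthly cost of a pipeline with one Copy activity per run."""
    runs = runs_per_day * days
    orchestration = runs * activities_per_run * ORCHESTRATION_PER_RUN
    diu_hours = diu * runs * (copy_minutes / 60)
    data_movement = diu_hours * DIU_HOUR_RATE
    return {
        "runs": runs,
        "orchestration": orchestration,
        "diu_hours": diu_hours,
        "data_movement": data_movement,
        "total": orchestration + data_movement,
    }


if __name__ == "__main__":
    estimate = monthly_copy_pipeline_cost(runs_per_day=24)
    for key, value in estimate.items():
        print(f"{key}: {value:.2f}")
```

With the hourly schedule above, the sketch yields 720 runs, 288 DIU-hours, and a total of about $72.72. Data movement dominates; orchestration is under 1% of the bill.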
A Self-hosted Integration Runtime (for on-prem data) requires a VM running the IR agent. That VM is billed separately from ADF.
C3X estimates Data Factory cost from the declared pipelines and, via c3x-usage.yml, from the expected activity counts and Copy activity volume.
Terraform example
A minimal but realistic configuration that C3X can estimate.
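# Assumes azurerm_resource_group.main is declared elsewhere in the configuration.
resource "azurerm_data_factory" "main" {
  name                = "prod-data-factory"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  # System-assigned managed identity for the factory.
  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment = "production"
  }
}

# Self-hosted Integration Runtime for on-prem sources; the VM that runs it is billed separately.
resource "azurerm_data_factory_integration_runtime_self_hosted" "onprem" {
  name            = "onprem-ir"
  data_factory_id = azurerm_data_factory.main.id
}

Pricing dimensions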
What you actually pay for when you provision azurerm_data_factory.
| Dimension | Unit | What's being charged |
|---|---|---|
| Pipeline activity runs | per 1,000 runs | Orchestration fee per pipeline activity execution. $1 per 1,000 runs (Azure IR) |
| Data movement (DIU) | per DIU-hour | Copy activity uses Data Integration Units. Min 4 DIU per Copy activity. $0.25/DIU-hour Azure-hosted |
| Data flow execution | per vCore-hour | Mapping data flows run on Spark. Min 4 cores. $0.193/vCore-hour General Purpose |
| Self-hosted IR | per hour per activity | Charged when activities run via self-hosted Integration Runtime. $0.10/hour per activity |
| External activity runs | per 1,000 runs | Activities that run on external services (Databricks, HDInsight, etc.). $0.25 per 1,000 runs |
Optimization tips
Common ways to reduce azurerm_data_factory cost without changing the workload.
Tune Copy activity DIU count
Savings: 20-50% on Copy cost
DIU defaults to Auto, which lets Azure pick a value based on data size. For small transfers (under 1 GB), Auto may over-provision. Set DIU=4 explicitly for small files to avoid paying for unused capacity. For large files, a higher DIU count speeds up the transfer, and the shorter runtime offsets the higher hourly cost.
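Whether a higher DIU count pays for itself comes down to DIU × time. The following Python sketch compares two settings and computes the runtime a larger setting must beat to break even. The helper names and the sample durations are hypothetical, chosen only to illustrate the trade-off; the $0.25/DIU-hour rate is the Azure-hosted figure used throughout this page.

```python
DIU_HOUR_RATE = 0.25  # USD per DIU-hour (Azure-hosted IR)


def copy_cost(diu: int, minutes: float, rate: float = DIU_HOUR_RATE) -> float:
    """USD cost of one Copy activity: DIU count x hours x rate."""
    return diu * (minutes / 60) * rate


def break_even_minutes(baseline_diu: int, baseline_minutes: float,
                       candidate_diu: int) -> float:
    """Runtime the candidate DIU setting must beat to cost less than the baseline."""
    return baseline_diu * baseline_minutes / candidate_diu


if __name__ == "__main__":
    # Hypothetical small transfer: 3 minutes at 4 DIU versus 2 minutes at 16 DIU.
    print(f"4 DIU, 3 min:  ${copy_cost(4, 3):.4f}")
    print(f"16 DIU, 2 min: ${copy_cost(16, 2):.4f}")
    print(f"16 DIU must finish in under {break_even_minutes(4, 3, 16)} min to be cheaper")
```

In this made-up case the 16 DIU run would need to finish in under 0.75 minutes to cost less than the 4 DIU run, which is why over-provisioning on small files rarely pays off.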
Minimize ForEach iterations
Savings: Proportional to iteration reduction
Each ForEach iteration counts as at least one billable activity run. A ForEach over 10,000 files is 10,000 runs = $10. Often you can process the same files in fewer iterations (for example, by iterating over directories instead of individual files) or replace the loop with the Copy activity's wildcard pattern.
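To see how batching changes the orchestration charge, here is a small Python sketch. The function `foreach_orchestration_cost` and its parameters are hypothetical; it assumes the Azure-hosted rate of $1 per 1,000 activity runs and, by default, one billable activity per iteration.

```python
import math

RATE_PER_1000_RUNS = 1.00  # USD per 1,000 activity runs (Azure-hosted IR)


def foreach_orchestration_cost(items: int, batch_size: int = 1,
                               activities_per_iteration: int = 1) -> float:
    """Orchestration cost in USD of a ForEach over `items`, `batch_size` items per iteration."""
    iterations = math.ceil(items / batch_size)
    runs = iterations * activities_per_iteration
    return runs * RATE_PER_1000_RUNS / 1000


if __name__ == "__main__":
    print(f"per file:      ${foreach_orchestration_cost(10_000):.2f}")
    print(f"per directory: ${foreach_orchestration_cost(10_000, batch_size=100):.2f}")
```

One iteration per file reproduces the $10 figure above; grouping the same files into batches of 100 (for example, one directory per iteration) drops the orchestration charge to $0.10.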
Schedule data flows for off-peak
Savings: Variable based on workload pattern
Data flows need a Spark cluster to spin up (1-5 minutes per session). Running multiple data flows in a single session reduces spin-up overhead, so use the same integration runtime across consecutive data flows.
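The effect of sharing a session can be sketched in Python. This is an illustration under stated assumptions, not a billing formula: it assumes spin-up time is billed at the same vCore rate as execution, and that a shared session pays the spin-up only once. The function names and the sample workload are hypothetical.

```python
GENERAL_PURPOSE_RATE = 0.193  # USD per vCore-hour


def session_cost(cores: int, run_minutes: float, spin_up_minutes: float,
                 rate: float = GENERAL_PURPOSE_RATE) -> float:
    """Cost of one Spark session, assuming spin-up is billed like run time."""
    return cores * ((run_minutes + spin_up_minutes) / 60) * rate


def compare_sessions(flows: int, cores: int, run_minutes_each: float,
                     spin_up_minutes: float) -> tuple[float, float]:
    """Return (separate, shared): cost with one session per flow versus one session for all."""
    separate = flows * session_cost(cores, run_minutes_each, spin_up_minutes)
    shared = session_cost(cores, flows * run_minutes_each, spin_up_minutes)
    return separate, shared


if __name__ == "__main__":
    # Hypothetical workload: 5 flows, 8 cores, 3 minutes each, 4-minute spin-up.
    separate, shared = compare_sessions(5, 8, 3, 4)
    print(f"separate sessions: ${separate:.2f}")
    print(f"shared session:    ${shared:.2f}")
```

Under these assumptions the saving equals the avoided spin-ups: (flows - 1) × cores × spin-up hours × rate. Short data flows benefit most, because spin-up makes up a larger share of their billed time.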
Use mapping data flows efficiently
Data flows are expensive: every core is billed at $0.193/vCore-hour for the cluster spin-up (roughly 4 minutes) plus the actual run. For simple transformations, the Copy activity's built-in mapping is much cheaper. Reserve mapping data flows for complex transformations that require Spark.
Skip Self-hosted IR when Azure-hosted will work
Self-hosted IR requires running a VM (compute cost) and is more expensive per activity. Only use when accessing on-prem data sources or networks without public access.
FAQ
Why is Azure Data Factory so complex to estimate?
It has multiple billing dimensions (orchestration, DIU for data movement, vCores for data flows), and each scales differently. Activity counts depend on pipeline structure, and data movement DIU depends on data volume and 'Auto' tuning. Base your estimate on pipeline complexity and expected execution frequency.
When should I use Data Factory vs Synapse Pipelines?
Synapse Pipelines is essentially Data Factory inside Azure Synapse Analytics. Same engine, similar pricing. If you're using Synapse for the data warehouse, use Synapse Pipelines. If you're not on Synapse, Data Factory works standalone. Capabilities are nearly identical.
Is Data Factory cheaper than Databricks for ETL?
Depends on complexity. For simple ingest/transform (CSV to Parquet, basic filtering), Data Factory's Copy activity is cheaper than spinning up Databricks. For complex transformations needing custom Python/Scala code or ML model integration, Databricks is more flexible and often more cost-effective for the actual processing.
How do I reduce DIU costs?
Three levers. (1) Set DIU=4 manually for small data sources instead of Auto. (2) Combine small files into batches before processing. (3) Use compression (Parquet, gzip) to reduce transfer time. DIU-hours are time × DIU count, so reducing either lowers cost.
Estimate this resource in your own Terraform
Free, open source, no API key. C3X parses your Terraform and shows line-item cost for every resource, including azurerm_data_factory.