Azure Data Factory's orchestration problem

09.06.2023 | 2 min Read
Category: Data Engineering

Azure Data Factory has an inherent limitation in how it runs activities in parallel, and Microsoft Fabric does not resolve it. The solution is to use a dedicated orchestration service such as Azure's recently introduced *Managed Airflow*.

Azure Data Factory (ADF) has a weakness in how it handles parallelisation in “ForEach loops”, which can lead to significant time delays and inefficiency for data platforms of a certain scale. Since ADF is an integral part of Microsoft Fabric, this is an important limitation to be aware of.

The full article is published on Medium and linked at the bottom of this page. A brief summary follows.

In the article, I argue that a dedicated orchestration service such as Airflow can improve performance and solve this problem, especially now that Azure offers “Managed Airflow”. Concrete examples show that when ADF and Airflow run together, runtimes are consistent regardless of random factors such as the order of tasks in ADF. The combined setup is slightly slower than ADF’s best-case run, but significantly faster than its worst case.
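To make the order dependence concrete, here is a minimal toy simulation in Python. It is a sketch under a stated assumption, not a description of ADF internals: it assumes a ForEach-style loop deals items round-robin into a fixed number of queues up front, and compares that with a scheduler that hands each task to whichever slot frees up first. The function names and task durations are hypothetical and chosen only for illustration.

```python
import heapq


def static_queue_makespan(durations, slots):
    """Total runtime when tasks are pre-assigned round-robin to fixed queues.

    ASSUMPTION: this models a loop that decides the queue of every item
    before anything runs. It is an illustrative model only.
    """
    queue_totals = [0] * slots
    for index, duration in enumerate(durations):
        queue_totals[index % slots] += duration
    # Each queue runs sequentially, so the slowest queue sets the runtime.
    return max(queue_totals)


def dynamic_pool_makespan(durations, slots):
    """Total runtime when each task starts on the first slot that is free."""
    free_at = [0] * slots
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available slot
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Same four tasks (hypothetical durations in minutes), two orderings.
lucky_order = [10, 10, 1, 1]
unlucky_order = [10, 1, 10, 1]

for label, order in [("lucky", lucky_order), ("unlucky", unlucky_order)]:
    print(label,
          "static:", static_queue_makespan(order, slots=2),
          "dynamic:", dynamic_pool_makespan(order, slots=2))
```

In this toy model the pre-assigned queues finish in 11 or 20 minutes depending purely on task order, while the dynamic pool finishes in 11 minutes either way. The model ignores scheduling and startup overhead, so it does not reproduce the small gap to the best-case run mentioned above.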

In addition to solving this specific challenge, Airflow also offers a range of connectors and expanded capabilities, such as using Python anywhere in an orchestration flow. My conclusion is that there is definitely a need for a dedicated orchestration tool such as Managed Airflow in Azure.


Halvar Trøyel Nerbø

Trøyel is a dedicated Data Platform Engineer who has specialised in building data lake and lakehouse-based data platforms in the cloud.