Apache Airflow
Sandeep
Posted on September 18, 2024
Airflow — overview
Apache Airflow is an open-source platform for running any type of workflow. Airflow uses the Python programming language to define pipelines. Users can take full advantage of that by using for loops to define pipelines, executing bash commands, using external modules like pandas, scikit-learn, or GCP and AWS client libraries to manage cloud services, and much, much more.
Airflow is a reliable solution trusted by many companies. Pinterest used to face performance and scalability issues and high maintenance costs. GoDaddy has many batch analytics and data teams that need an orchestration tool and ready-made operators for building ETL pipelines. DXC Technology delivered a client project that required massive data storage and therefore a stable orchestration engine. All of these challenges were worked out by implementing the right Airflow deployment.
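As a quick, purely illustrative sketch (the DAG id, table names, and commands below are made up, not taken from any real project), tasks can be generated in a plain Python for loop:

from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash_operator import BashOperator

with DAG("example_loop_dag", start_date=days_ago(1), schedule_interval=None) as dag:
    # One task per (hypothetical) table, created in a regular Python loop
    for table in ["users", "orders", "payments"]:
        BashOperator(
            task_id=f"export_{table}",
            bash_command=f"echo exporting {table}",
        )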
Apache Airflow use cases
The versatility of Airflow allows you to use it to schedule any type of workflow. Apache Airflow can run ad hoc workloads that are not tied to any schedule or interval. However, it works best for pipelines:
- that change slowly
- related to the time interval
- scheduled on time
“Changing slowly” means that the pipeline, once deployed, is expected to change only from time to time (over days or weeks rather than hours or minutes); this is connected to the lack of versioning for Airflow pipelines. “Related to the time interval” means that Airflow is best suited for processing data in intervals. That’s also why Airflow works best when pipelines are scheduled to run at a specific time, although it is possible to trigger pipelines manually or via external triggers (for example, through the REST API).
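For example, a daily-scheduled pipeline can use the templated {{ ds }} value, the date of the interval being processed, inside its tasks. Here is a minimal sketch; the DAG id and the echo command are illustrative placeholders:

from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash_operator import BashOperator

with DAG(
    "daily_interval_example",
    start_date=days_ago(3),
    schedule_interval="@daily",
) as dag:
    # {{ ds }} is rendered by Airflow to the date of the processed interval,
    # so each scheduled run works on its own slice of data
    BashOperator(
        task_id="process_interval",
        bash_command="echo processing data for {{ ds }}",
    )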
Apache Airflow can be used to schedule:
- ETL pipelines that extract data from multiple sources and run Spark jobs or any other data transformations
- Training machine learning models
- Report generation
- Backups and similar DevOps operations
And much more! You can even write a pipeline to brew coffee every few hours. It will need some custom integrations, but that’s the biggest power of Airflow: it’s pure Python and everything can be programmed.
Let’s start with a few base concepts of Airflow!
Airflow DAG
Workflows in Airflow are defined by DAGs (Directed Acyclic Graphs) and are nothing more than Python files. A single DAG file may contain multiple DAG definitions, although it is recommended to keep one DAG per file.
Let’s take a look at an example DAG:
from airflow.models import DAG
from airflow.utils.dates import days_ago

with DAG(
    "etl_sales_daily",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    ...
First of all, a DAG is identified by its dag_id, which has to be unique across the whole Airflow deployment. Additionally, to create a DAG we need to specify:
- schedule_interval - defines when the DAG should be run. It can be a timedelta object, for example timedelta(days=2), a cron expression string like * * * * *, or None, in which case the DAG will not be scheduled by Airflow but can still be triggered manually or externally.
- start_date - a date (a datetime object) from which the DAG will start running. This makes it possible to run a DAG for past dates. It is common to use the days_ago function to specify this value. If the date is in the future, you can still trigger the DAG manually.
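To make these two parameters more concrete, here is a sketch of a scheduled variant of the DAG above with an explicit datetime start date and a timedelta schedule; the dag_id and the dates are illustrative:

from datetime import datetime, timedelta

from airflow.models import DAG

# A variant with a fixed start_date and a daily timedelta schedule;
# the cron string "0 6 * * *" (every day at 06:00) would be a similar choice.
with DAG(
    "etl_sales_daily_scheduled",
    start_date=datetime(2024, 9, 1),
    schedule_interval=timedelta(days=1),
) as dag:
    ...

Note that with a start date in the past, Airflow may backfill the missed intervals, depending on the catchup setting.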
Once we have this baseline, we can start adding tasks to our DAG:
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    "etl_sales_daily",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    task_a = DummyOperator(task_id="task_a")
    task_b = DummyOperator(task_id="task_b")
    task_c = DummyOperator(task_id="task_c")
    task_d = DummyOperator(task_id="task_d")

    task_a >> [task_b, task_c]
    task_c >> task_d
Every task in an Airflow DAG is defined by an operator (we will dive into more details soon) and has its own task_id, which has to be unique within a DAG. Each task has a set of dependencies that define its relationships to other tasks. These include:
- Upstream tasks - a set of tasks that will be executed before this particular task.
- Downstream tasks - a set of tasks that will be executed after this task.
In our example, task_b and task_c are downstream of task_a, and task_a is respectively upstream of both task_b and task_c. A common way of specifying a relation between tasks is the >> operator, which works for tasks and collections of tasks (for example lists or sets).
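For reference, the same relations can be declared in a few equivalent ways. The fragment below assumes the task_a to task_d objects from the DAG above and is meant as a comparison of styles rather than code to run as a whole:

# Inside the same "with DAG(...)" block as above; pick one style per dependency.
task_a >> [task_b, task_c]               # task_a runs before task_b and task_c
[task_b, task_c] << task_a               # reversed bitshift operator, same meaning
task_a.set_downstream([task_b, task_c])  # explicit method call, same meaning

task_c >> task_d                         # task_d runs after task_c
task_d.set_upstream(task_c)              # same meaning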
This is what a graphical representation of this DAG looks like: task_a fans out to task_b and task_c, and task_d follows task_c.
Additionally, each task can specify a trigger_rule, which allows users to make the relations between tasks even more complex. Examples of trigger rules are:
- all_success - all upstream tasks of a task have to succeed before Airflow attempts to execute this task
- one_success - one succeeded upstream task is enough to trigger a task with this rule
- none_failed - each upstream task has to either succeed or be skipped; no failed upstream tasks are allowed to trigger this task
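As a sketch of how a trigger rule is attached to a task (the dag_id and task ids here are illustrative), the rule is passed as the trigger_rule argument of an operator:

from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG("trigger_rule_example", start_date=days_ago(1), schedule_interval=None) as dag:
    branch_a = DummyOperator(task_id="branch_a")
    branch_b = DummyOperator(task_id="branch_b")

    # Runs as soon as either upstream task succeeds, instead of waiting
    # for both to succeed as the default all_success rule would require
    join = DummyOperator(task_id="join", trigger_rule=TriggerRule.ONE_SUCCESS)

    [branch_a, branch_b] >> join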
All of this allows users to define arbitrarily complex workflows in a really simple way.