How to configure PHP in Airflow?

darkotodoric

Darko Todorić

Posted on May 23, 2024

Apache Airflow is an open-source platform used for managing complex workflows. It allows users to schedule, monitor, and manage tasks and commands.

In essence, Airflow serves as a powerful alternative to traditional cron-based scheduling systems. While cron (or crontab) is widely used for scheduling repetitive tasks on Unix-like systems, it lacks features for dependency management, monitoring, and dynamic scheduling, which are crucial for managing complex workflows. Airflow, on the other hand, provides a robust framework for defining, scheduling, and executing workflows, making it an ideal choice for data pipelines.

While Airflow is primarily associated with Python, its flexibility allows users to integrate tasks written in other languages, such as PHP. This opens up new possibilities for leveraging existing PHP codebases, libraries, and expertise within Airflow workflows. In this guide, we'll explore how to configure PHP tasks within Airflow, enabling you to harness the power of both Airflow and PHP in your data pipelines.

I'm writing this guide because I couldn't find any tutorial explaining how to set up PHP with Airflow.

At the end of the article, you can find a link to my GitHub repository with a working example of PHP integrated into Airflow. If you are only interested in the solution, you can skip the explanation and go straight to the repository.


Do you really need Airflow?

Do you really need another relatively complex tool that will complicate your infrastructure? It depends. Most projects use crontab to schedule commands, and it does the job very nicely. But once a crontab grows past 100, or even 500, commands, the situation becomes drastically more complicated, and managing all of those commands through crontab is practically impossible.


Airflow structure

If you're still reading this, you've obviously run into the same problem as me: your crontab can no longer cope with the number of commands in it, and you need a more capable solution.

In a typical Docker Compose setup that uses the Celery executor, Airflow consists of 5 key services that work together:

  • PostgreSQL: The database in which Airflow stores its metadata, such as DAG runs and task states
  • Redis: An in-memory data store, used here as the message broker that queues tasks for the workers
  • Airflow Scheduler: Manages the scheduling of tasks and ensures that they are executed at the right time
  • Airflow Worker: Executes the tasks scheduled by the Airflow scheduler
  • Airflow Webserver: Hosts the Airflow web interface, allowing users to monitor and manage workflows through a web browser

Integrating PHP into Airflow

To integrate PHP into Airflow, we need to ensure that PHP scripts are executed where the commands are run, which is within the Airflow worker. The Airflow worker is responsible for executing the tasks scheduled by the Airflow scheduler.

Here's how we can do it:

  1. Modify the Airflow Worker Dockerfile: Ensure that PHP is installed in the Airflow worker's environment. This can be done by modifying the Dockerfile used to build the Airflow worker image.

  2. Install PHP: Add the necessary commands to install PHP and any required extensions in the Dockerfile.

  3. Run PHP Scripts: Use the BashOperator in Airflow to run PHP scripts as tasks within your DAGs. This way, when a task is executed by the Airflow worker, it can invoke the PHP script.
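
Step 3 can be sketched as a small DAG file. This is only a minimal sketch, assuming Airflow 2.4 or newer (for the `schedule` argument) and a worker image that already has the `php` binary on its PATH. The DAG id, task id, and the script path `/opt/airflow/php/hello.php` are hypothetical names chosen for illustration, and `php_command` is a helper written for this example, not part of Airflow.

```python
import shlex


def php_command(script_path, *args):
    """Build a shell command that runs a PHP script with safely quoted arguments."""
    return " ".join(["php", shlex.quote(script_path), *map(shlex.quote, args)])


def build_dag():
    """Create a DAG with a single task that runs a PHP script on the worker."""
    # Imported lazily so the helper above can be used without Airflow installed.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="php_example",             # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",                # same idea as a daily crontab entry
        catchup=False,                    # do not backfill past runs
    ) as dag:
        BashOperator(
            task_id="run_php_script",     # hypothetical task name
            # Hypothetical script path inside the worker container.
            bash_command=php_command("/opt/airflow/php/hello.php", "--name", "Airflow"),
        )
    return dag
```

In a real DAG file you would add `dag = build_dag()` at module level so that Airflow can discover the DAG. Quoting the arguments with `shlex.quote` keeps values that contain spaces or shell metacharacters from being split or interpreted by Bash.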

By following these steps, you can integrate PHP into your Airflow environment, allowing you to utilize PHP for various tasks within your workflows.
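
To confirm that the worker environment really has PHP available, a small check like the one below may help. It is a sketch that uses only the Python standard library. The function names `find_php` and `php_version` are made up for this example, and it assumes the binary is simply called `php`.

```python
import shutil
import subprocess


def find_php(binary="php"):
    """Return the full path of the PHP binary, or None if it is not on PATH."""
    return shutil.which(binary)


def php_version(binary="php"):
    """Return the first line of `php --version`, or None if PHP is missing."""
    path = find_php(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = php_version()
    if version is None:
        print("PHP was not found on PATH in this environment")
    else:
        print(version)
```

The check is only meaningful when it runs inside the Airflow worker container, because that is where tasks are executed. Finding PHP on your host machine says nothing about the worker image.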

For a complete example with functional code, you can find the full setup and instructions at https://github.com/darkotodoric/php-in-airflow
