Deeksha Gunde
Posted on November 4, 2024
Managing data transformations and dependencies across multiple projects in data engineering can be tricky. As more organizations move toward modularizing data pipelines, one of the major challenges faced by data engineers is how to handle dependencies between various DBT (data build tool) projects. In this blog, we will dive into how to tackle cross-project dependencies in DBT while orchestrating workflows using AWS Managed Workflows for Apache Airflow (MWAA).
Overview of DBT and Cross-Project Dependencies
DBT (Data Build Tool) allows you to define models in SQL, and those models are organized in projects. However, as the complexity of data pipelines grows, it's common for different DBT projects to have dependencies on models in other projects. Managing such dependencies in production environments is critical for smooth data pipeline execution.
I set out to build on the multi-repo strategy described by @elliott_cordo in Avoiding the DBT Monolith, extending the solution to run in an MWAA environment.
In complex data pipeline environments, managing cross-project dependencies, especially for tools like dbt (Data Build Tool), can pose unique challenges. If you're running Apache Airflow on AWS Managed Workflows for Apache Airflow (MWAA) and integrating private Git repositories, securing access to these repositories during DAG execution often requires creative problem-solving.
I encountered a specific issue where the dbt deps command failed to clone a private Git repository in an MWAA environment because of how the SSH key stored in AWS Secrets Manager was handled. This post explains how I first attempted to resolve the problem with the PythonVirtualenvOperator and ultimately transitioned to packaging the parent dbt project inside a plugins.zip file. This approach resolved the dependency issues and let me securely reference models from other dbt projects.
Setting Up AWS MWAA with Terraform
To run DBT projects in the cloud, AWS Managed Workflows for Apache Airflow (MWAA) provides an easy-to-manage solution. For this project, I used Terraform to create MWAA environments, manage Airflow DAGs, and provision other necessary resources. Here’s a snippet of how we set up the MWAA environment using Terraform:
module "mwaa" {
source = "aws-ia/mwaa/aws"
name = local.resource_name
airflow_version = var.mwaa_airflow_version
environment_class = var.environment_class
create_s3_bucket = false
source_bucket_arn = module.bucket.s3_bucket_arn
dag_s3_path = "dags"
iam_role_name = local.resource_name
requirements_s3_path = "mwaa/requirements.txt"
startup_script_s3_path = "mwaa/startup.sh"
min_workers = var.min_workers
max_workers = var.max_workers
vpc_id = try(var.vpc_id, module.vpc.vpc_id)
private_subnet_ids = try(var.private_subnet_ids, module.vpc.private_subnets)
}
This code provisions the MWAA environment and links it to an S3 bucket that houses the Airflow DAGs and Python dependencies.
One Way to Manage DBT Project Dependencies in Airflow
When orchestrating DBT projects using Airflow in MWAA, one of the primary concerns is making sure models from one DBT project can reference those from another. Airflow’s DAGs (Directed Acyclic Graphs) are designed to handle such dependencies, ensuring that one DBT project can finish its transformations before another one starts.
In this scenario, we can use the PythonVirtualenvOperator in Airflow, which allows us to create isolated virtual environments to run Python code, including DBT commands. We use this operator because dbt's Python dependencies are difficult to resolve against the standard Airflow constraints file, and modifying that file is complex and potentially unstable for long-term use. Refer to the Airflow constraints file documentation for more details.
The PythonVirtualenvOperator allows us to install additional libraries and dependencies specific to each task without interfering with the main Airflow environment or requiring permanent changes to Airflow's global constraints. This flexibility is essential because DBT projects often require specific library versions that conflict with the dependencies needed by other parts of Airflow, and modifying Airflow's constraints file directly to add them could cause future instability and is generally discouraged. By isolating DBT's dependencies within a virtual environment, we avoid compatibility issues and ensure a stable execution environment for commands such as dbt deps or dbt run, without risking conflicts with Airflow's core dependencies.
As part of an Airflow DAG, I needed to clone a private dbt project from a Git repository. This project contained key models and macros that needed to be referenced in a child dbt project. To handle secure access to the repository, I stored the SSH private key in AWS Secrets Manager and used the PythonVirtualenvOperator to run the dbt deps command inside the virtual environment during DAG execution. This step adds the parent project to the dbt_packages directory, making its models easily accessible and referable from the child DBT project's models.
Here’s a simplified structure of the DAG I was working with that manages cross-project dependencies using PythonVirtualenvOperator:
from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator
from airflow.utils.dates import days_ago


def dbt_run(selected_folder=None):
    # Imports live inside the callable because it executes in an isolated virtualenv
    import os

    os.system("dbt deps")
    os.system(f"dbt run --models {selected_folder}")


with DAG('dbt_cross_project',
         start_date=days_ago(1),
         schedule_interval='@daily') as dag:

    run_dbt_project = PythonVirtualenvOperator(
        task_id='run_dbt_project',
        python_callable=dbt_run,
        op_kwargs={'selected_folder': 'dbt_poc_child'},
        requirements=['dbt'],  # in practice, pin dbt-core plus the adapter for your warehouse
        system_site_packages=False
    )
And the corresponding packages.yml file:
packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.7
  - package: dbt-labs/audit_helper
    version: 0.9.0
  # Replace the git reference below with your parent DBT project
  - git: 'git@github.com:d-gunde/dbt_poc.git'
    revision: main
The idea was that by invoking the dbt deps command within the virtual environment, I could securely clone the parent dbt project from the private repository using the SSH key stored in AWS Secrets Manager.
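For context, the callable would have needed to fetch the key and hand it to Git before dbt deps could clone anything. The sketch below shows roughly what that looks like; the secret name, key path, and function name are placeholders, and boto3 would have to be added to the task's requirements (or system_site_packages enabled) since the virtualenv is isolated.

def dbt_run_with_ssh(selected_folder=None):
    # Hypothetical sketch: pull the deploy key from Secrets Manager and point Git at it.
    import os
    import boto3

    secret = boto3.client("secretsmanager").get_secret_value(SecretId="dbt_git_ssh_key")
    key_path = "/tmp/dbt_deploy_key"
    with open(key_path, "w") as f:
        f.write(secret["SecretString"])
    os.chmod(key_path, 0o600)

    # Tell Git (and therefore dbt deps) which key to use and skip host-key prompts.
    os.environ["GIT_SSH_COMMAND"] = f"ssh -i {key_path} -o StrictHostKeyChecking=no"

    os.system("dbt deps")
    os.system(f"dbt run --models {selected_folder}")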
However, during DAG execution, I encountered the following error:
Permission denied (publickey).
Despite configuring the SSH key in Secrets Manager, the Airflow worker couldn’t access the private repository. Upon investigating the issue, it turned out to be related to how the libcrypto library and SSH keys are handled in the MWAA environment, which was preventing proper authentication.
The Pivot - Leveraging the Custom Plugins Feature of Apache Airflow
After multiple attempts to resolve the issue with PythonVirtualenvOperator, it became clear that managing SSH keys dynamically during DAG execution wasn't the most reliable approach in MWAA. So, I opted for an alternative solution: zipping the parent dbt project and deploying it as part of plugins.zip in the Airflow environment.
I repurposed the custom plugins feature of Apache Airflow, which allows users to extend Airflow's core functionality by adding custom operators, sensors, hooks, and other components. By packaging these extensions into a plugins.zip file and uploading it to Airflow, you can introduce new behaviors or integrate external systems into your workflows. However, instead of adding a custom operator or sensor, I added the parent DBT project itself as a plugin, which gives the following advantages:
a. No need for SSH keys: By pre-packaging the parent dbt project inside plugins.zip, I eliminated the need for SSH key management and authentication during the DAG run.
b. Local availability: The parent dbt project is now always available locally in the Airflow environment, ensuring that all models and macros are accessible without additional network calls.
c. Scalability: This approach scales well for environments with multiple dbt projects and can be easily automated with Terraform and other CI/CD tools.
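Building the archive itself is easy to automate in CI/CD. Below is a minimal sketch of how the parent project could be zipped before Terraform uploads it; the directory layout is an assumption based on my repository structure and will differ in yours.

import os
import zipfile

# Assumed layout: the parent dbt project lives in ./dbt_poc and the archive is
# written to ./mwaa/plugins.zip, where the Terraform code below picks it up.
PARENT_PROJECT_DIR = "dbt_poc"
PLUGINS_ZIP = "mwaa/plugins.zip"

os.makedirs(os.path.dirname(PLUGINS_ZIP), exist_ok=True)
with zipfile.ZipFile(PLUGINS_ZIP, "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk(PARENT_PROJECT_DIR):
        for name in files:
            path = os.path.join(root, name)
            # Keep dbt_poc/ as the top-level folder so MWAA extracts it to
            # /usr/local/airflow/plugins/dbt_poc on the workers.
            zf.write(path, arcname=path)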
It’s crucial to ensure that cross-project DBT dependencies are accounted for in the plugins and requirements files. Modify the MWAA provisioning code as below to add the plugins.zip:
module "mwaa" {
source = "aws-ia/mwaa/aws"
name = local.resource_name
airflow_version = var.mwaa_airflow_version
environment_class = var.environment_class
create_s3_bucket = false
source_bucket_arn = module.bucket.s3_bucket_arn
dag_s3_path = "dags"
iam_role_name = local.resource_name
plugins_s3_path = "mwaa/plugins.zip"
requirements_s3_path = "mwaa/requirements.txt"
startup_script_s3_path = "mwaa/startup.sh"
min_workers = var.min_workers
max_workers = var.max_workers
vpc_id = try(var.vpc_id, module.vpc.vpc_id)
private_subnet_ids = try(var.private_subnet_ids, module.vpc.private_subnets)
}
resource "aws_s3_bucket_object" "mwaa_plugins" {
bucket = var.mwaa_s3_bucket
key = "plugins.zip"
source = "path/to/plugins.zip"
}
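Once the environment picks up the new plugins.zip, MWAA extracts its contents under /usr/local/airflow/plugins on the workers, which is the path the child project references in the next step.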
With the parent dbt project now part of the Airflow environment, I updated the packages.yml file in the child dbt project to reference the parent project locally:
packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.7
  - package: dbt-labs/audit_helper
    version: 0.9.0
  - local: /usr/local/airflow/plugins/dbt_poc
By using the local keyword, dbt could access the parent project's models and macros from the Airflow plugin directory during DAG execution.
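With the Git dependency replaced by a local one, the virtualenv callable no longer needs any credentials; dbt deps simply copies the package from the plugins directory. Here is a minimal sketch of what the task callable can look like now (the project path and function name are placeholders from my setup, and subprocess is used instead of os.system so failures surface as task errors):

def dbt_run_local(project_dir="/usr/local/airflow/dags/dbt_poc_child"):  # placeholder path
    import subprocess

    # dbt deps now resolves the parent project from /usr/local/airflow/plugins/dbt_poc
    # via the `local:` entry in packages.yml, so no SSH keys or network calls are needed.
    subprocess.run(["dbt", "deps", "--project-dir", project_dir], check=True)
    subprocess.run(["dbt", "run", "--project-dir", project_dir], check=True)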
Handling Common Errors and Solutions
Working with DBT in Airflow is generally smooth, but cross-project dependencies can lead to some common issues:
Invalid Identifiers: One issue you may encounter is an SQL compilation error due to an invalid identifier when referencing models from another project. This happens when DBT can't find the model you're referencing. To fix this, ensure that the correct path and project configurations are in place within your DBT models.
SSH Key Issues: If your DBT project references a private Git repository, you may run into permission errors like Permission denied (publickey) when trying to clone it. In my case, I stored the SSH private key in AWS Secrets Manager and retrieved it at runtime during DAG execution, but this approach did not work in the MWAA environment.
Conclusion
Handling cross-project dependencies in DBT with AWS MWAA might seem complicated, but with the right setup and orchestration it becomes manageable. By using tools like Terraform for provisioning, the PythonVirtualenvOperator for isolated execution, and the AWS MWAA environment for managing workflows, you can build scalable, reliable data pipelines that handle dependencies across DBT projects efficiently. While effective, this approach can add complexity to CI/CD pipelines due to the need for customized environment configurations and dependency management within MWAA.
An alternative to consider is the KubernetesPodOperator, which executes tasks in isolated Kubernetes pods, potentially simplifying dependency handling and CI/CD processes. Weigh the pros and cons of both operators carefully before fully committing to the PythonVirtualenvOperator approach and investing time in the setup.
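For reference, a KubernetesPodOperator task might look roughly like the sketch below. This assumes a Kubernetes cluster (for example EKS) reachable from MWAA and a container image with dbt and both projects baked in; the image, namespace, and paths are placeholders, and the exact import path depends on your cncf.kubernetes provider version.

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
# Older provider versions expose this operator under
# airflow.providers.cncf.kubernetes.operators.kubernetes_pod instead.

run_dbt_child = KubernetesPodOperator(
    task_id="run_dbt_child",
    name="dbt-child",
    namespace="data-pipelines",            # placeholder namespace
    image="my-registry/dbt-child:latest",  # hypothetical image with dbt and both projects installed
    cmds=["dbt"],
    arguments=["run", "--project-dir", "/dbt/dbt_poc_child"],
    get_logs=True,
)

The trade-off is operating a cluster and container images, which is why the pros and cons above are worth weighing up front.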