Data Processing with Airflow
Overview
“Apache Airflow® is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Airflow’s extensible Python framework enables you to build workflows connecting with virtually any technology. A web-based UI helps you visualize, manage, and debug your workflows. You can run Airflow in a variety of configurations — from a single process on your laptop to a distributed system capable of handling massive workloads.”
Getting Started
The official getting-started documentation for Airflow can be found here: Airflow Documentation.
The quickest and simplest way to get started is to run Airflow in standalone mode, using the apache-airflow package available from PyPI.
Full installation instructions can be found here: Airflow Installation.
For example:

```shell
# Typical command for installation from PyPI: e.g. latest version of Airflow is 3.1.3, and Python version is 3.12
pip install "apache-airflow==3.1.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.1.3/constraints-3.12.txt"
```

This can also be done as follows:
```shell
# Set the version of Airflow you want to install here
AIRFLOW_VERSION=3.1.3

# Extract the version of Python you have installed. If you're currently using a
# Python version that is not supported by Airflow, you may want to set this manually.
# See above for supported versions.
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"

CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example this would install 3.1.3 with Python 3.10:
# https://raw.githubusercontent.com/apache/airflow/constraints-3.1.3/constraints-3.10.txt

uv pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
```
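The constraint URL in the script above is assembled from the Airflow version and the "major.minor" version of the local Python interpreter. A minimal Python sketch of that same logic (the Airflow version here is just an example value):

```python
import sys

# Airflow version you want to install (example value, not a recommendation)
AIRFLOW_VERSION = "3.1.3"

# Derive "major.minor" from the running interpreter, mirroring the shell command above
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"

constraint_url = (
    "https://raw.githubusercontent.com/apache/airflow/"
    f"constraints-{AIRFLOW_VERSION}/constraints-{python_version}.txt"
)

print(constraint_url)
```

Pinning against this constraints file is what guarantees a reproducible set of dependency versions for a given Airflow/Python combination.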
Getting Started with Airflow in the Context of Destination Earth - Data Lake
We have made several projects available on GitHub to get you started with Airflow in the context of Destination Earth - Data Lake.
- airflow-getting-started
This project walks through the basics and then quickly focuses on getting the most from Airflow.
It uses the standalone installation of Airflow (i.e. not on Kubernetes).
- airflow-kubernetes
This project walks you through deploying Airflow on an Islet Kubernetes cluster.
Useful for exposing your DAGs (workflows) to external systems; for example, you could trigger DAGs from the DEDL Stack Service (JupyterHub).
Deployment with Helm is demonstrated
Exposing the Airflow UI and API is demonstrated
A Python client is provided, allowing you to trigger your DAGs from external systems (e.g. from a VM, a Jupyter notebook, etc.)
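The project's own Python client handles the details, but the underlying idea is a call to Airflow's REST API. The sketch below, using only the standard library, builds such a request; the host and DAG id are placeholders, and the endpoint path (`/api/v2/...` in Airflow 3) and authentication scheme should be checked against your Airflow version and deployment:

```python
import json
import urllib.request


def build_trigger_request(base_url: str, dag_id: str, conf: dict) -> urllib.request.Request:
    """Build a POST request asking Airflow to start a new run of `dag_id`.

    The path follows the Airflow stable REST API convention; adjust it
    (and add an auth header) to match your deployment.
    """
    url = f"{base_url}/api/v2/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    # Authentication is deployment-specific (e.g. a bearer token) and omitted here.
    return req


# Hypothetical host and DAG id, for illustration only:
req = build_trigger_request("http://my-airflow.example.org", "my_dag", {"run_date": "2025-01-01"})
print(req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Because triggering is just an HTTP call, the same pattern works from a VM, a Jupyter notebook, or any other external system that can reach the exposed API.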
- airflow-kubernetes-dags
An important consideration when deploying Airflow on Kubernetes is how to manage your DAGs (workflows).
Here we provide a project that demonstrates one way of doing this.
The objective is to set up your own repository of DAGs (e.g. on GitHub) and then use a git-sync sidecar container to pull the latest DAGs into your Airflow deployment.
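With the official Airflow Helm chart, the git-sync sidecar is enabled through the chart's values. The fragment below is a sketch of what the relevant override might look like; the repository URL and branch are placeholders, and the exact key names should be checked against the chart version you deploy:

```yaml
# Excerpt from a Helm values override enabling a git-sync sidecar (placeholder values)
dags:
  gitSync:
    enabled: true
    repo: https://github.com/your-org/your-dags-repo.git   # placeholder repository
    branch: main
    subPath: dags   # path within the repo containing the DAG files
    # Sync interval: older chart versions use `wait` (seconds),
    # newer ones use `period` (e.g. "60s") - check your chart's values reference.
```

With this in place, pushing to the DAG repository is all that is needed to update the workflows in the running deployment.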