Data Processing with Airflow

Overview

Apache Airflow® is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Airflow’s extensible Python framework enables you to build workflows connecting with virtually any technology. A web-based UI helps you visualize, manage, and debug your workflows. You can run Airflow in a variety of configurations — from a single process on your laptop to a distributed system capable of handling massive workloads.

Getting Started

The official getting-started documentation for Airflow can be found here: Airflow Documentation.

  • The quickest and simplest way to get started is to run Airflow in standalone mode, using the apache-airflow package available from PyPI.

For example:

# Typical installation command using PyPI, e.g. for Airflow 3.1.3 on Python 3.12
pip install "apache-airflow==3.1.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.1.3/constraints-3.12.txt"

This can also be scripted so that the constraint URL is derived from your installed Python version (shown here with uv, though plain pip works the same way):

# Set the version of Airflow you want to install here
AIRFLOW_VERSION=3.1.3

# Extract the version of Python you have installed. If you're currently using a Python version that is not supported by Airflow, you may want to set this manually.
# See the Airflow documentation for supported versions.
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"

CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example, with AIRFLOW_VERSION=3.1.3 and PYTHON_VERSION=3.10 this would resolve to: https://raw.githubusercontent.com/apache/airflow/constraints-3.1.3/constraints-3.10.txt

uv pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

Getting Started with Airflow in the Context of Destination Earth - Data Lake

We have made several projects available on GitHub that can get you started with Airflow in the context of Destination Earth - Data Lake (DEDL).

  • airflow-getting-started
  • airflow-kubernetes
    • This project walks you through deploying Airflow on an Islet Kubernetes Cluster.

      • Useful for exposing your DAGs (workflows) to external systems, e.g. triggering DAGs from the DEDL Stack Service (JupyterHub)

      • Deployment with Helm is demonstrated

      • Exposing the Airflow UI and API is demonstrated

      • A Python client is provided, allowing you to trigger your DAGs from external systems (e.g. on a VM, in a Jupyter Notebook, etc.)

    • Destination Earth on GitHub - airflow-kubernetes

  • airflow-kubernetes-dags
    • An important consideration when deploying Airflow on Kubernetes is how to manage your DAGs (workflows).

    • Here we provide a project that demonstrates one way of doing this.

    • The objective is to set up your own repository of DAGs (e.g. on GitHub) and then use a git-sync sidecar container to pull the latest DAGs into your Airflow deployment.

    • Destination Earth on GitHub - airflow-kubernetes-dags
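As an illustration of the git-sync approach, the fragment below shows the relevant values for the official Airflow Helm chart. The repository URL, branch, and subPath are placeholders, and key names can differ between chart versions, so treat this as a sketch and refer to the airflow-kubernetes-dags project for a working configuration.

```yaml
# Sketch of Helm values enabling a git-sync sidecar (official Airflow chart).
# The repo URL, branch and subPath below are placeholders.
dags:
  gitSync:
    enabled: true
    repo: https://github.com/your-org/your-dags-repo.git   # your DAGs repository
    branch: main
    subPath: dags          # folder inside the repo that holds the DAG files
```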