🎯 Data Orchestration

Apache Airflow

An open-source platform to programmatically author, schedule, and monitor workflows, commonly used for orchestrating data pipelines and ETL jobs.

Apache Airflow is an open-source workflow orchestration platform created at Airbnb and now maintained by the Apache Software Foundation. It allows you to define, schedule, and monitor complex data workflows as code.

Core Concepts

DAGs (Directed Acyclic Graphs)


Workflows in Airflow are defined as DAGs—collections of tasks with dependencies. The "acyclic" property ensures no circular dependencies.

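```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Placeholder callables (hypothetical); replace with real ETL logic.
def extract_data():
    """Pull rows from the source system."""

def transform_data():
    """Clean and reshape the extracted rows."""

def load_data():
    """Write the result to the target store."""

# `schedule` needs Airflow 2.4+; older releases use `schedule_interval`.
with DAG('my_etl_dag', start_date=datetime(2024, 1, 1), schedule='@daily') as dag:
    extract = PythonOperator(task_id='extract', python_callable=extract_data)
    transform = PythonOperator(task_id='transform', python_callable=transform_data)
    load = PythonOperator(task_id='load', python_callable=load_data)

    # Run the tasks in order: extract, then transform, then load.
    extract >> transform >> load
```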
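
The ordering and cycle check that a DAG implies can be illustrated without Airflow. The following is a minimal sketch using Python's standard-library `graphlib`, not Airflow's own scheduler code; the `etl_deps` mapping and the `run_order` helper are hypothetical names that mirror the task ids above.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(deps):
    """Return task ids in an order that respects upstream dependencies.

    `deps` maps each task id to the set of task ids it depends on.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


# Same shape as extract >> transform >> load.
etl_deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
print(run_order(etl_deps))

# Making extract depend on load closes the loop and creates a cycle.
cyclic_deps = dict(etl_deps, extract={"load"})
try:
    run_order(cyclic_deps)
except CycleError as err:
    # The second element of args holds the tasks that form the cycle.
    print("cycle detected:", err.args[1])
```

A graph with a cycle has no valid run order, which is why a workflow with circular dependencies cannot be scheduled.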

Operators


Pre-built task types for common operations:
- PythonOperator: Run Python functions
- BashOperator: Execute shell commands
- SQLExecuteQueryOperator: Run SQL queries
- S3/GCS Operators: Interact with cloud storage
- Provider Operators: Snowflake, BigQuery, dbt, etc.

Executors


How Airflow runs tasks:
- SequentialExecutor: One task at a time (development)
- LocalExecutor: Parallel on single machine
- CeleryExecutor: Distributed across workers
- KubernetesExecutor: Pods per task for isolation
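
The difference between running one task at a time and running independent tasks in parallel can be sketched with the standard library. This is an illustration of the idea only, assuming three independent, hypothetical tasks; it does not use or reproduce Airflow's executor classes, and threads are just a stand-in for whatever worker mechanism an executor uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_task(name):
    """Hypothetical task that sleeps briefly and returns its name."""
    time.sleep(0.1)
    return name


task_names = ["a", "b", "c"]

# One at a time, in the spirit of SequentialExecutor.
start = time.perf_counter()
sequential_results = [fake_task(n) for n in task_names]
sequential_seconds = time.perf_counter() - start

# Several at once on one machine, in the spirit of LocalExecutor.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_results = list(pool.map(fake_task, task_names))
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Both approaches produce the same results; only the wall-clock time differs, which is the trade-off the executor choice controls.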

Key Features

- Workflows as Code: Version control your pipelines
- Rich UI: Visual DAG graph, logs, retry handling
- Extensible: Build custom operators and sensors
- Integrations: Provider packages for many third-party services
- Backfills: Re-run historical data processing
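
A backfill re-runs a pipeline once per schedule interval over a historical range. The sketch below only enumerates the dates a daily backfill would cover; `daily_run_dates` is a hypothetical helper, not an Airflow API, and the date range is an arbitrary example.

```python
from datetime import date, timedelta


def daily_run_dates(start, end):
    """Return every date from start to end inclusive, one per daily run."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


# A three-day historical window would produce three separate runs.
for run_date in daily_run_dates(date(2024, 1, 1), date(2024, 1, 3)):
    print(f"would process data for {run_date.isoformat()}")
```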

Airflow vs Alternatives

| Tool | Strength | Best For |
|------|----------|----------|
| Airflow | Flexibility, ecosystem | Complex custom workflows |
| Prefect | Python-native, cloud-first | Modern data apps |
| Dagster | Software-defined assets | Data platform teams |
| dbt Cloud | SQL transformations | Analytics engineering |

Common Use Cases

1. ETL/ELT Pipelines: Orchestrate data movement and transformation
2. ML Workflows: Training, validation, deployment pipelines
3. Reporting: Schedule automated report generation
4. Data Quality: Trigger validation checks on new data

Frequently Asked Questions

What is Apache Airflow used for?

Apache Airflow is used for orchestrating complex workflows, particularly data pipelines. It schedules tasks, manages dependencies, handles retries, and provides monitoring through a web interface.
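
Retry handling can be pictured as re-invoking a failed callable a bounded number of times. This is a simplified sketch, not Airflow's implementation; `run_with_retries` and `flaky_extract` are hypothetical, and the `retries` argument only loosely mirrors the task-level `retries` setting in Airflow.

```python
def run_with_retries(func, retries):
    """Call func, retrying up to `retries` extra times if it raises."""
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as err:
            if attempt == attempts:
                raise  # out of attempts: surface the failure
            print(f"attempt {attempt} failed ({err}); retrying")


calls = {"count": 0}


def flaky_extract():
    """Hypothetical task that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("source not ready")
    return "rows"


result = run_with_retries(flaky_extract, retries=2)
print(result, "after", calls["count"], "calls")
```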

Is Airflow an ETL tool?

Airflow is an orchestration tool, not an ETL tool. It schedules and monitors ETL jobs but does not extract, transform, or load data itself. You use Airflow to coordinate tools like dbt, Python scripts, or SQL queries.

What is a DAG in Airflow?

A DAG (Directed Acyclic Graph) is a collection of tasks with defined dependencies. It represents a complete workflow in which, by default, each task runs only after all of its upstream dependencies complete successfully.

Is Airflow free to use?

Yes. Apache Airflow is free and open-source, released under the Apache License 2.0. Managed services such as Astronomer, Amazon MWAA (AWS), and Cloud Composer (Google) charge for hosting and additional features.

Last updated: 2026-01-21

Published by

Sainath Reddy

Data Engineer at Anblicks
🎯 4+ years experience