Apache Airflow is an open-source workflow orchestration platform created at Airbnb and now maintained by the Apache Software Foundation. It allows you to define, schedule, and monitor complex data workflows as code.
Core Concepts
DAGs (Directed Acyclic Graphs)
Workflows in Airflow are defined as DAGs: collections of tasks connected by dependencies. Because the graph is acyclic, no task can depend on itself, directly or indirectly, so a valid execution order always exists.
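```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; replace the bodies with real extract/transform/load logic.
def extract_data(): ...
def transform_data(): ...
def load_data(): ...

# `schedule` requires Airflow 2.4+; older releases use `schedule_interval`.
with DAG('my_etl_dag', start_date=datetime(2024, 1, 1), schedule='@daily') as dag:
    extract = PythonOperator(task_id='extract', python_callable=extract_data)
    transform = PythonOperator(task_id='transform', python_callable=transform_data)
    load = PythonOperator(task_id='load', python_callable=load_data)
    # Run extract first, then transform, then load.
    extract >> transform >> load
```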
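To illustrate why the acyclic property matters, the sketch below uses Python's standard-library `graphlib` (not Airflow itself) to order the three tasks from the example above and to show that a circular dependency is rejected. The `deps` mapping and the `run_order` helper are illustrative names, and Airflow's own scheduler is considerably more involved than this.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(deps):
    """Return one valid execution order for a {task: {upstream tasks}} mapping."""
    return list(TopologicalSorter(deps).static_order())


# Same shape as extract >> transform >> load: each task lists its upstream tasks.
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
order = run_order(deps)
print(order)

# Making "extract" depend on "load" closes a loop, so no valid order exists.
cyclic = {**deps, "extract": {"load"}}
try:
    run_order(cyclic)
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```

A graph with a cycle has no starting point that satisfies every dependency, which is why such a workflow cannot be scheduled.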
Operators
Pre-built task types for common operations:
- PythonOperator: Run Python functions
- BashOperator: Execute shell commands
- SQLExecuteQueryOperator: Run SQL queries (from the common SQL provider package)
- S3/GCS Operators: Interact with cloud storage
- Provider Operators: Snowflake, BigQuery, dbt, etc.
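The operator types above share one idea: each wraps a unit of work behind a common `execute` method. The sketch below imitates that pattern in plain Python so it runs without Airflow installed. `MiniOperator`, `CallableOperator`, and `CommandOperator` are hypothetical stand-ins, not Airflow classes; in Airflow itself, custom operators subclass `BaseOperator` and override `execute(self, context)`.

```python
import subprocess
import sys


class MiniOperator:
    """Hypothetical stand-in for an operator base class (not Airflow's BaseOperator)."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        raise NotImplementedError


class CallableOperator(MiniOperator):
    """Runs a Python function, in the spirit of PythonOperator."""

    def __init__(self, task_id, python_callable):
        super().__init__(task_id)
        self.python_callable = python_callable

    def execute(self, context):
        return self.python_callable()


class CommandOperator(MiniOperator):
    """Runs an external command, in the spirit of BashOperator."""

    def __init__(self, task_id, command):
        super().__init__(task_id)
        self.command = command

    def execute(self, context):
        # check=True raises if the command exits with a non-zero status.
        done = subprocess.run(self.command, capture_output=True, text=True, check=True)
        return done.stdout.strip()


ops = [
    CallableOperator("answer", python_callable=lambda: 6 * 7),
    CommandOperator("echo", command=[sys.executable, "-c", "print('hello')"]),
]
# Whatever runs the tasks only needs to call execute(); it never inspects the task type.
results = {op.task_id: op.execute(context={}) for op in ops}
print(results)
```

Because every task exposes the same method, the code that runs tasks stays independent of what each task actually does.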
Executors
How Airflow runs tasks:
- SequentialExecutor: One task at a time (development)
- LocalExecutor: Parallel on single machine
- CeleryExecutor: Distributed across workers
- KubernetesExecutor: Pods per task for isolation
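The difference between running one task at a time and running tasks in parallel on a single machine can be sketched with the standard library. This is only an analogy, not how Airflow's executors are implemented: `fake_task` is a made-up placeholder, and a thread pool is used purely to keep the example short.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_task(name):
    """Stand-in for a task that spends its time waiting on I/O."""
    time.sleep(0.2)
    return name


tasks = ["a", "b", "c", "d"]

# One task at a time, similar in spirit to SequentialExecutor.
start = time.perf_counter()
sequential_results = [fake_task(t) for t in tasks]
sequential_seconds = time.perf_counter() - start

# Several tasks at once on one machine, similar in spirit to LocalExecutor.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(fake_task, tasks))
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Both runs produce the same results; only the wall-clock time differs, which is the trade-off the choice of executor controls.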
Key Features
- Workflows as Code: Version control your pipelines
- Rich UI: Visual DAG graph, logs, retry handling
- Extensible: Build custom operators and sensors
- Integrations: Dozens of provider packages covering cloud services, databases, and other tools
- Backfills: Re-run historical data processing
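To make the backfill idea concrete, the sketch below lists the dates a daily workflow would need to reprocess for a given historical range. It is plain Python rather than Airflow code, and `daily_runs` is an illustrative helper name.

```python
from datetime import date, timedelta


def daily_runs(start, end):
    """Yield each date from start to end inclusive, one per daily run."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Reprocessing the first week of January 2024 means one run per day in the range.
runs = list(daily_runs(date(2024, 1, 1), date(2024, 1, 7)))
for run_date in runs:
    print(run_date.isoformat())
```

Each date corresponds to one run of the workflow, so a task that takes its date as a parameter can be re-executed for any past period.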
Airflow vs Alternatives
| Tool | Strength | Best For |
|------|----------|----------|
| Airflow | Flexibility, ecosystem | Complex custom workflows |
| Prefect | Python-native, cloud-first | Modern data apps |
| Dagster | Software-defined assets | Data platform teams |
| dbt Cloud | SQL transformations | Analytics engineering |
Common Use Cases
1. ETL/ELT Pipelines: Orchestrate data movement and transformation
2. ML Workflows: Training, validation, deployment pipelines
3. Reporting: Schedule automated report generation
4. Data Quality: Trigger validation checks on new data
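As an illustration of the data quality use case, a validation step can be an ordinary Python function that raises when a batch looks wrong; a callable that raises an exception causes its Airflow task to fail. The field names, sample rows, and the `check_rows` and `validate_batch` helpers below are all hypothetical.

```python
def check_rows(rows, required=("id", "amount")):
    """Return a list of problems found; an empty list means the batch passed."""
    problems = []
    if not rows:
        problems.append("batch is empty")
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) is None:
                problems.append(f"row {i}: missing {field}")
    return problems


def validate_batch(rows):
    """Raise ValueError if any check fails, so the calling task is marked failed."""
    problems = check_rows(rows)
    if problems:
        raise ValueError("; ".join(problems))


good = [{"id": 1, "amount": 9.5}]
bad = [{"id": 2, "amount": None}, {"amount": 3.0}]

validate_batch(good)  # passes silently
bad_problems = check_rows(bad)
print(bad_problems)
```

A function like `validate_batch` could be passed as the `python_callable` of a `PythonOperator` placed after the task that loads new data.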