What is a data pipeline?

A data pipeline is an automated series of steps that extracts data from source systems (databases, APIs, files), transforms it (cleans, validates, enriches), and loads it into a destination (data warehouse, data lake, or other system).

What is the difference between ETL and ELT?

ETL transforms data in a staging area before loading it into the warehouse. ELT loads raw data first, then transforms it inside the warehouse using the warehouse's own compute. ELT is more common with modern cloud warehouses such as Snowflake and BigQuery.

What tools are used to build data pipelines?

Common tools include Airflow, Dagster, and Prefect for orchestration; Fivetran and Airbyte for ingestion; dbt and Spark for transformation; and Snowflake and BigQuery for storage. The specific stack depends on your requirements.

Data Pipeline - Data Engineering Glossary

What is the difference between batch and streaming pipelines?

Batch pipelines process data at scheduled intervals (e.g., every hour). Streaming pipelines process each record as it arrives, in near real time. Batch is simpler and cheaper to run; streaming provides lower latency but is more complex to build and operate.
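
To make the contrast concrete, here is a minimal Python sketch. It assumes a hypothetical event shape of `(user_id, amount)`; the names `batch_totals` and `stream_totals` are illustrative and do not come from any library.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, float]  # (user_id, amount); a hypothetical event shape


def batch_totals(events: List[Event]) -> Dict[str, float]:
    """Batch: wait until the interval's data is collected, then process it once."""
    totals: Dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


def stream_totals(events: Iterable[Event]) -> Iterator[Dict[str, float]]:
    """Streaming: update state per event and emit a fresh result immediately."""
    totals: Dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
        yield dict(totals)  # a result is available after every event


events = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
batch_result = batch_totals(events)
stream_results = list(stream_totals(iter(events)))
```

Both functions end up with the same totals. The difference is when results become available: once per interval for the batch version, after every event for the streaming version.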

A data pipeline is an automated workflow that moves and transforms data from source systems to destination systems. It is the foundational infrastructure of data engineering: without pipelines, data stays trapped in operational systems and never reaches analysts, dashboards, or ML models.

How Data Pipelines Work

```
┌──────────┐    ┌───────────┐    ┌──────────┐    ┌──────────────┐
│ Sources  │──→ │  Extract  │──→ │Transform │──→ │     Load     │
│          │    │           │    │          │    │              │
│ Databases│    │ APIs      │    │ Clean    │    │ Data         │
│ APIs     │    │ CDC       │    │ Validate │    │ Warehouse    │
│ Files    │    │ Streaming │    │ Enrich   │    │ Data Lake    │
│ SaaS     │    │           │    │ Aggregate│    │ Feature Store│
└──────────┘    └───────────┘    └──────────┘    └──────────────┘
```
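
The extract, transform, and load stages can be sketched as three small Python functions chained together. This is a minimal illustration under stated assumptions, not a production design: `RAW_CSV` is made-up source data, the `payments` table is hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterator

RAW_CSV = """id,email,amount
1, Alice@Example.com ,10.50
2,bob@example.com,not_a_number
3,carol@example.com,7
"""  # hypothetical source data standing in for a file or API response


def extract(text: str) -> Iterator[Dict[str, str]]:
    """Extract: read rows from the source as dictionaries."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows: Iterator[Dict[str, str]]) -> Iterator[tuple]:
    """Transform: clean emails, validate amounts, drop rows that fail."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # validation failed; a real pipeline might quarantine the row
        yield int(row["id"]), row["email"].strip().lower(), amount


def load(rows: Iterator[tuple], conn: sqlite3.Connection) -> int:
    """Load: write the cleaned rows to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (id INTEGER PRIMARY KEY, email TEXT, amount REAL)"
    )
    cur = conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")  # stands in for a warehouse
loaded = load(transform(extract(RAW_CSV)), conn)
```

Because each stage is a generator feeding the next, rows flow through one at a time rather than being held in memory all at once.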

Types of Data Pipelines

Batch Pipelines

- Process data at scheduled intervals (hourly, daily)
- Best for: Reports, analytics, historical analysis
- Tools: Airflow, dbt, Spark, AWS Glue

Streaming Pipelines

- Process data in real time as it arrives
- Best for: Fraud detection, real-time dashboards, alerts
- Tools: Kafka, Flink, Spark Streaming, Kinesis

Hybrid Pipelines (Lambda/Kappa)

- Combine batch and streaming for different latency needs
- Lambda: Separate batch + speed layers
- Kappa: Single streaming layer serves all needs
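
The difference between the two hybrid styles can be sketched in a few lines of Python. The event shape `(page, views)` and the helper `build_view` are hypothetical, and real Lambda or Kappa systems involve far more machinery (durable logs, reprocessing, serving stores) than this toy shows.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, int]  # (page, views); a hypothetical event shape


def build_view(events: Iterable[Event]) -> Counter:
    """Aggregate page views; used here by every layer for simplicity."""
    view: Counter = Counter()
    for page, views in events:
        view[page] += views
    return view


historical = [("home", 3), ("pricing", 1)]  # already covered by the last batch run
recent = [("home", 2), ("docs", 4)]         # arrived since then

# Lambda: two code paths whose results are merged at query time.
batch_view = build_view(historical)  # batch layer, recomputed on a schedule
speed_view = build_view(recent)      # speed layer, updated as events arrive
lambda_home = batch_view["home"] + speed_view["home"]

# Kappa: one streaming path over the full event log.
kappa_view = build_view(historical + recent)
kappa_home = kappa_view["home"]
```

Both approaches answer the query with the same number. Lambda pays for that with two layers to maintain, while Kappa relies on being able to replay the whole log through a single path.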

Pipeline Patterns

ETL (Extract, Transform, Load)

Transform data before loading it into the warehouse:

Source → Transform (staging) → Load (warehouse)

Best when you need data quality guarantees before loading.
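
One way to picture the quality guarantee is a validation gate that runs on the staged batch before anything is written. The sketch below assumes a hypothetical `customers` table and made-up rules (unique ids, two-letter country codes); SQLite stands in for the warehouse.

```python
import sqlite3
from typing import Dict, List


def transform(raw: List[Dict[str, str]]) -> List[tuple]:
    """Staging step: normalise types and formats outside the warehouse."""
    return [(int(r["id"]), r["country"].strip().upper()) for r in raw]


def validate(staged: List[tuple]) -> None:
    """Quality gate: raise before anything is loaded."""
    ids = [row[0] for row in staged]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate ids in staged batch")
    if any(len(country) != 2 for _, country in staged):
        raise ValueError("country must be a 2-letter code")


def run_etl(raw: List[Dict[str, str]], conn: sqlite3.Connection) -> int:
    staged = transform(raw)
    validate(staged)  # a failure here leaves the warehouse untouched
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO customers VALUES (?, ?)", staged)
    return len(staged)


warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")

good = [{"id": "1", "country": " us "}, {"id": "2", "country": "de"}]
bad = [{"id": "3", "country": "Germany"}]

loaded = run_etl(good, warehouse)
try:
    run_etl(bad, warehouse)
    rejected = False
except ValueError:
    rejected = True
```

The invalid batch never reaches the `customers` table, which is the property ETL offers when downstream consumers cannot tolerate bad rows.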

ELT (Extract, Load, Transform)

Load raw data first, then transform it inside the warehouse:

Source → Load (warehouse raw) → Transform (warehouse clean)

Best when your warehouse has powerful compute (Snowflake, BigQuery).
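
A small sketch of the same idea, using an in-memory SQLite database as a stand-in for a cloud warehouse. The table names `raw_orders` and `clean_orders` and the sample rows are assumptions made for illustration; in practice the SQL step is often managed by a tool such as dbt.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stands in for Snowflake or BigQuery

# Load: land the source rows untouched, including messy values.
warehouse.execute("CREATE TABLE raw_orders (id TEXT, status TEXT, amount TEXT)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", " Shipped ", "19.99"), ("2", "CANCELLED", "5.00"), ("3", "shipped", "")],
)

# Transform: clean and type the data with SQL, inside the warehouse.
warehouse.execute(
    """
    CREATE TABLE clean_orders AS
    SELECT CAST(id AS INTEGER)  AS id,
           LOWER(TRIM(status))  AS status,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE TRIM(amount) <> ''
    """
)
warehouse.commit()

clean = warehouse.execute(
    "SELECT id, status, amount FROM clean_orders ORDER BY id"
).fetchall()
```

The raw table is kept as loaded, so the transformation can be changed and rerun later without going back to the source system.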

Reverse ETL

Move data from the warehouse back to operational tools:

Warehouse → CRM, marketing tools, product databases
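
As a rough sketch, reverse ETL reads modelled data out of the warehouse and reshapes it for an operational tool. Everything named here is hypothetical: the `customer_metrics` table, the `vip_threshold` rule, and `push_to_crm`, which appends to a list where a real sync would call the CRM's API.

```python
import sqlite3
from typing import Dict, List

warehouse = sqlite3.connect(":memory:")  # stands in for the warehouse
warehouse.execute("CREATE TABLE customer_metrics (email TEXT, lifetime_value REAL)")
warehouse.executemany(
    "INSERT INTO customer_metrics VALUES (?, ?)",
    [("alice@example.com", 1200.0), ("bob@example.com", 80.0)],
)


def build_crm_updates(conn: sqlite3.Connection, vip_threshold: float) -> List[Dict[str, object]]:
    """Read modelled data from the warehouse and shape it for an operational tool."""
    rows = conn.execute(
        "SELECT email, lifetime_value FROM customer_metrics ORDER BY email"
    )
    return [
        {"email": email, "ltv": ltv, "segment": "vip" if ltv >= vip_threshold else "standard"}
        for email, ltv in rows
    ]


def push_to_crm(updates: List[Dict[str, object]], sink: List[Dict[str, object]]) -> None:
    """Hypothetical sink; a real sync would call the CRM's API here."""
    sink.extend(updates)


crm_records: List[Dict[str, object]] = []
push_to_crm(build_crm_updates(warehouse, vip_threshold=1000.0), crm_records)
```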

Key Infrastructure Components

| Component | Purpose | Tools |
|-----------|---------|-------|
| Ingestion | Extract data from sources | Fivetran, Airbyte, Kafka Connect |
| Orchestration | Schedule and monitor pipelines | Airflow, Dagster, Prefect |
| Transformation | Clean and model data | dbt, Spark, Dataform |
| Storage | Store processed data | Snowflake, BigQuery, S3 + Iceberg |
| Quality | Validate data integrity | Great Expectations, dbt tests, Soda |
| Observability | Monitor pipeline health | Monte Carlo, Datadog, Atlan |
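
To show what the orchestration layer does at its core, the sketch below orders tasks by their dependencies using Python's standard-library `graphlib`. It is not the API of Airflow, Dagster, or Prefect; the task names and the `make_task` helper are hypothetical, and real orchestrators add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set

run_log: List[str] = []


def make_task(name: str) -> Callable[[], None]:
    """Build a placeholder task that only records that it ran."""
    def task() -> None:
        run_log.append(name)
    return task


tasks = {
    name: make_task(name)
    for name in ["ingest", "transform", "quality_checks", "publish"]
}

# Each task maps to the set of tasks that must finish before it starts.
dependencies: Dict[str, Set[str]] = {
    "ingest": set(),
    "transform": {"ingest"},
    "quality_checks": {"transform"},
    "publish": {"quality_checks"},
}

# Run every task only after all of its upstream tasks have run.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()
```

Declaring dependencies rather than a fixed run order is what lets an orchestrator skip, retry, or parallelise tasks safely.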

Best Practices

1. Idempotency: Rerunning a pipeline on the same input produces the same result as running it once, with no duplicates or double counting
2. Incremental Processing: Only process new or changed data
3. Schema Evolution: Handle schema changes gracefully
4. Testing: Unit-test transforms and integration-test the pipeline end to end
5. Alerting: Get notified on failures, delays, and data quality issues
6. Documentation: Document data lineage and business logic
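
The first two practices can be sketched together: a watermark limits each run to rows changed since the last load, and a keyed upsert makes reruns safe. This is a minimal illustration assuming a hypothetical `orders` table with ISO-8601 `updated_at` strings, with SQLite standing in for the warehouse.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, str, str]  # (id, status, updated_at as ISO-8601 text)


def incremental_load(source: List[Row], conn: sqlite3.Connection) -> int:
    """Upsert only rows newer than the stored watermark; return how many were processed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    # Watermark: the newest timestamp already loaded ('' when the table is empty).
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM orders"
    ).fetchone()
    new_rows = [row for row in source if row[2] > watermark]
    with conn:
        # Keyed upsert: replaying a row overwrites it instead of duplicating it.
        conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", new_rows)
    return len(new_rows)


conn = sqlite3.connect(":memory:")
source = [(1, "placed", "2024-01-01T10:00:00"), (2, "placed", "2024-01-01T11:00:00")]

first_run = incremental_load(source, conn)
second_run = incremental_load(source, conn)  # rerun with no new data

source.append((1, "shipped", "2024-01-02T09:00:00"))  # order 1 changed upstream
third_run = incremental_load(source, conn)
```

The strict `>` comparison would miss a late-arriving row that shares the watermark's exact timestamp, so real pipelines often reprocess a small overlap window and rely on the upsert to keep the result idempotent.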

Sainath Reddy
