Data lineage is the record of data's complete lifecycle, from origin to consumption. It documents where data comes from, how it moves through systems, what transformations are applied along the way, and what depends on it. Think of it as a visual map of your data's journey.
Types of Data Lineage
1. Technical Lineage: Column-to-column mappings and the SQL transformations behind them
2. Business Lineage: High-level flow between business concepts
3. Operational Lineage: Runtime execution details and timing
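As a rough illustration of technical lineage, the sketch below stores column-to-column mappings in a plain dictionary. The table names, column names, and SQL expressions are all hypothetical, and real tools keep this information in a metadata store rather than in code.

```python
# Hypothetical column-level lineage: each target column maps to the
# source columns it is derived from, plus the SQL expression used.
COLUMN_LINEAGE = {
    "analytics.daily_revenue.order_date": {
        "sources": ["raw.orders.created_at"],
        "expression": "CAST(created_at AS DATE)",
    },
    "analytics.daily_revenue.revenue": {
        "sources": ["raw.orders.quantity", "raw.orders.unit_price"],
        "expression": "SUM(quantity * unit_price)",
    },
}


def upstream_columns(column: str) -> list[str]:
    """Return the direct source columns of a target column (empty if unknown)."""
    entry = COLUMN_LINEAGE.get(column)
    return list(entry["sources"]) if entry else []


print(upstream_columns("analytics.daily_revenue.revenue"))
```

Business lineage would sit one level above this, linking concepts such as "orders" and "revenue" without the column detail.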
Why Data Lineage Matters
- Impact Analysis: Know what breaks if you change a source
- Root Cause Analysis: Trace data issues to their origin
- Compliance: Demonstrate data handling for audits
- Trust: Understand where dashboard numbers come from
- Migration: Plan system changes with confidence
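Impact analysis and root cause analysis are the same graph walk in opposite directions. The following is a minimal sketch, assuming lineage is held as a simple adjacency map; the asset names are made up, and `impact_of` and `root_causes_of` are illustrative helpers rather than functions from any lineage tool.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
DOWNSTREAM = {
    "crm_api": ["raw.customers"],
    "raw.customers": ["staging.customers"],
    "raw.orders": ["staging.orders"],
    "staging.customers": ["marts.customer_ltv"],
    "staging.orders": ["marts.customer_ltv", "marts.daily_revenue"],
    "marts.customer_ltv": ["dashboard.retention"],
    "marts.daily_revenue": ["dashboard.finance"],
}


def invert(graph):
    """Flip every edge so that children point back to their parents."""
    inverted = {}
    for parent, children in graph.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return inverted


def reachable(graph, start):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


def impact_of(asset):
    """Impact analysis: everything downstream that could break."""
    return reachable(DOWNSTREAM, asset)


def root_causes_of(asset):
    """Root cause analysis: everything upstream that could be at fault."""
    return reachable(invert(DOWNSTREAM), asset)


print(sorted(impact_of("raw.orders")))
print(sorted(root_causes_of("dashboard.retention")))
```

Changing `raw.orders` here touches both marts and both dashboards, while a wrong number on `dashboard.retention` could originate in any of six upstream assets.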
Data Lineage Components
- Source: Where data originates (database, API, file)
- Transformation: How data is modified (joins, aggregations)
- Destination: Where data lands (warehouse, dashboard)
- Dependencies: What downstream systems rely on this data
- Metadata: Column names, types, descriptions
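One way to picture how these components fit together is a single record per lineage hop. This is only a sketch: the `LineageRecord` and `ColumnMeta` classes and every value in the example are assumptions for illustration, not the schema of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMeta:
    """Metadata for one column: name, type, and a free-text description."""
    name: str
    dtype: str
    description: str = ""


@dataclass
class LineageRecord:
    """One hop of lineage, mirroring the five components listed above."""
    source: str                      # where the data originates
    transformation: str              # how the data is modified
    destination: str                 # where the data lands
    dependencies: list[str] = field(default_factory=list)   # downstream consumers
    columns: list[ColumnMeta] = field(default_factory=list)  # column metadata


# Hypothetical example values.
record = LineageRecord(
    source="postgres://shop/orders",
    transformation="JOIN customers, then SUM(amount) GROUP BY order_date",
    destination="warehouse.marts.daily_revenue",
    dependencies=["dashboard.finance"],
    columns=[
        ColumnMeta("order_date", "DATE", "Calendar day of the order"),
        ColumnMeta("revenue", "NUMERIC", "Total order amount for the day"),
    ],
)

print(record.destination, [c.name for c in record.columns])
```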
Lineage Capture Methods
| Method | Pros | Cons |
|--------|------|------|
| SQL Parsing | Accurate, automatic | Complex to implement |
| API Integration | Real-time | Vendor-specific |
| Manual Documentation | Flexible | Goes out of date quickly |
| Log Analysis | Runtime truth | Incomplete picture |
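To give a feel for the SQL parsing row, here is a deliberately naive sketch that pulls the target and source tables out of a statement with regular expressions. The query and table names are hypothetical. It ignores CTEs, subqueries, quoted identifiers, and constructs such as `EXTRACT(... FROM ...)`, which is a large part of why production-grade parsing is complex to implement and normally relies on a full SQL parser.

```python
import re

# Naive patterns: the table written to, and the tables read from.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql: str) -> dict:
    """Return the target table and the sorted, de-duplicated source tables."""
    target = TARGET_RE.search(sql)
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return {"target": target.group(1) if target else None, "sources": sources}


# Hypothetical statement used only to exercise the function.
sql = """
INSERT INTO marts.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM staging.orders AS o
JOIN staging.customers AS c ON c.id = o.customer_id
GROUP BY o.order_date
"""

print(extract_lineage(sql))
```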
Data Lineage Tools
- Atlan: Active metadata platform with lineage
- Alation: Data catalog with lineage
- OpenLineage: Open standard for lineage
- dbt: Built-in lineage for SQL models
- DataHub: Open-source metadata platform
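Several of these tools exchange lineage as events emitted while jobs run. The sketch below builds a simplified event as a plain dictionary, loosely modeled on the general shape of an OpenLineage run event (event type, time, run, job, inputs, outputs). It is an assumption-laden illustration: it does not use the OpenLineage client library, omits fields the real specification requires, and has not been validated against that specification. The namespace, job name, dataset names, and producer URL are placeholders.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, namespace="example_namespace"):
    """Build a simplified, OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        "producer": "https://example.com/hypothetical-producer",  # placeholder
    }


event = build_run_event(
    "COMPLETE",
    "build_daily_revenue",
    inputs=["staging.orders", "staging.customers"],
    outputs=["marts.daily_revenue"],
)

print(json.dumps(event, indent=2))
```

Collecting such events over time yields both the dependency graph (inputs and outputs per job) and the runtime details that operational lineage cares about.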