Change Data Capture (CDC) is a technique for identifying and capturing the changes (inserts, updates, and deletes) made to data in a database. Instead of repeatedly copying entire tables, CDC streams only the changes, which enables efficient, near-real-time data replication.
Why CDC Matters
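Traditional batch extraction has several drawbacks:
- Full loads: Copy entire tables on every run, even when only a few rows changed
- Timestamp-based queries: Miss hard deletes and depend on a reliably maintained last-modified column
- Performance impact: Both approaches run heavy queries against the source system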
CDC solves these issues by capturing changes at the source.
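To make the timestamp-based limitation concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` table, its `updated_at` column, and the `poll_changes` helper are hypothetical names chosen for illustration. The second poll sees the update but has no way to notice the deleted row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table with a last-modified column (integer "clock").
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 100)],
)


def poll_changes(conn, last_seen):
    """Return rows whose updated_at is newer than the last_seen watermark."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY id",
        (last_seen,),
    ).fetchall()


# The first poll picks up both rows and advances the watermark.
first_poll = poll_changes(conn, 0)
watermark = max(row[2] for row in first_poll)

# An update bumps updated_at; a delete removes the row entirely.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 200 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

second_poll = poll_changes(conn, watermark)
print(second_poll)  # [(1, 'Ada L.', 200)]
```

The deleted row with `id = 2` never shows up in `second_poll`, so a target fed only by this query would keep a stale copy of it.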
CDC Methods
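1. Log-Based CDC: Reads the database transaction log (lowest overhead on the source; sees every committed change, including deletes)
2. Trigger-Based CDC: Database triggers record each change as it happens (adds write overhead to every transaction)
3. Timestamp-Based: Queries for recently modified rows (not true CDC, since deletes and intermediate updates are missed)
4. Diff-Based: Compares successive snapshots (resource intensive)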
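As an illustration of the trigger-based method, the sketch below uses SQLite through Python's built-in `sqlite3` module. The `orders` table, the `orders_changes` change table, and the trigger names are all hypothetical. Production databases have their own trigger syntax, so treat this as a sketch of the idea rather than a portable implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

-- Hypothetical change table: one row per captured change.
CREATE TABLE orders_changes (
    seq    INTEGER PRIMARY KEY AUTOINCREMENT,
    op     TEXT NOT NULL,
    id     INTEGER NOT NULL,
    status TEXT
);

-- One trigger per operation type writes into the change table.
CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, id, status) VALUES ('I', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, id, status) VALUES ('U', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, id, status) VALUES ('D', OLD.id, NULL);
END;
""")

# Ordinary writes against the source table; the triggers fire implicitly.
conn.execute("INSERT INTO orders VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

changes = conn.execute(
    "SELECT op, id, status FROM orders_changes ORDER BY seq"
).fetchall()
print(changes)  # [('I', 1, 'new'), ('U', 1, 'shipped'), ('D', 1, None)]
```

Unlike timestamp polling, the delete is captured here. The cost is that every write to `orders` now performs an extra insert inside the same transaction.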
Log-Based CDC Flow
````
Source DB   →  Transaction Log  →  CDC Tool    →  Target System
(MySQL,        (binlog, WAL)       (Debezium)     (Kafka, DW)
 Postgres)
````
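The last hop of this flow, applying change events to a target, can be sketched in a few lines of Python. The events below are simplified stand-ins modelled on Debezium's change-event envelope (`op`, `before`, `after`); real messages carry additional fields such as source metadata and timestamps. The `apply_event` function, the in-memory `replica`, and the assumption that every row has an `id` primary key are hypothetical choices for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_event(target: Dict[int, Row], event: Dict[str, Any]) -> None:
    """Apply one simplified change event to an in-memory replica keyed by id."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, and update are all upserts of the new row image.
        row = event["after"]
        target[row["id"]] = row
    elif op == "d":
        # Deletes carry the old row image in "before"; tolerate repeats.
        target.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


events: List[Dict[str, Any]] = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {
        "op": "u",
        "before": {"id": 1, "email": "a@example.com"},
        "after": {"id": 1, "email": "a2@example.com"},
    },
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]

replica: Dict[int, Row] = {}
for event in events:
    apply_event(replica, event)

print(replica)  # {1: {'id': 1, 'email': 'a2@example.com'}}
```

Treating creates and updates as upserts, and ignoring deletes of rows that are already gone, means a redelivered event leaves the replica in the same state. That matters when the transport may deliver an event more than once.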
Popular CDC Tools
| Tool | Type | Best For |
|------|------|----------|
| Debezium | Open Source | Kafka integration |
| Fivetran | Managed | Easy setup |
| Airbyte | Open Source | Self-hosted |
| AWS DMS | Cloud | AWS ecosystems |
| Striim | Enterprise | Complex transforms |
CDC Use Cases
- Data Warehousing: Near-real-time warehouse updates
- Microservices: Sync data between services
- Event Sourcing: Capture all state changes
- Cache Invalidation: Update caches on data change
- Search Indexing: Keep Elasticsearch in sync