Change Data Capture (CDC)

A technique for identifying and capturing changes made to data in a database, enabling real-time or near-real-time data replication to other systems.

CDC tracks row-level changes (inserts, updates, deletes) at the source database. Instead of repeatedly copying entire tables, it streams only what changed, which keeps downstream systems current while moving far less data.

Why CDC Matters

Traditional batch extraction has several drawbacks:
- Full loads: Copy entire tables repeatedly, even when few rows changed
- Timestamp-based queries: Miss deletes and depend on a reliably maintained last-modified column
- Performance impact: Run heavy queries against source systems

CDC solves these issues by capturing changes at the source.
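The delete problem with timestamp-based extraction can be shown with a small sketch. It uses Python's built-in `sqlite3` module; the `customers` table, its `updated_at` column, and the `poll_changes` helper are hypothetical names chosen for illustration, not part of any particular tool.

```python
import sqlite3


def poll_changes(conn, since):
    """Timestamp-based extraction: return rows modified after `since`."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY id",
        (since,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 100)],
)

last_sync = 100  # watermark recorded by the previous extraction run

# Two changes happen after the last sync: one update and one delete.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 101 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

changes = poll_changes(conn, last_sync)
print(changes)  # [(1, 'Ada L.', 101)]
```

The poll returns the updated row, but the deleted row simply no longer exists, so nothing tells the target system to remove customer 2.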

CDC Methods

1. Log-Based CDC: Read the database transaction log (typically the lowest overhead on the source)
2. Trigger-Based CDC: Database triggers record each change in a change table (adds work to every write)
3. Timestamp-Based: Query for recently modified rows (not true CDC, since deletes are missed)
4. Diff-Based: Compare successive snapshots (resource intensive)
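As a rough illustration of the trigger-based method, the sketch below uses SQLite through Python's `sqlite3` module. The `orders` table, the `orders_changes` change table, and the `I`/`U`/`D` operation codes are assumptions made for this example; production databases would use their own trigger syntax and schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

-- Hypothetical change table filled by the triggers below.
CREATE TABLE orders_changes (
    seq      INTEGER PRIMARY KEY AUTOINCREMENT,
    op       TEXT NOT NULL,
    order_id INTEGER NOT NULL,
    status   TEXT
);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status)
    VALUES ('I', NEW.id, NEW.status);
END;

CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status)
    VALUES ('U', NEW.id, NEW.status);
END;

CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status)
    VALUES ('D', OLD.id, NULL);
END;
""")

# Ordinary writes against the source table; the triggers log each one.
conn.execute("INSERT INTO orders VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

captured = conn.execute(
    "SELECT op, order_id, status FROM orders_changes ORDER BY seq"
).fetchall()
print(captured)  # [('I', 1, 'new'), ('U', 1, 'shipped'), ('D', 1, None)]
```

Unlike timestamp polling, the delete is captured. The cost is that every write to `orders` now performs a second write inside the same transaction.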

Log-Based CDC Flow

```
Source DB         → Transaction Log → CDC Tool   → Target System
(MySQL, Postgres)   (binlog, WAL)     (Debezium)   (Kafka, DW)
```
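The last hop of this flow, applying change events to a target, might look like the sketch below. The event shape (`op`, `before`, `after`) is loosely modeled on Debezium's change event envelope and heavily simplified. The `apply_change` function, the in-memory `replica` dictionary, and the assumption that every row has an `id` primary key are illustrative only.

```python
from typing import Any


def apply_change(replica: dict[int, dict[str, Any]], event: dict[str, Any]) -> None:
    """Apply one simplified change event to an in-memory replica keyed by id."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: upsert new state
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":  # delete: only the old row state is available
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Hypothetical events in the order they were written to the transaction log.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {
        "op": "u",
        "before": {"id": 1, "email": "a@example.com"},
        "after": {"id": 1, "email": "a@new.example.com"},
    },
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]

replica: dict[int, dict[str, Any]] = {}
for event in events:
    apply_change(replica, event)

print(replica)  # {1: {'id': 1, 'email': 'a@new.example.com'}}
```

Treating creates and updates as the same upsert keeps the consumer idempotent, which helps when events are redelivered after a restart.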

Popular CDC Tools

| Tool | Type | Best For |
|------|------|----------|
| Debezium | Open Source | Kafka integration |
| Fivetran | Managed | Easy setup |
| Airbyte | Open Source | Self-hosted |
| AWS DMS | Cloud | AWS ecosystems |
| Striim | Enterprise | Complex transforms |

CDC Use Cases

- Data Warehousing: Near-real-time warehouse updates
- Microservices: Sync data between services
- Event Sourcing: Capture all state changes
- Cache Invalidation: Update caches on data change
- Search Indexing: Keep Elasticsearch in sync

Frequently Asked Questions

What is Change Data Capture (CDC)?

CDC is a technique for capturing changes made to data in a database (inserts, updates, deletes) and streaming them to other systems. It enables real-time data replication without copying entire tables.

Why is CDC better than batch extraction?

CDC is more efficient because it transfers only changed data, not entire tables. Log-based and trigger-based CDC also capture deletes (which timestamp queries miss), place less load on source systems, and enable near-real-time data freshness.

What is Debezium?

Debezium is an open-source CDC platform that reads database transaction logs and streams changes to Apache Kafka. It supports MySQL, PostgreSQL, MongoDB, SQL Server, and other databases.

What is log-based CDC?

Log-based CDC reads the database transaction log (binlog in MySQL, WAL in PostgreSQL) to capture changes. It is generally the most efficient CDC method because it reads changes the database has already recorded instead of running queries against the tables.

Last updated: 2026-01-21

Published by

Sainath Reddy

Data Engineer at Anblicks
🎯 4+ years experience