Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse storage framework that brings database-like capabilities to data lakes. Originally created at Uber to solve their massive-scale incremental data processing challenges, Hudi enables efficient upserts, deletes, and incremental queries on data stored in cloud storage like S3 or GCS.
Why Apache Hudi?
Traditional data lakes store immutable files (Parquet, ORC). To update a single record, you must rewrite the file that contains it, and in many pipelines the entire partition, which is slow and expensive. Hudi solves this by introducing:
- Record-level operations: Insert, update, and delete individual records efficiently
- ACID transactions: Ensure data consistency during concurrent reads and writes
- Incremental processing: Process only changed data since the last query
- Time travel: Query the table as of an earlier commit, as long as that commit's files are still retained
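To make record-level operations concrete, here is a minimal sketch of the options usually passed to Hudi's Spark datasource for an upsert or delete. The helper `hudi_write_options`, the table name `trips`, and the field names are hypothetical. The option keys are the commonly documented Spark datasource keys, so check them against the Hudi version you run.

```python
# Hypothetical helper: builds the option map for a Hudi write through Spark.
VALID_OPERATIONS = {"insert", "upsert", "bulk_insert", "delete"}


def hudi_write_options(table_name, record_key, precombine_field,
                       partition_field, operation="upsert"):
    """Return Spark datasource options for a record-level Hudi write."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        # Field that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick a winner when two records share a key.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


# Assumed example names, not taken from any real table.
upsert_options = hudi_write_options("trips", "trip_id", "updated_at", "city")
delete_options = hudi_write_options("trips", "trip_id", "updated_at", "city",
                                    operation="delete")

# With a SparkSession that has the Hudi bundle available, the write would
# look roughly like this (the path is a placeholder):
# df.write.format("hudi").options(**upsert_options).mode("append").save("s3://example-bucket/trips")
```

The same helper covers deletes by switching the operation, which is how a targeted record removal would typically be requested.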
Core Table Types
Copy on Write (CoW)
- Stores data in columnar Parquet files
- An update rewrites each affected file as a new version at write time
- Best for: Read-heavy workloads, fewer updates
- Trade-off: Faster reads, slower writes
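The write cost of Copy on Write can be illustrated with a toy in-memory model. This is not Hudi's implementation: `CowTable` and its fields are hypothetical, and a "file" here is just a dictionary of rows. The point is only that every upsert copies all rows of the affected file into a new version.

```python
from dataclasses import dataclass, field


@dataclass
class CowTable:
    """Toy copy-on-write table: each file keeps a list of immutable versions."""
    files: dict = field(default_factory=dict)  # file_id -> [version, ...]
    rewritten_rows: int = 0                    # rows written across all rewrites

    def upsert(self, file_id, key, row):
        versions = self.files.setdefault(file_id, [{}])
        new_version = dict(versions[-1])  # copy every existing row
        new_version[key] = row            # apply the single change
        versions.append(new_version)      # older versions stay readable
        self.rewritten_rows += len(new_version)

    def read(self, file_id):
        # Readers only ever look at the latest complete version.
        return self.files[file_id][-1]


table = CowTable()
table.upsert("f1", 1, {"v": "a"})
table.upsert("f1", 2, {"v": "b"})
table.upsert("f1", 1, {"v": "c"})  # updates key 1, but rewrites both rows

latest = table.read("f1")
```

After three single-record upserts the model has written five rows in total (1 + 2 + 2), while a read touches just one version with no merging. That is the read-fast, write-slow trade-off in miniature.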
Merge on Read (MoR)
- Stores base data in columnar Parquet files and appends updates to row-based log files, which compaction periodically merges into new base files
- Best for: Write-heavy workloads, frequent updates
- Trade-off: Faster writes, slightly slower reads until compaction, because pending log files are merged at query time
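A similar toy model shows how Merge on Read moves the cost from the writer to the reader. Again this is only a sketch under simplifying assumptions: `MorFileGroup` is hypothetical, the "log" is a Python list, and real Hudi log files and compaction scheduling are far more involved. The two read methods loosely mirror Hudi's snapshot and read-optimized query types.

```python
class MorFileGroup:
    """Toy merge-on-read file group: a base file plus an append-only log."""

    def __init__(self, base_rows):
        self.base = dict(base_rows)
        self.log = []  # (key, row) entries; row None marks a delete

    def upsert(self, key, row):
        self.log.append((key, row))   # cheap append, no rewrite

    def delete(self, key):
        self.log.append((key, None))

    def snapshot_read(self):
        # Merge base and log at query time to get the latest view.
        merged = dict(self.base)
        for key, row in self.log:
            if row is None:
                merged.pop(key, None)
            else:
                merged[key] = row
        return merged

    def read_optimized(self):
        # Base file only: fast, but blind to uncompacted changes.
        return dict(self.base)

    def compact(self):
        # Fold the log into a new base file and start a fresh log.
        self.base = self.snapshot_read()
        self.log.clear()


group = MorFileGroup({1: "a", 2: "b"})
group.upsert(1, "a2")
group.delete(2)
group.upsert(3, "c")

before_compaction = (group.read_optimized(), group.snapshot_read())
group.compact()
after_compaction = (group.read_optimized(), group.snapshot_read())
```

Before compaction the two views differ: the read-optimized view still shows the old base rows, while the snapshot view pays a merge to show current data. After compaction they agree.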
Key Features
1. Upsert Support: Native INSERT, UPDATE, DELETE at record level
2. Incremental Queries: Read only new/changed data since a timestamp
3. Compaction: Background process that merges Merge on Read log files into base files to restore read performance
4. Metadata Table: Fast file listing for large datasets (millions of files)
5. Multi-Engine Support: Works with Spark, Flink, Presto, Trino, Hive
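As a hedged illustration of incremental queries and time travel from the reader's side, the helpers below build read options for Hudi's Spark datasource. The function names and the instant values are hypothetical. The option keys are those commonly documented for Hudi 0.x, and some have been renamed or deprecated in newer releases, so treat them as assumptions to verify.

```python
def incremental_read_options(begin_instant, end_instant=None):
    """Options for reading only commits after begin_instant (hypothetical helper)."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound on the commit range.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def time_travel_read_options(instant):
    """Options for reading the table as of one commit instant (hypothetical helper)."""
    return {"as.of.instant": instant}


# Placeholder instants in Hudi's yyyyMMddHHmmss style.
incremental = incremental_read_options("20240101000000", "20240102000000")
as_of = time_travel_read_options("20240101000000")

# With a SparkSession that has the Hudi bundle available (path is a placeholder):
# changes = spark.read.format("hudi").options(**incremental).load("s3://example-bucket/trips")
# past = spark.read.format("hudi").options(**as_of).load("s3://example-bucket/trips")
```

A downstream job would typically store the last instant it processed and pass it as the next `begin_instant`, so each run reads only what changed.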
Hudi vs Iceberg vs Delta Lake
| Feature | Hudi | Iceberg | Delta Lake |
|---------|------|---------|------------|
| Origin | Uber | Netflix | Databricks |
| Upsert Performance | Excellent | Good | Good |
| Read Performance | Good | Excellent | Excellent |
| Engine Support | Broad | Broadest | Spark-centric |
| Incremental Processing | Native | Via snapshots | Via Change Data Feed (CDF) |
Common Use Cases
1. CDC Ingestion: Stream database changes into the data lake
2. GDPR Compliance: Delete specific user records efficiently
3. Near Real-Time Analytics: Process streaming data with low latency
4. Data Pipeline Deduplication: Handle late-arriving and duplicate data
5. Incremental ETL: Process only changed data to reduce compute costs
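The deduplication use case rests on a simple rule: when several records in a batch share a key, keep the one with the highest ordering value. The sketch below shows that rule in plain Python. The function `precombine` and the sample fields are hypothetical, and resolving ties in favor of the later-arriving record is a choice made for this example rather than a statement about Hudi's behavior.

```python
def precombine(records, key_field, ordering_field):
    """Keep one record per key: the one with the largest ordering value."""
    latest = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        # A late-arriving record with a smaller ordering value never
        # overwrites a newer one; ties go to the later arrival here.
        if current is None or record[ordering_field] >= current[ordering_field]:
            latest[key] = record
    return list(latest.values())


batch = [
    {"id": 1, "ts": 10, "status": "created"},
    {"id": 1, "ts": 12, "status": "shipped"},
    {"id": 2, "ts": 5, "status": "created"},
    {"id": 1, "ts": 11, "status": "packed"},  # arrives late, older than ts=12
]

deduplicated = precombine(batch, key_field="id", ordering_field="ts")
```

Here the late `packed` event for `id` 1 is dropped because a newer `shipped` event already exists, leaving one record per key.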