🏢 Data Warehousing

Apache Hudi

An open-source data lakehouse platform that provides ACID transactions, record-level insert/update/delete, and incremental data processing on data lakes.

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse storage framework that brings database-like capabilities to data lakes. Originally created at Uber to handle its large-scale incremental data processing, Hudi enables efficient upserts, deletes, and incremental queries on data stored in cloud object storage such as Amazon S3 or Google Cloud Storage (GCS).

Why Apache Hudi?

Traditional data lakes store immutable files (Parquet, ORC). To update a single record, you have to rewrite at least the file that contains it, and many pipelines end up rewriting entire partitions, which is expensive and slow. Hudi addresses this by introducing:

- Record-level operations: Insert, update, and delete individual records efficiently
- ACID transactions: Ensure data consistency during concurrent reads and writes
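- Incremental processing: Process only the data that changed since a given commit instead of rescanning the whole table
- Time travel: Query the table as it was at an earlier commit, as long as that commit is still retained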
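
As a rough illustration of record-level upserts, the sketch below builds the options a Spark job would typically pass to the Hudi datasource. The option keys are the ones documented for Hudi's Spark datasource in 0.x releases (newer releases may rename some of them, so check the configuration reference for your version). The table and column names (`trips`, `trip_id`, `city`, `updated_at`) and the helper `hudi_upsert_options` are hypothetical.

```python
from typing import Dict


def hudi_upsert_options(
    table_name: str,
    record_key: str,
    partition_field: str,
    ordering_field: str,
) -> Dict[str, str]:
    """Build Spark datasource options for a record-level upsert (illustrative helper)."""
    return {
        "hoodie.table.name": table_name,
        # Field that uniquely identifies a record in the table.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field that decides which partition a record lands in.
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # When two incoming records share a key, the larger value here wins.
        "hoodie.datasource.write.precombine.field": ordering_field,
        # "upsert" updates existing keys and inserts new ones.
        "hoodie.datasource.write.operation": "upsert",
    }


# Hypothetical table and column names, for illustration only.
options = hudi_upsert_options("trips", "trip_id", "city", "updated_at")

# With a Spark session that has the Hudi bundle available, the write would
# look roughly like this (not executed here; df and base_path are assumed):
#   df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Switching `hoodie.datasource.write.operation` to `delete` turns the same write into a record-level delete for the keys present in the DataFrame.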

Core Table Types

Copy on Write (CoW)


- Stores data in columnar Parquet files
- Updates rewrite entire files on write
- Best for: Read-heavy workloads, fewer updates
- Trade-off: Faster reads, slower writes

Merge on Read (MoR)


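- Stores base data in columnar Parquet files and writes updates to row-based log files (Avro by default), which compaction periodically merges into the base files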
- Best for: Write-heavy workloads, frequent updates
- Trade-off: Faster writes, slightly slower reads (until compaction)
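
The trade-off between the two table types can be sketched with a toy in-memory model. This is not Hudi's implementation: the classes `CopyOnWriteFile` and `MergeOnReadFile` and the `records_rewritten` counter are invented for illustration, and a Python dict stands in for a Parquet base file.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


class CopyOnWriteFile:
    """Toy model: every update produces a fully rewritten base file."""

    def __init__(self, records: Dict[str, Record]) -> None:
        self.base: Dict[str, Record] = dict(records)
        self.records_rewritten = 0

    def upsert(self, key: str, record: Record) -> None:
        new_base = dict(self.base)  # copy every record in the file
        new_base[key] = record
        self.records_rewritten += len(new_base)
        self.base = new_base

    def read(self) -> Dict[str, Record]:
        return self.base  # no merge work at read time


class MergeOnReadFile:
    """Toy model: updates are appended to a log and merged when reading."""

    def __init__(self, records: Dict[str, Record]) -> None:
        self.base: Dict[str, Record] = dict(records)
        self.log: List[Tuple[str, Record]] = []
        self.records_rewritten = 0

    def upsert(self, key: str, record: Record) -> None:
        self.log.append((key, record))  # cheap append, base file untouched

    def read(self) -> Dict[str, Record]:
        merged = dict(self.base)
        for key, record in self.log:  # merge cost is paid by the reader
            merged[key] = record
        return merged

    def compact(self) -> None:
        self.base = self.read()  # fold the log into a new base file
        self.records_rewritten += len(self.base)
        self.log = []


initial = {f"k{i}": {"v": "old"} for i in range(1000)}
cow = CopyOnWriteFile(initial)
mor = MergeOnReadFile(initial)

for i in range(3):
    cow.upsert(f"k{i}", {"v": "new"})
    mor.upsert(f"k{i}", {"v": "new"})

pending_before_compaction = len(mor.log)
mor.compact()
```

In this model three updates make the copy-on-write file rewrite 3,000 records, while the merge-on-read file appends three log entries and rewrites 1,000 records once, at compaction. Both return the same data when read.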

Key Features

1. Upsert Support: Native INSERT, UPDATE, DELETE at record level
2. Incremental Queries: Read only new/changed data since a timestamp
3. Compaction: Background process to optimize read performance
4. Metadata Table: Fast file listing for large datasets (millions of files)
5. Multi-Engine Support: Works with Spark, Flink, Presto, Trino, Hive
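
Incremental queries are usually driven by read options on the Spark datasource. The sketch below assumes the option keys documented for Hudi 0.x releases (`hoodie.datasource.query.type`, `hoodie.datasource.read.begin.instanttime`, `hoodie.datasource.read.end.instanttime`, and `as.of.instant` for time travel); newer releases may use different names. The helper functions and the instant values are hypothetical.

```python
from typing import Dict, Optional


def hudi_incremental_read_options(
    begin_instant: str, end_instant: Optional[str] = None
) -> Dict[str, str]:
    """Options for reading only records written after begin_instant (illustrative)."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound; without it the read runs up to the latest commit.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def hudi_time_travel_options(as_of_instant: str) -> Dict[str, str]:
    """Options for reading the table as of an earlier commit (illustrative)."""
    return {"as.of.instant": as_of_instant}


# Hypothetical commit instants in yyyyMMddHHmmss form.
incremental = hudi_incremental_read_options("20240101000000")
bounded = hudi_incremental_read_options("20240101000000", "20240102000000")
snapshot_at = hudi_time_travel_options("20240101000000")

# With a Hudi-enabled Spark session the read would look roughly like this
# (not executed here; spark and base_path are assumed):
#   spark.read.format("hudi").options(**incremental).load(base_path)
```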

Hudi vs Iceberg vs Delta Lake

| Feature | Hudi | Iceberg | Delta Lake |
|---------|------|---------|------------|
| Origin | Uber | Netflix | Databricks |
| Upsert Performance | Excellent | Good | Good |
| Read Performance | Good | Excellent | Excellent |
| Engine Support | Broad | Broadest | Spark-centric |
| Incremental Processing | Native | Via snapshots | Via CDF |

Common Use Cases

1. CDC Ingestion: Stream database changes into the data lake
2. GDPR Compliance: Delete specific user records efficiently
3. Near Real-Time Analytics: Process streaming data with low latency
4. Data Pipeline Deduplication: Handle late-arriving and duplicate data
5. Incremental ETL: Process only changed data to reduce compute costs
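
To make the deduplication use case concrete, here is a simplified pure-Python sketch of the "latest record per key wins" rule that an ordering (precombine) field expresses. It is not Hudi's code; the function `deduplicate_latest` and the sample change events are invented for illustration.

```python
from typing import Any, Dict, Iterable, List


def deduplicate_latest(
    records: Iterable[Dict[str, Any]], key_field: str, ordering_field: str
) -> List[Dict[str, Any]]:
    """Keep one record per key: the one with the largest ordering value."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        # A late-arriving record with a smaller ordering value is ignored.
        if current is None or record[ordering_field] >= current[ordering_field]:
            latest[key] = record
    return list(latest.values())


changes = [
    {"id": 1, "status": "created", "ts": 100},
    {"id": 2, "status": "created", "ts": 105},
    {"id": 1, "status": "shipped", "ts": 130},
    {"id": 1, "status": "packed", "ts": 120},   # arrives late, older than "shipped"
    {"id": 2, "status": "created", "ts": 105},  # exact duplicate
]

deduplicated = deduplicate_latest(changes, key_field="id", ordering_field="ts")
```

Here the late "packed" event does not overwrite the newer "shipped" state, and the duplicate event for `id` 2 collapses into a single record.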

Frequently Asked Questions

What is Apache Hudi used for?

Apache Hudi is used for building data lakehouses — bringing database-like capabilities (upserts, deletes, ACID transactions) to data lakes. It's particularly valued for CDC ingestion, GDPR compliance (record deletion), and incremental data processing.

What is the difference between Apache Hudi and Apache Iceberg?

Both are table formats for data lakes. Hudi excels at write-heavy workloads with native upsert and incremental processing. Iceberg focuses on read performance and engine-agnostic design. Hudi was created at Uber; Iceberg was created at Netflix.

Is Apache Hudi free?

Yes. Apache Hudi is 100% free and open-source under the Apache 2.0 license. Commercial managed versions are available through platforms like AWS (EMR, Glue) and Onehouse.

Does Hudi work with Snowflake?

Hudi tables stored in cloud storage can be queried by Snowflake through external tables. However, Snowflake has stronger native support for Apache Iceberg as its preferred open table format.

Last updated: 2026-03-14

Published by

Sainath Reddy

Data Engineer at Anblicks
🎯 4+ years experience