Apache Iceberg

An open table format for huge analytic datasets that brings warehouse-like features (ACID transactions, time travel, schema evolution) to data lakes.

Apache Iceberg is an open table format designed for petabyte-scale analytic datasets. Created at Netflix, it addresses the problems that made data lakes unreliable by bringing ACID transactions, time travel, schema evolution, and partition evolution to files sitting in object storage (S3, GCS, ADLS).

The Problem Iceberg Solves

Traditional data lakes stored data as raw files (Parquet/ORC) with a Hive Metastore tracking partitions. This architecture had critical flaws:

- No ACID: Concurrent writes could corrupt data
- Partition Lock-in: Changing partition schemes required rewriting all data
- No Time Travel: Can't query historical states
- Schema Rigidity: Adding columns was painful
- Small File Problem: Too many small files destroyed performance

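Iceberg addresses each of these. The first four are handled by the table format itself; small files are handled through maintenance operations such as compaction, which still have to be run.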

How Iceberg Works

```
Query Engine (Spark/Trino/Flink)
          │
          ▼
┌───────────────────┐
│  Iceberg Catalog  │  (Where is the latest table state?)
│ (REST/Glue/Nessie)│
└─────────┬─────────┘
          │
┌─────────▼─────────┐
│  Metadata Layer   │  (Manifest files + Manifest Lists)
│  - Snapshots      │
│  - Manifests      │
│  - Statistics     │
└─────────┬─────────┘
          │
┌─────────▼─────────┐
│    Data Files     │  (Parquet/ORC/Avro in S3/GCS/ADLS)
└───────────────────┘
```
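The lookup path in the diagram can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions: the dictionaries and the `plan_files` helper are hypothetical names, not Iceberg's API, and real manifests are files with far richer statistics than a single min/max pair.

```python
# Toy model of the lookup path: catalog -> table metadata -> snapshot
# -> manifest list -> manifests -> data files. All names are hypothetical.

# The catalog stores only a pointer to the current metadata file.
catalog = {"db.orders": "metadata/v2.json"}

# Each metadata file lists snapshots; each snapshot names a manifest list.
metadata_files = {
    "metadata/v2.json": {
        "current_snapshot_id": 2,
        "snapshots": {1: "snap-1.avro", 2: "snap-2.avro"},
    }
}

# A manifest list names the manifests that make up one snapshot.
manifest_lists = {
    "snap-1.avro": ["manifest-a.avro"],
    "snap-2.avro": ["manifest-a.avro", "manifest-b.avro"],
}

# A manifest tracks data files plus per-file column statistics.
manifests = {
    "manifest-a.avro": [{"path": "data/f1.parquet", "min_id": 1, "max_id": 100}],
    "manifest-b.avro": [{"path": "data/f2.parquet", "min_id": 101, "max_id": 200}],
}


def plan_files(table, wanted_id, snapshot_id=None):
    """Return the data files that may contain wanted_id, using file stats."""
    meta = metadata_files[catalog[table]]
    snap = snapshot_id if snapshot_id is not None else meta["current_snapshot_id"]
    files = []
    for manifest in manifest_lists[meta["snapshots"][snap]]:
        for entry in manifests[manifest]:
            # Skip files whose min/max range cannot match the filter.
            if entry["min_id"] <= wanted_id <= entry["max_id"]:
                files.append(entry["path"])
    return files


print(plan_files("db.orders", 150))                 # current snapshot
print(plan_files("db.orders", 150, snapshot_id=1))  # older snapshot
```

The point of the sketch is that a query never lists directories in object storage: it follows pointers from the catalog down to an explicit list of files, and statistics let it skip files before opening them.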

Key Features

ACID Transactions


- Atomic commits: writes either fully succeed or fully fail
- Snapshot isolation: readers always see a consistent snapshot, while concurrent writers commit optimistically and retry on conflict
- No more corrupted tables from failed jobs
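One way to picture the atomic commit is as a compare-and-swap on the catalog's pointer to the current table metadata. The sketch below is a minimal in-memory model, assuming a hypothetical `ToyCatalog` class and `commit` helper; it is not Iceberg's implementation, and real catalogs rely on their own atomic primitives.

```python
import threading


class ToyCatalog:
    """Hypothetical catalog: holds one pointer per table, swapped atomically."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pointer = {}

    def current(self, table):
        return self._pointer.get(table)

    def compare_and_swap(self, table, expected, new):
        # The swap succeeds only if nobody committed since `expected` was read.
        with self._lock:
            if self._pointer.get(table) != expected:
                return False
            self._pointer[table] = new
            return True


def commit(catalog, table, make_metadata, max_retries=3):
    """Optimistic commit: build new metadata, then try to swap the pointer."""
    for _ in range(max_retries):
        base = catalog.current(table)
        new = make_metadata(base)  # new metadata is written next to the old one
        if catalog.compare_and_swap(table, base, new):
            return new
    raise RuntimeError("commit failed after retries")


# Usage: two writers start from the same base; only the first swap wins.
catalog = ToyCatalog()
commit(catalog, "orders", lambda base: "metadata-v1")
stale_base = catalog.current("orders")  # both writers read v1
commit(catalog, "orders", lambda base: "metadata-v2")
lost_race = catalog.compare_and_swap("orders", stale_base, "metadata-v2b")
print(lost_race, catalog.current("orders"))
```

Because a failed writer never moved the pointer, readers cannot observe its partial output, which is why a crashed job leaves the table in its last committed state.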

Time Travel


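```sql
-- Query data as it was at a given point in time
-- (Spark SQL syntax; other engines use different clauses)
SELECT * FROM orders
FOR SYSTEM_TIME AS OF TIMESTAMP '2026-02-26 09:00:00';

-- Roll back to a previous snapshot with Iceberg's Spark procedure
-- (my_catalog and db are placeholder names)
CALL my_catalog.system.rollback_to_snapshot('db.orders', 12345);
```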
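Resolving a timestamp to a snapshot amounts to finding the latest snapshot committed at or before that time. The following Python sketch illustrates the idea with a made-up snapshot log; `snapshot_log` and `snapshot_as_of` are hypothetical names, and the ids and times are invented for the example.

```python
from bisect import bisect_right
from datetime import datetime

# (commit time, snapshot id), sorted by commit time. Values are made up.
snapshot_log = [
    (datetime(2026, 2, 25, 8, 0), 12345),
    (datetime(2026, 2, 26, 8, 30), 12346),
    (datetime(2026, 2, 26, 17, 0), 12347),
]


def snapshot_as_of(log, ts):
    """Return the id of the latest snapshot committed at or before ts."""
    times = [t for t, _ in log]
    i = bisect_right(times, ts)
    if i == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return log[i - 1][1]


print(snapshot_as_of(snapshot_log, datetime(2026, 2, 26, 9, 0)))
```

A rollback works on the same log: it points the table back at an older snapshot id without deleting the data files that newer snapshots reference.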

Schema Evolution


- Add, drop, rename, or reorder columns without rewriting data
- Type promotion (int → long, float → double) is safe
- No downtime for schema changes
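These changes avoid rewrites because Iceberg identifies columns by a stable field ID rather than by name or position. The Python sketch below is a toy illustration of that idea: the row layout, the `read` helper, and the `can_promote` check are hypothetical and cover only the promotions named in the list above.

```python
# Toy model: stored rows are keyed by field id, not by column name.
old_file_rows = [{1: 1001, 2: 19.99}]           # written under schema v1
new_file_rows = [{1: 1002, 2: 5.0, 3: "EUR"}]   # written under schema v2

schema_v1 = {1: "order_id", 2: "amount"}
# v2 renames `amount` to `total` and adds `currency`; no files are rewritten.
schema_v2 = {1: "order_id", 2: "total", 3: "currency"}


def read(rows, schema):
    """Project stored rows through a schema; missing fields become None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


# Widening promotions mentioned in the list above.
SAFE_PROMOTIONS = {("int", "long"), ("float", "double")}


def can_promote(src, dst):
    """True if a column of type src may be changed to type dst."""
    return src == dst or (src, dst) in SAFE_PROMOTIONS


print(read(old_file_rows, schema_v2))
print(can_promote("int", "long"), can_promote("long", "int"))
```

Old files read through the new schema simply show the renamed column under its new name and return null for the column added later.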

Partition Evolution


- Change partition schemes without rewriting data
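- Start with daily partitions and switch to hourly later: existing files keep the old layout, and new writes use the new one
- Hidden partitioning: queries filter on the source column (for example, a timestamp) and Iceberg derives the partition values, so users don't need to know partition columns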
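Hidden partitioning and partition evolution can be sketched together: each data file records which partition spec it was written with, and a filter on the source column is translated through that spec. The Python below is a toy model under that assumption; the `day` and `hour` functions, the `SPECS` table, and the string partition values are illustrative stand-ins, not Iceberg's internal representation.

```python
from datetime import datetime


# Partition transforms derive a partition value from the source column.
def day(ts):
    return ts.strftime("%Y-%m-%d")


def hour(ts):
    return ts.strftime("%Y-%m-%d-%H")


# Spec 0 is the original daily layout; spec 1 is the hourly layout
# adopted later. Both stay valid for the files written under them.
SPECS = {0: day, 1: hour}

# Each data file remembers the spec it was written with.
files = [
    {"path": "f1.parquet", "spec": 0, "partition": "2026-02-25"},
    {"path": "f2.parquet", "spec": 1, "partition": "2026-02-26-09"},
    {"path": "f3.parquet", "spec": 1, "partition": "2026-02-26-10"},
]


def prune(files, event_time):
    """Keep files whose partition matches event_time under their own spec."""
    return [
        f["path"]
        for f in files
        if SPECS[f["spec"]](event_time) == f["partition"]
    ]


# The query filters on the timestamp itself, never on a partition column.
print(prune(files, datetime(2026, 2, 26, 9, 30)))
print(prune(files, datetime(2026, 2, 25, 23, 0)))
```

The caller only supplies a timestamp, which is why the partition scheme can change without touching queries or rewriting older files.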

Engine Compatibility


Iceberg tables can be read/written by multiple engines simultaneously:
- Apache Spark, Trino/Presto, Flink, Snowflake, BigQuery, Dremio, StarRocks

Iceberg vs Delta Lake vs Hudi

| Feature | Iceberg | Delta Lake | Hudi |
|---------|---------|------------|------|
| Governance | Apache Software Foundation | Linux Foundation (created by Databricks) | Apache Software Foundation |
| Engine Lock-in | None | Spark-first | Spark-first |
| Partition Evolution | ✅ Best-in-class | ❌ | Partial |
| Catalog Options | REST, Glue, Nessie, Polaris | Unity Catalog | Hive Metastore |
| Adoption Trend | 📈 Fastest growing | 📈 Strong (Databricks) | ➡️ Stable |

Frequently Asked Questions

What is Apache Iceberg used for?

Apache Iceberg is used to manage large analytical datasets in data lakes. It provides warehouse-like features (ACID transactions, time travel, schema evolution) to files stored in S3, GCS, or ADLS.

Is Apache Iceberg better than Delta Lake?

Iceberg is more engine-agnostic and has stronger partition evolution. Delta Lake is tightly integrated with Databricks and Spark. The best choice depends on your stack: Iceberg suits multi-engine environments, while Delta Lake suits Databricks-centric ones.

Does Snowflake support Apache Iceberg?

Yes. Snowflake supports Iceberg Tables natively, allowing you to query and manage Iceberg-formatted data in S3/GCS/ADLS directly from Snowflake. This is a core part of Snowflake's open data strategy.

What is the difference between Iceberg and Parquet?

Parquet is a file format (how data is stored). Iceberg is a table format (how files are organized and managed). Iceberg typically uses Parquet files underneath but adds metadata, transactions, and versioning on top.

Last updated: 2026-02-27

Published by

Sainath Reddy

Data Engineer at Anblicks
🎯 4+ years experience