Advanced • Last updated: 2026-04-09 • 3 sections
Expert interview questions on Apache Iceberg tables in Snowflake, external volumes, catalog integration, open table formats, and interoperability with Spark/Trino.
Q: What are Snowflake-managed Iceberg tables and how do they differ from externally managed Iceberg tables?
Snowflake-managed Iceberg tables: Snowflake controls the Iceberg catalog and writes Parquet data files and Iceberg metadata to your external volume (S3/GCS/Azure). You get full DML (INSERT, UPDATE, DELETE, MERGE) and Snowflake-native features (Time Travel, cloning, streams). External engines (Spark, Trino) can read the data via the Iceberg REST catalog.
Externally managed Iceberg tables: an external engine (e.g., Spark or AWS Glue) manages the catalog and data. Snowflake registers the table as read-only and queries it through the external catalog: you get read access, but no DML, no Time Travel, and no clustering.
Choose Snowflake-managed when Snowflake is the primary compute engine; choose externally managed when Spark/Glue owns the pipeline.
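The two flavors can be sketched in DDL. This is a hedged sketch: my_ext_vol, glue_cat_int, and the table names are all hypothetical, and the externally managed variant assumes an AWS Glue catalog integration has already been created.

```sql
-- Snowflake-managed: Snowflake owns the catalog; full DML is available.
-- (my_ext_vol is a hypothetical external volume.)
CREATE ICEBERG TABLE sales_managed (
  id INT,
  amount DECIMAL(10,2)
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_ext_vol'
  BASE_LOCATION = 'db/sales_managed/';

INSERT INTO sales_managed VALUES (1, 99.50);  -- DML works on managed tables

-- Externally managed: Snowflake reads via a catalog integration.
-- (glue_cat_int is a hypothetical catalog integration; the table is read-only.)
CREATE ICEBERG TABLE sales_external
  CATALOG = 'glue_cat_int'
  EXTERNAL_VOLUME = 'my_ext_vol'
  CATALOG_TABLE_NAME = 'sales';
```

Note the externally managed table has no column list: Snowflake derives the schema from the external catalog, and any DML against it fails.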
Q: Explain the external volume concept. Why is it required for Iceberg tables?
An external volume is a Snowflake object that references a cloud storage location (S3 bucket, GCS bucket, ADLS container) with appropriate IAM trust relationships. Iceberg tables store data as open-format Parquet files + Iceberg metadata JSON on this external storage — unlike native Snowflake tables which store data in Snowflake's managed internal storage. The external volume is required because the entire point of Iceberg is open format interoperability: the data files must be accessible to non-Snowflake engines. Setup: CREATE EXTERNAL VOLUME my_vol STORAGE_LOCATIONS = (STORAGE_BASE_URL = 's3://my-bucket/iceberg/', STORAGE_PROVIDER = 'S3', STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::...:role/...').
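The inline setup command above, expanded into a fuller sketch. The bucket name and role ARN are placeholders, not values from the original:

```sql
-- Hypothetical bucket and role ARN; substitute your own.
CREATE EXTERNAL VOLUME my_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location',
      STORAGE_PROVIDER = 'S3',
      STORAGE_BASE_URL = 's3://my-bucket/iceberg/',
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-iceberg-role'
    )
  );

-- Returns the Snowflake-side IAM user ARN and external ID that go
-- into the AWS role's trust policy to complete the handshake.
DESC EXTERNAL VOLUME my_vol;
```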
Q: What Snowflake features work with Iceberg tables and which don't?
Works: INSERT/UPDATE/DELETE/MERGE, COPY INTO, Snowpipe, Time Travel (Snowflake-managed only), cloning, streams, tasks, Dynamic Tables reading from Iceberg, RBAC, masking policies, row access policies.
Does NOT work: clustering (CLUSTER BY; Iceberg uses its own sort order), Search Optimization, materialized views on Iceberg tables, Snowflake-to-Snowflake sharing of Iceberg tables (share via an external catalog instead).
Performance note: Iceberg tables are typically 10-20% slower than native Snowflake tables for queries due to Parquet scan overhead versus Snowflake's proprietary format. The tradeoff is interoperability.
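As one concrete contrast from the lists above: a masking policy attaches to a Snowflake-managed Iceberg table, while clustering does not. A sketch, assuming the ALTER ICEBERG TABLE column syntax mirrors the native-table form (all names hypothetical):

```sql
-- Governance features apply to Iceberg columns as they do to native columns.
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'PII_READER' THEN val ELSE '***MASKED***' END;

ALTER ICEBERG TABLE customers_iceberg
  MODIFY COLUMN email SET MASKING POLICY mask_email;

-- Clustering, by contrast, is not supported on Iceberg tables:
-- ALTER ICEBERG TABLE customers_iceberg CLUSTER BY (region);  -- fails
```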
Q: How does schema evolution work with Iceberg tables in Snowflake?
Iceberg supports rich schema evolution: ADD COLUMN, DROP COLUMN, RENAME COLUMN, and type widening (int → long, float → double) without rewriting data files. In Snowflake: ALTER ICEBERG TABLE ADD COLUMN works and is metadata-only (instant). For externally managed tables, schema changes made by Spark/Glue are automatically detected on the next catalog refresh in Snowflake. Critical: Iceberg handles schema evolution at the metadata level — old data files retain their original schema, and the engine maps columns by field IDs (not names). This means RENAME COLUMN is safe (unlike Parquet without Iceberg, where renames break column mapping).
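A sketch of the evolution operations above; orders_iceberg (Snowflake-managed) and ext_orders (externally managed) are hypothetical names:

```sql
-- Metadata-only changes on a Snowflake-managed Iceberg table: no data rewrite.
ALTER ICEBERG TABLE orders_iceberg ADD COLUMN discount DOUBLE;
ALTER ICEBERG TABLE orders_iceberg RENAME COLUMN discount TO promo_discount;  -- safe: columns map by field ID
ALTER ICEBERG TABLE orders_iceberg DROP COLUMN promo_discount;

-- For an externally managed table, pull in schema changes made by Spark/Glue:
ALTER ICEBERG TABLE ext_orders REFRESH;
```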
Q: You're migrating a 10TB native Snowflake table to Iceberg format. Walk through the process and considerations.
Process:
1. Create an external volume pointing to S3/GCS.
2. CREATE ICEBERG TABLE new_table ... CATALOG = 'SNOWFLAKE' EXTERNAL_VOLUME = 'my_vol' BASE_LOCATION = 'db/table/' AS SELECT * FROM native_table.
3. Validate row counts and checksums.
4. Update downstream consumers to reference the new table.
5. Set up streams/tasks or Dynamic Tables on the Iceberg table.
Considerations:
1. A 10TB CTAS will consume significant warehouse credits; use a LARGE or larger warehouse.
2. Iceberg tables use more storage than native tables (Parquet compresses less well than Snowflake's proprietary format; expect roughly 30-50% more).
3. Time Travel retention starts fresh on the new Iceberg table.
4. Test query performance; expect queries roughly 10-20% slower.
5. If the table has CLUSTER BY, plan an Iceberg sort order to replace it.
6. If other engines need to read it, expose the table through an Iceberg REST catalog endpoint.
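The CTAS and validation steps can be sketched as follows; the warehouse and table names are illustrative, not from the original:

```sql
-- Size up for the 10TB copy (migrate_wh is a hypothetical warehouse).
ALTER WAREHOUSE migrate_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Copy the native table into a Snowflake-managed Iceberg table.
CREATE ICEBERG TABLE analytics_iceberg
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_vol'
  BASE_LOCATION = 'db/analytics/'
  AS SELECT * FROM analytics_native;

-- Validate: row counts and an aggregate hash should match across both rows.
SELECT 'native' AS src, COUNT(*) AS n, SUM(HASH(*)) AS checksum FROM analytics_native
UNION ALL
SELECT 'iceberg', COUNT(*), SUM(HASH(*)) FROM analytics_iceberg;
```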
Q: How does the Snowflake Open Catalog (Polaris) fit into the Iceberg ecosystem?
Polaris, originally developed and open-sourced by Snowflake and now an Apache incubating project (Snowflake Open Catalog is Snowflake's managed service built on it), is an open-source REST catalog implementation for Apache Iceberg. It provides a centralized catalog that multiple engines (Snowflake, Spark, Trino, Flink) can use to discover and manage Iceberg tables. With Polaris: (1) Snowflake registers as a catalog consumer/manager. (2) Spark jobs can read and write Iceberg tables that Snowflake also accesses. (3) Schema, partitioning, and snapshot metadata are shared across engines. Without Polaris you need a Glue catalog, Hive Metastore, or Nessie catalog, each with a different level of Snowflake integration. Polaris simplifies multi-engine architectures by providing a single source of truth for table metadata.
Q: Can you use Iceberg tables with Snowflake streams and Dynamic Tables?
Streams: Yes, you can create streams on Snowflake-managed Iceberg tables. They work identically to streams on native tables — tracking changes via Snowflake's internal change tracking, not Iceberg's snapshot log. For externally managed Iceberg tables: streams are NOT supported (Snowflake can't track external changes). Dynamic Tables: Yes, a Dynamic Table can read FROM an Iceberg table as a source. The DT itself is a native Snowflake object (not Iceberg). You can also create a Dynamic Table AS an Iceberg table by specifying ICEBERG in the DDL — giving you both automatic refresh and open format output.
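Both patterns, sketched with hypothetical names; this assumes the standard CREATE STREAM ... ON TABLE form applies to Snowflake-managed Iceberg tables:

```sql
-- Change tracking on a Snowflake-managed Iceberg table.
CREATE STREAM orders_changes ON TABLE orders_iceberg;

-- A Dynamic Table materialized AS Iceberg: automatic refresh
-- plus open-format Parquet output on the external volume.
CREATE DYNAMIC ICEBERG TABLE daily_totals
  TARGET_LAG = '1 hour'
  WAREHOUSE = transform_wh
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_vol'
  BASE_LOCATION = 'db/daily_totals/'
  AS SELECT order_date, SUM(amount) AS total_amount
     FROM orders_iceberg
     GROUP BY order_date;
```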
Default to native Snowflake tables unless you have a specific interoperability requirement. Native tables are faster, cheaper, and have full feature support. Use Iceberg if: multiple engines (Spark, Trino) need to access the same data, you're building a lakehouse architecture, or organizational policy requires open formats.
The Iceberg REST catalog is an API specification for discovering and managing Iceberg tables across engines. Snowflake can serve as a REST catalog endpoint, allowing Spark to discover tables that Snowflake manages. You need it when: non-Snowflake engines need to read your Snowflake-managed Iceberg tables. If only Snowflake accesses the data, you don't need it.