Snowflake Semi-Structured Data — Expert Interview Questions

AdvancedLast updated: 2026-04-09 • 3 sections

Expert questions on VARIANT, OBJECT, ARRAY data types, FLATTEN, LATERAL, JSON/XML/Parquet handling, and schema-on-read patterns.

VARIANT and Data Types

Q: How does Snowflake store semi-structured data internally?

Snowflake stores VARIANT data in an optimized columnar format. When JSON/Avro/Parquet is loaded into a VARIANT column, Snowflake automatically analyzes the structure and stores commonly-occurring paths as individual columns internally (columnar optimization). This means JSON queries like v:user:name can prune and scan efficiently, similar to native columns. VARIANT columns support up to 16MB per value. Performance: well-structured JSON with consistent schemas performs nearly as fast as native columns due to this optimization.

Q: VARIANT vs OBJECT vs ARRAY: when to use each?

VARIANT: the universal semi-structured type. Holds any JSON value (object, array, string, number, null). Use for: raw ingestion, flexible schemas. OBJECT: specifically a key-value map (JSON object). Use for: structured metadata, configuration. ARRAY: ordered list of values. Use for: tags, multi-value attributes. In practice, most teams use VARIANT for everything because it handles all three. OBJECT and ARRAY are useful for: explicit typing in UDFs, clearer schema intent in table definitions, and type checking (IS_OBJECT, IS_ARRAY).

Q: How do you query nested JSON efficiently?

Path notation: SELECT v:user:address:city FROM table. Bracket notation for special chars: v["user-name"]. For arrays: v:items[0]:name (first element). For deep nesting: chain paths v:a:b:c:d. Performance tips: (1) Create views that extract common paths as typed columns for downstream consumers. (2) Use :: for casting: v:price::NUMBER. (3) Filter on extracted paths: WHERE v:status::STRING = "active" leverages micro-partition pruning on the internal columnar representation. (4) For repeated queries on the same paths, materialize into native columns.

Q: What happens when you load Parquet vs JSON vs Avro into Snowflake?

JSON: loaded into VARIANT column (or specific columns via COPY INTO with column mapping). Nested structures preserved. Avro: schema is used to map to Snowflake columns automatically if column names match; otherwise lands in VARIANT ($1). Schemas from Confluent Schema Registry can auto-map. Parquet: columnar format, Snowflake reads column metadata and can map directly to table columns. Most efficient for large datasets. All three: support schema-on-read (load into VARIANT first, extract later) or schema-on-write (map during COPY INTO).

FLATTEN and Advanced Patterns

Q: Explain FLATTEN and when you need LATERAL.

FLATTEN converts a VARIANT array or object into rows (one row per element). Usage: SELECT f.value:name FROM table, LATERAL FLATTEN(input => v:items) f. LATERAL is required because FLATTEN references a column from the left table (the VARIANT column). Without LATERAL, the subquery cannot reference the outer table. Key parameters: input (the array/object), path (sub-path within), outer (TRUE = include rows even when array is empty/null, like LEFT JOIN), recursive (TRUE = flatten nested structures), mode (OBJECT/ARRAY/BOTH).

Q: How to handle deeply nested JSON with arrays of objects containing arrays?

Chain FLATTEN calls: SELECT t.id, f1.value:name::STRING AS item_name, f2.value::STRING AS tag FROM table t, LATERAL FLATTEN(input => t.v:orders) f1, LATERAL FLATTEN(input => f1.value:tags) f2. Each FLATTEN produces rows that multiply with the next. For complex structures, use CTEs to flatten one level at a time for readability. Performance consideration: chained FLATTEN on large arrays can produce massive row explosions. Add WHERE filters early to reduce intermediate rows.

Q: Schema-on-read vs materializing into native columns: when each?

Schema-on-read (query VARIANT directly): when schema changes frequently, during exploration/prototyping, for ad-hoc analytics on raw data. Materialize (extract into typed columns via Dynamic Table or view): when schema is stable, for production dashboards (faster queries), for downstream tools that need typed columns (BI tools), when you need clustering on extracted values. Hybrid pattern: land raw in VARIANT, materialize hot-path columns into a typed Dynamic Table, keep VARIANT for ad-hoc exploration.

Semi-Structured Tips

Frequently Asked Questions

Maximum VARIANT size?

16MB per VARIANT value. For larger documents, split into chunks before loading. Most JSON records are well under this limit.

Can you index VARIANT columns?

No traditional indexes, but: (1) Search Optimization Service supports VARIANT path expressions. (2) Snowflake auto-optimizes internal columnar storage for common paths. (3) Clustering on extracted expressions (CLUSTER BY (v:date::DATE)) works for pruning.

Related Cheat Sheets

Top 30 Snowflake Interview Questions & AnswersSnowflake Query Tuning — Expert Interview QuestionsSnowpipe Streaming & Kafka — Expert Interview Questions
← All Cheat Sheets