AdvancedLast updated: 2026-04-09 • 3 sections
Expert questions on VARIANT, OBJECT, ARRAY data types, FLATTEN, LATERAL, JSON/XML/Parquet handling, and schema-on-read patterns.
Q: How does Snowflake store semi-structured data internally?
Snowflake stores VARIANT data in an optimized columnar format. When JSON/Avro/Parquet is loaded into a VARIANT column, Snowflake automatically analyzes the structure and stores commonly-occurring paths as individual columns internally (columnar optimization). This means JSON queries like v:user:name can prune and scan efficiently, similar to native columns. VARIANT columns support up to 16MB per value. Performance: well-structured JSON with consistent schemas performs nearly as fast as native columns due to this optimization.
Q: VARIANT vs OBJECT vs ARRAY: when to use each?
VARIANT: the universal semi-structured type. Holds any JSON value (object, array, string, number, null). Use for: raw ingestion, flexible schemas. OBJECT: specifically a key-value map (JSON object). Use for: structured metadata, configuration. ARRAY: ordered list of values. Use for: tags, multi-value attributes. In practice, most teams use VARIANT for everything because it handles all three. OBJECT and ARRAY are useful for: explicit typing in UDFs, clearer schema intent in table definitions, and type checking (IS_OBJECT, IS_ARRAY).
Q: How do you query nested JSON efficiently?
Path notation: SELECT v:user:address:city FROM table. Bracket notation for special chars: v["user-name"]. For arrays: v:items[0]:name (first element). For deep nesting: chain paths v:a:b:c:d. Performance tips: (1) Create views that extract common paths as typed columns for downstream consumers. (2) Use :: for casting: v:price::NUMBER. (3) Filter on extracted paths: WHERE v:status::STRING = "active" leverages micro-partition pruning on the internal columnar representation. (4) For repeated queries on the same paths, materialize into native columns.
Q: What happens when you load Parquet vs JSON vs Avro into Snowflake?
JSON: loaded into VARIANT column (or specific columns via COPY INTO with column mapping). Nested structures preserved. Avro: schema is used to map to Snowflake columns automatically if column names match; otherwise lands in VARIANT ($1). Schemas from Confluent Schema Registry can auto-map. Parquet: columnar format, Snowflake reads column metadata and can map directly to table columns. Most efficient for large datasets. All three: support schema-on-read (load into VARIANT first, extract later) or schema-on-write (map during COPY INTO).
Q: Explain FLATTEN and when you need LATERAL.
FLATTEN converts a VARIANT array or object into rows (one row per element). Usage: SELECT f.value:name FROM table, LATERAL FLATTEN(input => v:items) f. LATERAL is required because FLATTEN references a column from the left table (the VARIANT column). Without LATERAL, the subquery cannot reference the outer table. Key parameters: input (the array/object), path (sub-path within), outer (TRUE = include rows even when array is empty/null, like LEFT JOIN), recursive (TRUE = flatten nested structures), mode (OBJECT/ARRAY/BOTH).
Q: How to handle deeply nested JSON with arrays of objects containing arrays?
Chain FLATTEN calls: SELECT t.id, f1.value:name::STRING AS item_name, f2.value::STRING AS tag FROM table t, LATERAL FLATTEN(input => t.v:orders) f1, LATERAL FLATTEN(input => f1.value:tags) f2. Each FLATTEN produces rows that multiply with the next. For complex structures, use CTEs to flatten one level at a time for readability. Performance consideration: chained FLATTEN on large arrays can produce massive row explosions. Add WHERE filters early to reduce intermediate rows.
Q: Schema-on-read vs materializing into native columns: when each?
Schema-on-read (query VARIANT directly): when schema changes frequently, during exploration/prototyping, for ad-hoc analytics on raw data. Materialize (extract into typed columns via Dynamic Table or view): when schema is stable, for production dashboards (faster queries), for downstream tools that need typed columns (BI tools), when you need clustering on extracted values. Hybrid pattern: land raw in VARIANT, materialize hot-path columns into a typed Dynamic Table, keep VARIANT for ad-hoc exploration.
16MB per VARIANT value. For larger documents, split into chunks before loading. Most JSON records are well under this limit.
No traditional indexes, but: (1) Search Optimization Service supports VARIANT path expressions. (2) Snowflake auto-optimizes internal columnar storage for common paths. (3) Clustering on extracted expressions (CLUSTER BY (v:date::DATE)) works for pruning.