Snowflake Semi-Structured Data — Expert Interview Questions

AdvancedLast updated: 2026-04-27 • 4 sections

Expert questions on VARIANT, OBJECT, ARRAY data types, FLATTEN, LATERAL, JSON/XML/Parquet handling, and schema-on-read patterns.

Key Facts for Semi-Structured Data Interviews

  • VARIANT: universal semi-structured type — holds any JSON value up to 16MB per cell.
  • Path notation: v:user:address:city — no casting needed for string access; use ::type for typed extraction.
  • FLATTEN: lateral view that explodes arrays and objects into rows — the primary tool for normalizing JSON.
  • LATERAL FLATTEN is shorthand for cross joining with a table function inline in the FROM clause.
  • Snowflake auto-detects and column-stores frequently used paths in VARIANT — performance is near-native column speed.
  • PARSE_JSON converts a string to VARIANT; OBJECT_CONSTRUCT / ARRAY_CONSTRUCT build from scalars.
  • GET_IGNORE_CASE: use when JSON keys have inconsistent casing across producers.

VARIANT, OBJECT, and ARRAY Types

Q: How does Snowflake store VARIANT data internally and why is it fast?

Snowflake stores VARIANT in an optimized columnar format. When JSON/Avro/Parquet loads into a VARIANT column, Snowflake analyzes the structure and stores frequently-occurring top-level paths as separate internal columns (columnar optimization). This means v:user:name can prune and scan like a native column — partition min/max metadata is maintained per JSON path. Queries on well-structured, consistent JSON schemas perform nearly as fast as native typed columns.

Q: When do you use VARIANT vs OBJECT vs ARRAY?

VARIANT: the practical default — holds any JSON value (object, array, scalar, null). Use for raw ingestion and flexible schemas. OBJECT: explicitly a key-value map. Use when you want to enforce that the column contains a JSON object and for clarity in UDF type signatures. ARRAY: an ordered list. Use for multi-value attributes where order matters. In practice, most teams use VARIANT everywhere because it handles all three cases; OBJECT and ARRAY add type safety and documentation intent.

Q: How do you query nested JSON paths efficiently in Snowflake?

Path access: v:user:address:city (colon-separated). Bracket notation for special chars or dynamic paths: v["event-type"]. Array element: v:items[0]:name. Deep nesting: v:a:b:c:d. Type casting: v:price::FLOAT, v:id::INT. Performance: frequently used paths get columnar optimization — extract them as typed virtual columns in a view for downstream use. For unknown schema: use FLATTEN + GET to dynamically discover keys.

Q: How do you handle schema evolution in semi-structured data ingestion?

Schema-on-read approach: store raw JSON in a VARIANT staging table, extract paths in downstream views or Dynamic Tables. Evolution is automatic — new keys are accessible via path notation immediately. For structured downstream tables: use OBJECT_CONSTRUCT + FLATTEN to extract known paths, add a catch-all VARIANT column for unknown/future fields. For breaking changes (key rename, type change): version the extraction layer as a new view rather than altering the raw table.

FLATTEN, LATERAL, and Advanced Patterns

Q: How does FLATTEN work and when do you use LATERAL FLATTEN?

FLATTEN is a table function that converts an array or object into a set of rows. SELECT VALUE FROM TABLE(FLATTEN(input => v:items)) returns one row per array element. LATERAL FLATTEN is the inline version: SELECT t.id, f.value FROM my_table t, LATERAL FLATTEN(input => t.tags) f — the LATERAL keyword allows the table function to reference the outer table column. Use FLATTEN for: normalizing arrays into rows, unpacking nested event arrays, exploding product tags or attributes.

Q: How do you aggregate and rebuild JSON structures in SQL?

Build objects: OBJECT_CONSTRUCT('key1', val1, 'key2', val2). Build arrays: ARRAY_AGG(col) to aggregate rows back into an array. Combine: SELECT OBJECT_CONSTRUCT('user_id', id, 'orders', ARRAY_AGG(order_id)) FROM orders GROUP BY id. For nested structures: OBJECT_CONSTRUCT can nest other OBJECT_CONSTRUCT and ARRAY_AGG calls. Use ARRAY_DISTINCT, ARRAY_SORT, ARRAY_SLICE for array manipulation.

Q: How do you load and query Parquet, Avro, and XML in Snowflake?

Parquet/Avro: create a file format (TYPE=PARQUET or TYPE=AVRO), load into a VARIANT column via COPY INTO or Snowpipe. Parquet columns map automatically to VARIANT sub-paths. XML: TYPE=XML in the file format; loads as a VARIANT with XMLGET and XPATH-style path access. For Parquet with a known schema: use the MATCH_BY_COLUMN_NAME option in COPY INTO to load directly into a typed table instead of VARIANT.

Q: What is STRIP_OUTER_ARRAY and when do you need it?

When a JSON file contains a single top-level array ([{...}, {...}]) and you want each element to load as a separate row (not the whole array as one VARIANT), set STRIP_OUTER_ARRAY=TRUE in the JSON file format. Without it, the entire file loads as a single VARIANT cell containing the array. With it, each object in the array becomes a separate row. This is the most common JSON ingestion issue when records come as array-wrapped batches.

Semi-Structured Data Interview Prep Checklist

Frequently Asked Questions

Should you store data as VARIANT or extract to typed columns?

Both, in layers. Ingest raw JSON into a VARIANT staging table for flexibility and resilience to schema changes. Extract frequently queried paths into typed columns in a downstream table or view for query performance and type safety. The two-layer pattern (raw VARIANT → typed extraction) is the production standard: the raw layer is the source of truth, the typed layer is the query-optimized serving layer.

How does Snowflake handle NULL vs missing keys in VARIANT?

JSON null: a JSON key explicitly set to null stores as a SQL NULL in Snowflake. Missing key: accessing a path that does not exist returns SQL NULL (not an error). This means IS NULL is true for both explicitly-null and missing paths. To distinguish: use TYPEOF(v:key) — returns 'null_value' for JSON null and NULL for a missing path. Important for data quality: always check TYPEOF in data validation to differentiate absent from explicitly-nulled fields.

What are the performance limits of VARIANT queries on large tables?

Columnar optimization covers frequently occurring, consistently named top-level paths — these query at near-native column speed. Deeply nested paths (4+ levels) and infrequently occurring keys are stored as raw bytes and scanned linearly — slower. FLATTEN on large arrays generates high row counts and can be memory-intensive. Mitigation: extract the most queried paths into a typed materialized table, use FLATTEN only in staging pipelines not in BI-facing queries, cluster on frequently filtered VARIANT path values by extracting them to a real column.

Related Cheat Sheets

Top 30 Snowflake Interview Questions & AnswersSnowflake Stored Procedures & UDFs — Expert Interview QuestionsSnowflake Query Tuning — Expert Interview Questions
← All Cheat Sheets