Snowflake Snowpark — Expert Interview Questions

Advanced • Last updated: 2026-04-09 • 3 sections

Expert interview questions on Snowpark DataFrame API, Python/Java/Scala UDFs, UDTFs, stored procedures, and when to use Snowpark vs pure SQL.

DataFrame API & Execution Model

Q: How does Snowpark's DataFrame API differ from PySpark's, and where does code actually execute?

Snowpark DataFrames are lazily evaluated, similar to PySpark, but with a critical difference: Snowpark translates DataFrame operations into SQL that executes on Snowflake's compute engine — not on a Spark cluster. Your Python code runs client-side (or in a stored procedure's sandbox), but the heavy computation happens on the warehouse. PySpark runs on a separate Spark cluster with its own executors. Implications: (1) Snowpark leverages Snowflake's optimizer, caching, and pruning. (2) You can't use arbitrary Python libraries in DataFrame transformations — only in UDFs/UDTFs. (3) DataFrame operations map to SQL (filter → WHERE, group_by → GROUP BY), so anything SQL can't express requires a UDF escape hatch.

Q: Explain lazy evaluation in Snowpark. What triggers execution, and how does this affect debugging?

Snowpark DataFrames build a logical plan (DAG of operations) without executing anything until an "action" method is called. Actions include: .collect(), .show(), .count(), .to_pandas(), .write.save_as_table(), .create_or_replace_view(). Transform methods (.filter(), .select(), .join(), .group_by()) are lazy. This means chaining 20 transformations is free until you trigger execution. Debugging implication: you can call df.explain() to see the generated SQL plan without executing it. Common mistake: calling .collect() in a loop (triggers N separate queries instead of one). For debugging, use df.queries to inspect the SQL Snowpark will generate.
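As a sketch of the lazy/action split (assuming snowflake-snowpark-python is installed and a configured session exists; the ORDERS table and its column names are illustrative):

```python
def build_pipeline(session):
    """Builds a logical plan only -- no query runs until an action is called."""
    from snowflake.snowpark.functions import col  # deferred: needs snowflake-snowpark-python

    return (
        session.table("ORDERS")
        .filter(col("STATUS") == "SHIPPED")   # lazy: becomes WHERE
        .group_by("REGION")                   # lazy: becomes GROUP BY
        .count()                              # lazy: still just plan-building
    )

# With a live session (requires credentials):
# df = build_pipeline(session)        # still no query executed
# print(df.queries["queries"][0])     # inspect the SQL Snowpark will generate
# rows = df.collect()                 # action: one query runs on the warehouse
```

Keeping the plan-building in a plain function like this also makes the pipeline easy to unit-test and reuse before any warehouse credits are spent.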

Q: You need to apply a complex Python function (using pandas, scikit-learn) to each group of rows in a Snowflake table. How do you architect this in Snowpark?

Use a vectorized UDTF (User-Defined Table Function) with partitioning. Steps: (1) Define a UDTF class with process() and end_partition() methods. (2) Register it with session.udtf.register(). (3) Call it with: df.join_table_function(my_udtf(col('feature')).over(partition_by=col('group_key'))). The OVER(PARTITION BY) clause sends each group to a separate Python sandbox where scikit-learn/pandas run locally on that partition's data. Alternative: for simpler transformations, use a vectorized UDF (@udf with max_batch_size) that receives pandas Series and returns pandas Series. Key consideration: the Python sandbox has memory limits (~4-8GB depending on warehouse size) — partition your data small enough to fit in memory.
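The steps above can be sketched with a minimal UDTF handler class. The class name, output columns, and the statistics computed are illustrative stand-ins for real pandas/scikit-learn logic:

```python
class GroupStats:
    """UDTF handler: buffers a partition's rows, then computes on the whole group."""

    def __init__(self):
        self.values = []

    def process(self, feature):
        # called once per row within a partition; buffer the value
        self.values.append(feature)
        return iter(())  # emit no per-row output

    def end_partition(self):
        # called once per partition -- run pandas/scikit-learn here on the full group
        n = len(self.values)
        mean = sum(self.values) / n if n else 0.0
        yield (n, mean)

# Registration and call with a live session (table/column names are assumptions):
# from snowflake.snowpark.functions import col
# from snowflake.snowpark.types import StructType, StructField, IntegerType, FloatType
# stats = session.udtf.register(
#     GroupStats,
#     output_schema=StructType([StructField("N", IntegerType()),
#                               StructField("MEAN", FloatType())]),
#     input_types=[FloatType()],
# )
# df.join_table_function(stats(col("FEATURE")).over(partition_by=col("GROUP_KEY")))
```

Because the handler is plain Python, the per-partition logic can be tested locally by instantiating the class and calling process()/end_partition() directly.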

Q: What is the difference between a scalar UDF, a vectorized UDF, a UDTF, and a stored procedure in Snowpark?

Scalar UDF: takes a single row's inputs and returns a single value, called per row in SQL (SELECT my_udf(col) FROM table). Vectorized UDF: same call interface, but the handler receives and returns pandas Series in batches — often 10-100x faster than scalar Python UDFs because serialization overhead is amortized across the batch. Prefer vectorized for Python UDFs. UDTF: takes rows in, returns zero or more rows out (a table function), with a per-row process() method and a per-partition end_partition() method. Used for: exploding data, aggregations returning multiple rows, ML inference per group. Stored procedure: runs a full program (Python/SQL/JavaScript) that can execute multiple SQL statements, manage transactions, and perform DDL. Called with CALL. Use for: ETL orchestration, admin tasks, multi-step workflows.
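A minimal sketch of the scalar-vs-vectorized handler shapes (the function names and the 1.08 tax rate are illustrative; registration requires snowflake-snowpark-python and a live session):

```python
import pandas as pd

def add_tax_scalar(amount: float) -> float:
    # scalar handler: invoked once per row
    return amount * 1.08

def add_tax_vectorized(amounts: pd.Series) -> pd.Series:
    # vectorized handler: invoked once per batch, as a pandas Series in/out
    return amounts * 1.08

# Registration sketch with a live session:
# from snowflake.snowpark.functions import udf, pandas_udf
# from snowflake.snowpark.types import FloatType
# tax_s = udf(add_tax_scalar, return_type=FloatType(), input_types=[FloatType()])
# tax_v = pandas_udf(add_tax_vectorized, return_type=FloatType(),
#                    input_types=[FloatType()], max_batch_size=10_000)
```

The two handlers compute the same result; the vectorized version simply lets pandas apply the operation to a whole batch at once instead of crossing the Python call boundary per row.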

Q: How do you handle third-party Python packages (e.g., xgboost, requests) in Snowpark UDFs?

Three methods: (1) Anaconda channel — Snowflake includes popular packages (numpy, pandas, scikit-learn, xgboost) via the Anaconda partnership. Specify with @udf(packages=['xgboost==1.7.6']). Check availability in INFORMATION_SCHEMA.PACKAGES. (2) Stage upload — for packages not in Anaconda, upload wheel or zip files to a stage and add with imports=['@stage/mypackage.zip']. (3) Session-level defaults — call session.add_packages(...) or session.add_requirements('requirements.txt') so every UDF registered in that session picks up the same dependencies. Gotchas: (1) packages must be pure Python or ship pre-compiled binaries for Linux x86_64 (Snowflake's sandbox OS). (2) No network access from UDFs — you can't call external APIs (use external functions or external access integrations instead). (3) Package resolution happens at UDF creation, not at execution time, so the first call doesn't pay an install cost.
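A hedged sketch of method (1). The pinned versions are assumptions — check what the Anaconda channel offers for your account before registering:

```python
# Packages requested from Snowflake's Anaconda channel (pins are illustrative):
packages = ["xgboost==1.7.6", "scikit-learn"]

# With a live session, check availability first (server-side query):
# session.sql(
#     "select * from information_schema.packages "
#     "where language = 'python' and package_name = 'xgboost'"
# ).show()
#
# from snowflake.snowpark.functions import udf
# from snowflake.snowpark.types import FloatType
#
# @udf(packages=packages, return_type=FloatType(), input_types=[FloatType()])
# def score(x: float) -> float:
#     import xgboost  # resolved inside the UDF sandbox, not on the client
#     return float(x)  # placeholder body
```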

Production Patterns & Performance

Q: When should you use Snowpark Python vs pure SQL? Give specific criteria.

Use pure SQL when: (1) the transformation is expressible in SQL (joins, aggregations, window functions — covers 80%+ of data engineering work). (2) Performance is critical — SQL almost always beats Python UDFs because it avoids serialization overhead entirely. (3) The logic is simple enough that SQL stays readable. Use Snowpark Python when: (1) ML inference/training (scikit-learn, xgboost models). (2) Complex string/regex processing that's awkward in SQL. (3) Calling Python libraries with no SQL equivalent (e.g., IP geolocation, custom parsers). (4) Building reusable pipeline frameworks where Python's abstraction capabilities are valuable. Anti-pattern: don't rewrite simple SQL as Snowpark DataFrames for "modernity" — it adds complexity without benefit.

Q: A vectorized UDF processing 100M rows is slow. How do you diagnose and optimize?

(1) Check warehouse size — Python UDFs run in sandboxes constrained by warehouse resources. Scale up to LARGE or XLARGE for more parallelism and memory. (2) Check batch size — use the max_batch_size parameter to control how many rows per pandas batch. Too small = serialization overhead; too large = memory pressure. Start around 10,000 and tune. (3) Profile the Python code — is it CPU-bound (computation) or memory-bound (large objects per row)? (4) Check if the UDF can be replaced with SQL — even complex regex can sometimes be done with REGEXP_REPLACE. (5) Consider a UDTF with partitioning to parallelize across groups. (6) For ML models: serialize the model to a stage and load it once in __init__, not in process(). (7) Keep the surrounding pipeline in Snowflake's engine — avoid .collect() or .to_pandas() round-trips to the client before or after the UDF.
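The "load the model once" pattern from point (6) can be sketched as follows. The stand-in lambda replaces a real fitted model; in an actual UDF the load step would unpickle a file staged via imports=[...]:

```python
import functools

@functools.lru_cache(maxsize=1)
def load_model():
    # In a real UDF, unpickle a model staged via imports=[...]; inside the
    # sandbox, sys._xoptions["snowflake_import_directory"] gives the file path.
    return lambda v: v * 2.0  # stand-in for a fitted model's predict

def score_batch(values):
    model = load_model()  # cached: the load cost is paid once per sandbox
    return [model(v) for v in values]
```

The cache makes the expensive deserialization a one-time cost per Python sandbox rather than a per-batch (or worse, per-row) cost.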

Expert Snowpark Tips

Frequently Asked Questions

Can Snowpark replace PySpark for data engineering?

For Snowflake-centric workloads, largely yes. Snowpark handles 90%+ of what PySpark does for ETL: DataFrames, UDFs, window functions, joins. Advantages over PySpark: no Spark cluster to manage, automatic optimization by Snowflake, direct access to Snowflake data. Limitations: no RDD-level control, no custom partitioners, limited to Snowflake's execution engine. If your data lives in Snowflake, Snowpark eliminates the need for a separate Spark cluster.

Is Snowpark available in dbt?

Yes. dbt-snowflake supports Snowpark Python models (dbt v1.3+). You define a model as a Python file instead of SQL, and dbt executes it as a Snowpark stored procedure. Use for: ML feature engineering, complex transformations requiring Python. Keep in mind: Python models are slower than SQL models and don't support incremental materialization as cleanly.
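A minimal sketch of what such a model file looks like (the file path, ref'd model name, and column name are assumptions):

```python
# Sketch of a dbt Python model file, e.g. models/customer_features.py
# (dbt-snowflake >= 1.3). dbt calls model() with its context and a Snowpark session.
def model(dbt, session):
    dbt.config(materialized="table", packages=["scikit-learn"])
    orders = dbt.ref("stg_orders")                 # upstream model as a Snowpark DataFrame
    return orders.group_by("CUSTOMER_ID").count()  # returned DataFrame is materialized
```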

Related Cheat Sheets

Top 30 Snowflake Interview Questions & Answers
Snowflake Cortex AI & ML — Expert Interview Questions
Snowflake Stored Procedures & UDFs — Expert Interview Questions
Snowflake External Functions & Integrations — Expert Interview Questions