Snowflake Streams & Tasks — Expert Interview Questions

Advanced • Last updated: 2026-04-09 • 3 sections

Expert-level interview questions on Snowflake Streams (CDC), Tasks, task graphs, and change data capture patterns. Covers offset tracking, CHANGES clause, and production DAG alternatives.

Streams — Change Data Capture

Q: How does a Snowflake stream track changes internally, and what happens if you query a stream without consuming it?

A stream stores an offset — a logical pointer into the table's version history (the change tracking metadata Snowflake records on micro-partitions). The stream's contents are the net changes between that offset and the current table version. Querying a stream with SELECT is non-destructive: it shows pending changes but does NOT advance the offset. Only DML that reads the stream (e.g., INSERT INTO … SELECT FROM stream) inside a committed transaction advances the offset. If nothing consumes the stream, changes accumulate until the retention window is exceeded (bounded by the table's data retention plus MAX_DATA_EXTENSION_TIME_IN_DAYS), at which point the stream becomes stale and must be recreated.
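A minimal sketch of the consumption semantics (table and stream names are illustrative):

```sql
-- Create a stream on a source table (implicitly enables change tracking)
CREATE OR REPLACE STREAM orders_stream ON TABLE orders;

-- Non-destructive: shows pending changes, the offset does NOT move
SELECT * FROM orders_stream;

-- Consuming DML: the offset advances only when this transaction commits
BEGIN;
INSERT INTO orders_staging
  SELECT * FROM orders_stream;   -- reads AND consumes the stream
COMMIT;

-- Empty until new DML lands on orders
SELECT COUNT(*) FROM orders_stream;
```

If the transaction rolls back, the offset does not move and the changes remain visible on the next read.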

Q: Explain the difference between standard, append-only, and insert-only streams. When would you choose each?

Standard streams capture INSERTs, UPDATEs, and DELETEs with METADATA$ACTION and METADATA$ISUPDATE columns — use for full CDC on mutable tables. Append-only streams capture only INSERT operations and ignore UPDATEs/DELETEs — ideal for immutable event/log tables where you only care about new rows (more efficient, smaller change set). Insert-only streams are specifically for external tables — they track new files added to the external stage. Choose standard for SCD-2 patterns, append-only for event sourcing, insert-only for external table ingestion pipelines.
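The three variants differ only in their creation options — a quick sketch (names illustrative):

```sql
-- Standard: full CDC (INSERT/UPDATE/DELETE) with METADATA$ACTION / METADATA$ISUPDATE
CREATE STREAM orders_std ON TABLE orders;

-- Append-only: INSERTs only; UPDATEs and DELETEs are ignored
CREATE STREAM events_ao ON TABLE events APPEND_ONLY = TRUE;

-- Insert-only: external tables only; tracks newly added files
CREATE STREAM ext_io ON EXTERNAL TABLE ext_events INSERT_ONLY = TRUE;
```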

Q: A stream on a 500M-row table shows 200M changes after a weekend. The downstream MERGE takes 45 minutes. How do you optimize this?

Several approaches: (1) Increase task frequency — run every 5 min instead of hourly so change sets stay small. (2) Use a dedicated, right-sized warehouse (MEDIUM or LARGE) for the MERGE to avoid spilling. (3) Process in partitions — but do NOT simply add a WHERE clause to the MERGE's read of the stream: any DML that consumes a stream advances the offset for ALL rows, so the filtered-out changes would be silently lost. Instead, land the whole stream into a staging table with one INSERT, then MERGE the staging table in date/region slices. (4) If most changes are inserts, switch to an append-only stream + plain INSERT (skip MERGE overhead). (5) Check whether the 200M changes come from a single cascading bulk operation — when changes exceed roughly 30% of table size, a full rebuild (CTAS) is often faster than MERGE. (6) For the MERGE itself, ensure the target table is well-clustered on the join key.
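The safe partitioned-MERGE pattern from point (3) can be sketched as follows — table names, columns, and the `:day` binding are hypothetical, and the MERGE branches are simplified:

```sql
-- 1) One consuming DML lands the entire change set (offset advances once)
CREATE TABLE IF NOT EXISTS changes_staging AS
  SELECT * FROM big_table_stream WHERE FALSE;  -- clone stream shape, incl. METADATA$ columns
INSERT INTO changes_staging SELECT * FROM big_table_stream;

-- 2) MERGE the staging table in slices; re-runnable, no offset risk
MERGE INTO target t
USING (SELECT * FROM changes_staging WHERE event_date = :day) s
  ON t.id = s.id
WHEN MATCHED AND s."METADATA$ACTION" = 'DELETE' THEN DELETE
WHEN MATCHED AND s."METADATA$ISUPDATE" THEN UPDATE SET t.payload = s.payload
WHEN NOT MATCHED AND s."METADATA$ACTION" = 'INSERT'
  THEN INSERT (id, payload) VALUES (s.id, s.payload);
```

Truncate `changes_staging` after all slices have merged successfully.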

Q: How do you handle streams on tables with schema evolution (ALTER TABLE ADD COLUMN)?

Streams survive ALTER TABLE ADD COLUMN — the stream automatically includes the new column in its output. However, if the downstream consumer (INSERT INTO target SELECT FROM stream) has a fixed column list, it will break. Best practice: use INSERT INTO target SELECT * or dynamically generate the column list. For DROP COLUMN, the stream reflects the removal. Critical gotcha: RECREATING a table (CREATE OR REPLACE TABLE) invalidates all streams on that table — they become stale. Use ALTER TABLE for schema changes, never CREATE OR REPLACE on tables with active streams.
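A short illustration of the safe versus stream-breaking schema changes (names illustrative):

```sql
-- Safe: the stream survives and its output includes the new column
ALTER TABLE orders ADD COLUMN discount NUMBER;

-- Dangerous: replacing the table invalidates (stales) every stream on it
-- CREATE OR REPLACE TABLE orders AS SELECT ...;   -- avoid on streamed tables

-- Check for staleness after any structural change
SHOW STREAMS LIKE 'orders_stream';   -- inspect the "stale" column
```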

Q: Can you create a stream on a view? What about a stream on a shared table?

Yes — you can create a stream on a view (including secure views), which tracks changes to the underlying base tables. This is powerful for multi-table CDC: define a view joining several tables and put the stream on the view. For shared tables (Snowflake Data Sharing), consumer accounts CAN create streams on shared tables and shared secure views, but only if the provider has enabled change tracking on the shared object; without that, stream creation fails. Fallbacks when the provider has not enabled change tracking: a task that copies new data from the share into a local table you stream on, or ask the provider to enable change tracking so the CHANGES clause or a stream can be used directly.
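A multi-table CDC sketch using a stream on a view (illustrative names; all base tables need change tracking, which stream creation enables if you have the privilege):

```sql
CREATE VIEW order_details_v AS
  SELECT o.order_id, o.status, c.customer_name
  FROM orders o
  JOIN customers c ON o.customer_id = c.customer_id;

-- The stream reflects changes to BOTH underlying tables
CREATE STREAM order_details_stream ON VIEW order_details_v;
```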

Q: Explain the CHANGES clause and how it differs from streams.

The CHANGES clause (SELECT * FROM table CHANGES(INFORMATION => DEFAULT) AT(OFFSET => -3600)) provides ad-hoc CDC without creating a stream object. It uses the same internal change tracking but doesn't maintain persistent offsets. Key differences: (1) CHANGES is stateless — you must manage offsets yourself (e.g., store last-processed timestamp). (2) Streams are stateful — they auto-track consumption position. (3) CHANGES requires change tracking enabled (ALTER TABLE SET CHANGE_TRACKING = TRUE). (4) Streams implicitly enable change tracking. Use CHANGES for one-off auditing; use streams for production CDC pipelines.
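An ad-hoc audit query with the CHANGES clause might look like this (table name illustrative; change tracking must have been enabled before the requested point in time):

```sql
ALTER TABLE orders SET CHANGE_TRACKING = TRUE;

-- You manage the offset yourself, e.g. via a session variable
SET ts = (SELECT DATEADD(hour, -1, CURRENT_TIMESTAMP()));

SELECT *   -- includes METADATA$ACTION / METADATA$ISUPDATE / METADATA$ROW_ID
FROM orders
  CHANGES(INFORMATION => DEFAULT)
  AT(TIMESTAMP => $ts);
```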

Tasks & Task Graphs

Q: Design a task graph that processes three streams into a single target table with exactly-once guarantees.

Create a root task with WHEN SYSTEM$STREAM_HAS_DATA('stream_a') OR SYSTEM$STREAM_HAS_DATA('stream_b') OR SYSTEM$STREAM_HAS_DATA('stream_c'). The root task MERGEs stream_a into the target; child_1 AFTER root merges stream_b; child_2 AFTER child_1 merges stream_c. One common misconception to avoid: a task graph run is NOT a single transaction — each task executes its SQL in its own transaction. Exactly-once still holds per stream, because a stream's offset advances only when the transaction containing its consuming DML commits: if child_2 fails, stream_c's offset has not moved, and the next run (or an automatic retry) sees the same unconsumed changes. Pair this with idempotent MERGEs so a retried task cannot double-apply rows.
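A skeleton of the graph described above — warehouse, stream, and column names are illustrative, and each MERGE is simplified to its update/insert branches:

```sql
CREATE TASK root_task
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('stream_a')
    OR SYSTEM$STREAM_HAS_DATA('stream_b')
    OR SYSTEM$STREAM_HAS_DATA('stream_c')
AS
  MERGE INTO target t USING stream_a s ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.val = s.val
  WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);

CREATE TASK child_1 WAREHOUSE = etl_wh AFTER root_task
AS
  MERGE INTO target t USING stream_b s ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.val = s.val
  WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);

CREATE TASK child_2 WAREHOUSE = etl_wh AFTER child_1
AS
  MERGE INTO target t USING stream_c s ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.val = s.val
  WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);

-- Resume the whole graph in one call (children first, root last)
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('root_task');
```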

Q: What is the difference between a serverless task and a user-managed warehouse task? When do you choose each?

User-managed tasks run on a specified warehouse (WAREHOUSE = wh_name) — you control the size and pay for warehouse uptime, which is billed with a 60-second minimum each time the warehouse resumes. Serverless tasks (no WAREHOUSE clause) run on Snowflake-managed compute: Snowflake sizes the compute automatically and bills per-second of actual execution with no 60-second resume minimum, which usually makes serverless the cheaper option for frequent, tiny tasks. Choose serverless for lightweight/short tasks (< 1 min), spiky or variable workloads, and operational simplicity. Choose user-managed for heavy transformations, or when a dedicated warehouse is already running for other work (the task piggybacks on compute you are paying for anyway). Note that serverless compute is billed at a feature-specific credit rate, so benchmark both modes for long-running, heavy tasks.
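The two modes differ only in the compute clauses — a sketch with illustrative names:

```sql
-- User-managed: runs on your warehouse, on your billing terms
CREATE TASK nightly_rollup
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  CALL rebuild_rollups();   -- hypothetical stored procedure

-- Serverless: omit WAREHOUSE; optionally hint the starting size
CREATE TASK tiny_cdc
  SCHEDULE = '1 MINUTE'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  INSERT INTO orders_staging SELECT * FROM orders_stream;
```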

Q: A task graph has 5 tasks forming a diamond dependency (A → B, A → C, B → D, C → D, D → E). Task C fails. What happens to D and E?

When C fails, task D does NOT execute: D depends on both B and C (AFTER B, C), and a task runs only if ALL of its predecessors succeeded. Task E is skipped too, since it depends on D. The graph run is marked failed. By default, the next scheduled run starts fresh from task A; however, if TASK_AUTO_RETRY_ATTEMPTS is set on the root task, Snowflake automatically retries the run from the failed task. To handle failures gracefully: (1) set SUSPEND_TASK_AFTER_NUM_FAILURES to auto-suspend after repeated failures, (2) set ERROR_INTEGRATION for SNS/email alerts, (3) make every task idempotent so retries and fresh runs are safe.
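The resilience settings mentioned above attach to the root task — a sketch (task and integration names are hypothetical; a task must be suspended before it can be altered):

```sql
ALTER TASK task_a SUSPEND;
ALTER TASK task_a SET
  SUSPEND_TASK_AFTER_NUM_FAILURES = 3,      -- auto-suspend after repeated failures
  TASK_AUTO_RETRY_ATTEMPTS = 2,             -- retry the run from the failed task
  ERROR_INTEGRATION = my_notification_int;  -- hypothetical notification integration
ALTER TASK task_a RESUME;
```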

Q: How do you monitor and debug task failures in production?

Multiple approaches: (1) TASK_HISTORY() table function — shows execution status, error messages, duration, scheduled vs actual time. Query: SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY()) WHERE STATE = 'FAILED' ORDER BY SCHEDULED_TIME DESC. (2) TASK_DEPENDENTS() — shows the task graph structure. (3) ERROR_INTEGRATION — sends failure notifications to an SNS topic or email. (4) QUERY_HISTORY filtered by QUERY_TAG (tasks auto-tag their queries). (5) Set up a monitoring task that queries TASK_HISTORY every hour and alerts on failures via external function. For debugging, check the error_message column in TASK_HISTORY and correlate with QUERY_HISTORY for the actual SQL error.
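A typical triage query over TASK_HISTORY, expanded from the inline example above (time range and limit are illustrative):

```sql
-- Recent failures with error details and duration
SELECT name, state, error_code, error_message,
       scheduled_time, completed_time,
       DATEDIFF('second', query_start_time, completed_time) AS run_secs
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
       SCHEDULED_TIME_RANGE_START => DATEADD(day, -1, CURRENT_TIMESTAMP()),
       RESULT_LIMIT => 100))
WHERE state = 'FAILED'
ORDER BY scheduled_time DESC;
```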

Q: Can tasks call stored procedures? How do you pass parameters between tasks in a graph?

Yes — a task's body can be CALL my_procedure(args). For passing data between tasks, Snowflake offers one narrow native mechanism: a predecessor can call SYSTEM$SET_RETURN_VALUE('some string') from within its stored procedure, and a direct child can read it with SYSTEM$GET_PREDECESSOR_RETURN_VALUE() — a single string value, suitable for status flags or watermarks but not datasets. Beyond that, the workarounds are: (1) write output to a control/metadata table that downstream tasks read; (2) orchestrate multiple steps inside one stored procedure and use variables there; (3) stream-based hand-off — task A writes to a table, task B consumes a stream on that table; (4) for simple result inspection within a procedure, a RESULT_SCAN pattern. The lack of rich inter-task data passing (Airflow-XCom style) is a key reason many teams keep complex orchestration in Airflow/dbt and reserve tasks for simple CDC patterns.
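The return-value mechanism can be sketched as below — procedure, task, and table names are hypothetical, and SYSTEM$SET_RETURN_VALUE must be invoked from inside a stored procedure run by the predecessor task:

```sql
CREATE OR REPLACE PROCEDURE set_watermark()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
  -- Value visible to direct child tasks only (single string, small payload)
  CALL SYSTEM$SET_RETURN_VALUE('watermark=2026-04-09');
  RETURN 'ok';
END;
$$;

CREATE TASK parent_task WAREHOUSE = etl_wh SCHEDULE = '60 MINUTE'
AS CALL set_watermark();

CREATE TASK child_task WAREHOUSE = etl_wh AFTER parent_task
AS INSERT INTO audit_log
   SELECT SYSTEM$GET_PREDECESSOR_RETURN_VALUE(), CURRENT_TIMESTAMP();
```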


Frequently Asked Questions

What is the maximum number of tasks in a task graph?

A task graph (DAG) can have up to 1,000 tasks. However, practical graphs rarely exceed 10-20 tasks. Each task can have multiple predecessors (up to 100 AFTER dependencies). The graph must be a DAG — no cycles allowed. Tip: if your graph exceeds ~15 tasks, consider whether Airflow or dbt would be a better orchestration choice.

Can I use streams and tasks for real-time data processing?

Near-real-time, not truly real-time. Tasks have a minimum schedule interval of 1 minute. Combined with stream consumption and MERGE execution time, end-to-end latency is typically 2-5 minutes. For sub-second latency, use Snowpipe Streaming or Kafka directly. Streams + tasks are best for "micro-batch" CDC patterns where 1-5 minute latency is acceptable.

Related Cheat Sheets

Top 30 Snowflake Interview Questions & Answers
Snowflake Dynamic Tables — Expert Interview Questions
Snowpipe Streaming & Kafka — Expert Interview Questions