Snowpipe Streaming & Kafka — Expert Interview Questions

Advanced • Last updated: 2026-04-27 • 4 sections

Expert questions on Snowpipe Streaming, Kafka connector, classic Snowpipe, channels, and real-time ingestion patterns.

Key Facts for Streaming Ingestion Interviews

  • Classic Snowpipe: file-based, SQS notification, COPY INTO, 30–120s latency. Streaming: row-based SDK, 1–10s latency.
  • Snowpipe Streaming uses the Java Ingest SDK; channels provide ordered, exactly-once delivery per partition.
  • Kafka Connector Streaming mode (SNOWPIPE_STREAMING) maps each Kafka partition to one channel.
  • Exactly-once semantics: insertRows(batch, offsetToken) is atomic; on recovery call getLatestCommittedOffsetToken().
  • Streaming rows land in a temporary micro-partition buffer; final consolidation adds a few seconds.
  • For 50K events/sec with Kafka Connector Streaming, plan roughly one task per 10K eps: tasks.max of 5 at minimum, 6 with headroom.
  • Schema evolution: set snowflake.enable.schematization=true in the connector config (with ENABLE_SCHEMA_EVOLUTION = TRUE on the target table) to auto-add nullable columns from Kafka records.

Streaming Architecture

Q: What is the difference between classic Snowpipe and Snowpipe Streaming?

Classic Snowpipe: file-based. Files land in a stage, an SQS/SNS notification triggers a COPY INTO, latency 30–120s, best for large batches. Snowpipe Streaming: row-based Java Ingest SDK, insertRows() pushes data directly, latency 1–10s, no files or stages, billed for serverless compute and client ingestion time rather than per file. Streaming eliminates the small-file performance penalty for high-frequency, low-volume payloads.

Q: How does the Kafka Connector work in Streaming mode?

Set snowflake.ingestion.method=SNOWPIPE_STREAMING in the connector config. Each Kafka partition maps to one Snowflake channel for ordered, in-sequence delivery. The connector buffers records and calls insertRows() with an offsetToken per batch. If the connector restarts, it calls getLatestCommittedOffsetToken() to resume from the correct offset, achieving exactly-once semantics.
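
A minimal sketch of what such a connector config might look like, built as a Python dict that can be posted to the Kafka Connect REST API as JSON. The connector name, topic, and every value in angle brackets are placeholders assumed for illustration; the function name build_streaming_connector_config is hypothetical.

```python
import json


def build_streaming_connector_config(topic: str, tasks_max: int) -> dict:
    """Build a Kafka Connect config for the Snowflake sink in Streaming mode."""
    return {
        "name": "snowflake-streaming-sink",  # hypothetical connector name
        "config": {
            "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
            # Switches the connector from file-based Snowpipe to Snowpipe Streaming.
            "snowflake.ingestion.method": "SNOWPIPE_STREAMING",
            "tasks.max": str(tasks_max),
            "topics": topic,
            # Placeholders: replace with real account details and credentials.
            "snowflake.url.name": "<account>.snowflakecomputing.com:443",
            "snowflake.user.name": "<user>",
            "snowflake.private.key": "<private-key>",
            "snowflake.role.name": "<role>",
            "snowflake.database.name": "<database>",
            "snowflake.schema.name": "<schema>",
            "key.converter": "org.apache.kafka.connect.storage.StringConverter",
            "value.converter": "org.apache.kafka.connect.json.JsonConverter",
            "value.converter.schemas.enable": "false",
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_streaming_connector_config("events", 6), indent=2))
```

Kafka Connect expects all config values as strings, which is why tasks.max is converted before serialization.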

Q: How does exactly-once delivery work in Snowpipe Streaming?

The channel tracks offsets atomically with data. insertRows(rows, offsetToken) stores the rows and the offset token in one atomic operation. On crash/restart: call getLatestCommittedOffsetToken() to retrieve the last committed offset, then resume producing from offset + 1. Rows that were buffered but not committed before the crash are not in the table, so the source must be replayable from that offset (Kafka is); without replay the pipeline degrades to at-most-once. Keep downstream processing idempotent or deduplicate as a second line of defense.
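
The recovery loop can be illustrated with a small in-memory simulation. This is not the Snowflake SDK: FakeChannel, resume_offset, and ingest are hypothetical stand-ins that only mimic the behavior described above, and offset tokens are assumed to be stringified integer offsets.

```python
from typing import List, Optional


class FakeChannel:
    """In-memory stand-in for a streaming channel (not the real SDK)."""

    def __init__(self) -> None:
        self.rows: List[dict] = []
        self._committed_token: Optional[str] = None

    def insert_rows(self, rows: List[dict], offset_token: str) -> None:
        # Rows and token are stored together, mimicking the atomic commit.
        self.rows.extend(rows)
        self._committed_token = offset_token

    def get_latest_committed_offset_token(self) -> Optional[str]:
        return self._committed_token


def resume_offset(channel: FakeChannel) -> int:
    """Return the first source offset that still has to be sent."""
    token = channel.get_latest_committed_offset_token()
    return 0 if token is None else int(token) + 1


def ingest(channel: FakeChannel, source: List[dict], batch_size: int,
           crash_after_batches: Optional[int] = None) -> None:
    """Send batches from the resume point; optionally crash before a commit."""
    sent = 0
    for start in range(resume_offset(channel), len(source), batch_size):
        if crash_after_batches is not None and sent == crash_after_batches:
            raise RuntimeError("simulated crash before commit")
        batch = source[start:start + batch_size]
        # The token is the offset of the last row in the batch.
        channel.insert_rows(batch, str(start + len(batch) - 1))
        sent += 1


if __name__ == "__main__":
    source = [{"offset": n} for n in range(10)]
    channel = FakeChannel()
    try:
        ingest(channel, source, batch_size=3, crash_after_batches=2)
    except RuntimeError:
        pass  # the third batch was never committed
    ingest(channel, source, batch_size=3)  # restart: replay from the token
    print([r["offset"] for r in channel.rows])
```

Because the restart reads the committed token before sending anything, the uncommitted batch is replayed once and no committed row is sent twice.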

Q: How would you design an ingestion pipeline for 50K events/sec from Kafka?

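50K eps at ~1KB = 50MB/s. Use Kafka Connect Streaming mode with tasks.max=6: five tasks at roughly 10K eps each plus one for headroom, with at least as many topic partitions as tasks. Configure buffer.flush.time=10 (seconds) and buffer.count.records=10000 for batching. CLUSTER BY ingestion timestamp on the target table. Monitor SNOWFLAKE.ACCOUNT_USAGE.SNOWPIPE_STREAMING_CLIENT_HISTORY for client usage, and compare channel offset tokens (SHOW CHANNELS) with Kafka end offsets for lag. Downstream consumption: Dynamic Tables for latency-tolerant transforms, streams + tasks for sub-minute downstream processing.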
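
The sizing arithmetic can be checked with a short calculation. The 10K eps per task figure is the rule of thumb from this sheet, not a measured limit, and the 20% headroom is an assumption chosen for illustration; size_pipeline is a hypothetical helper.

```python
import math


def size_pipeline(events_per_sec: int, avg_event_bytes: int,
                  eps_per_task: int, headroom_pct: int = 20) -> dict:
    """Estimate throughput and connector task count for a target event rate."""
    throughput_mb_s = events_per_sec * avg_event_bytes / 1_000_000
    min_tasks = math.ceil(events_per_sec / eps_per_task)
    # Integer math for the headroom keeps the rounding exact.
    tasks_with_headroom = math.ceil(
        events_per_sec * (100 + headroom_pct) / (100 * eps_per_task)
    )
    return {
        "throughput_mb_s": throughput_mb_s,
        "min_tasks": min_tasks,
        "tasks_with_headroom": tasks_with_headroom,
    }


if __name__ == "__main__":
    # 50K eps at ~1 KB per event, assuming ~10K eps per task.
    print(size_pipeline(50_000, 1_000, 10_000))
```

Sink tasks beyond the topic's partition count sit idle, so the partition count should be at least the chosen tasks.max.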

Q: When should you choose Streaming vs classic Snowpipe vs COPY INTO?

Streaming: sub-30s latency requirement, continuous row-level arrival, small payloads (IoT, clickstream). Classic Snowpipe: file-based arrival (S3/GCS/Azure), large batches, 1–3 min latency acceptable. COPY INTO: one-time bulk loads and backfills, full control over copy options. Streaming avoids the small-file consolidation cost that hurts classic Snowpipe at high frequency.

Advanced Patterns and Operations

Q: How do you handle schema evolution with the Kafka Streaming connector?

Set snowflake.enable.schematization=true in the connector config and ENABLE_SCHEMA_EVOLUTION = TRUE on the target table. When a new field arrives in the Kafka record, the connector automatically issues an ALTER TABLE to add the column as nullable. Prior rows read as NULL in the new column. For breaking changes (column type change, column removal), handle outside the connector via a migration strategy.
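
To make the additive behavior concrete, here is a rough sketch of how new columns could be derived from an incoming record. This is not the connector's implementation: the type mapping in infer_snowflake_type is a simplified assumption, and both function names are hypothetical.

```python
from typing import Any, Dict, List


def infer_snowflake_type(value: Any) -> str:
    """Map a Python value to a Snowflake column type (simplified assumption)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "NUMBER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, (dict, list)):
        return "VARIANT"
    return "VARCHAR"


def plan_column_additions(table: str, existing_columns: List[str],
                          record: Dict[str, Any]) -> List[str]:
    """Return ALTER TABLE statements for record fields missing from the table."""
    existing = {c.upper() for c in existing_columns}
    statements = []
    for field, value in record.items():
        if field.upper() not in existing:
            # New columns carry no NOT NULL constraint, so older rows read as NULL.
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {field.upper()} "
                f"{infer_snowflake_type(value)}"
            )
    return statements


if __name__ == "__main__":
    record = {"id": 1, "ts": "2026-01-01", "country": "DE", "score": 1.5}
    for stmt in plan_column_additions("EVENTS", ["id", "ts"], record):
        print(stmt)
```

Only additions are planned here; type changes and removals are deliberately out of scope, matching the advice to handle breaking changes through a separate migration.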

Q: How do you monitor and troubleshoot Snowpipe Streaming lag?

Check SNOWFLAKE.ACCOUNT_USAGE.PIPE_USAGE_HISTORY for classic Snowpipe. For Streaming: SNOWFLAKE.ACCOUNT_USAGE.SNOWPIPE_STREAMING_CLIENT_HISTORY for client usage, SHOW CHANNELS for the latest committed offset token per channel, SNOWFLAKE.ACCOUNT_USAGE.TASK_HISTORY if using downstream tasks, and client-side metrics via the SDK telemetry. Kafka-side: monitor consumer lag via Kafka Connect JMX metrics. Lag causes: under-provisioned tasks.max, network throttling, warehouse contention on downstream transforms.
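
One way to quantify ingestion lag is to compare each partition's Kafka end offset with the offset token committed on the matching channel. The sketch below assumes the two inputs were already collected (for example from a Kafka admin client and from channel metadata) and that tokens are stringified integer offsets; partition_lag is a hypothetical helper.

```python
from typing import Dict, Optional


def partition_lag(end_offsets: Dict[int, int],
                  committed_tokens: Dict[int, Optional[str]]) -> Dict[int, int]:
    """Rows not yet committed per partition.

    end_offsets: Kafka log end offset per partition (offset of the next write).
    committed_tokens: last committed offset token per channel, None if empty.
    """
    lag = {}
    for partition, end_offset in end_offsets.items():
        token = committed_tokens.get(partition)
        # Assumes the partition log starts at offset 0 when nothing is committed.
        next_needed = 0 if token is None else int(token) + 1
        lag[partition] = max(end_offset - next_needed, 0)
    return lag


if __name__ == "__main__":
    print(partition_lag({0: 100, 1: 50, 2: 10}, {0: "99", 1: "39", 2: None}))
```

A lag that grows on every partition points at overall under-provisioning, while lag isolated to a few partitions more often indicates key skew or a stuck task.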

Q: What are channels in Snowpipe Streaming and how do they affect ordering?

A channel is a named, ordered sequence of rows bound to a single table. Rows within a channel are guaranteed to be delivered in insertion order. Channels are not ordered relative to each other. For Kafka: each partition maps to one channel, preserving Kafka partition ordering. Multiple channels on the same table = parallel ingestion with no cross-channel order guarantee. Design: if end-to-end ordering is required, use a single partition + channel.
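
The ordering guarantee can be illustrated with a toy router that sends each Kafka partition to its own channel. The channel naming scheme and both function names are assumptions made for this sketch, not the connector's actual convention.

```python
from collections import defaultdict
from typing import Dict, List


def channel_name(topic: str, partition: int) -> str:
    """Hypothetical naming scheme: one channel per topic partition."""
    return f"{topic}_{partition}"


def route_to_channels(records: List[dict]) -> Dict[str, List[int]]:
    """Group record offsets by channel, preserving arrival order per channel."""
    channels: Dict[str, List[int]] = defaultdict(list)
    for rec in records:
        channels[channel_name(rec["topic"], rec["partition"])].append(rec["offset"])
    return dict(channels)


if __name__ == "__main__":
    # Records from two partitions arrive interleaved.
    interleaved = [
        {"topic": "events", "partition": 0, "offset": 0},
        {"topic": "events", "partition": 1, "offset": 0},
        {"topic": "events", "partition": 1, "offset": 1},
        {"topic": "events", "partition": 0, "offset": 1},
    ]
    print(route_to_channels(interleaved))
```

Each channel sees its partition's offsets in ascending order, but nothing in the structure says whether a row on one channel landed before or after a row on the other.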

Q: How does Snowpipe Streaming billing differ from classic Snowpipe?

Classic Snowpipe: billed on file ingestion, with per-file overhead plus compute for COPY INTO. Snowpipe Streaming: billed in credits for serverless compute and client ingestion time, not per file. Streaming is more cost-effective for frequent small payloads and high-frequency, low-volume data. Classic Snowpipe wins for large batch files and bulk historical loads. Compare both models for your event volume and size distribution.

Frequently Asked Questions

Can Snowpipe Streaming guarantee exactly-once delivery end to end?

Snowpipe Streaming guarantees exactly-once at the ingestion layer via channel offset tracking. End-to-end exactly-once also requires idempotent downstream consumers — if a task or Dynamic Table processes a row twice due to a retry, the downstream result must be the same. For true end-to-end exactly-once, combine channel offsets with a deduplication key in the target table.
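
A deduplication key makes a retried downstream step harmless. The sketch below models the target table as a dict keyed by a hypothetical event_id column; in Snowflake the same idea would typically be expressed as a MERGE on that key.

```python
from typing import Dict, List


def apply_idempotent(target: Dict[str, dict], rows: List[dict],
                     key: str = "event_id") -> Dict[str, dict]:
    """Insert rows keyed by `key`; rows whose key already exists are skipped."""
    for row in rows:
        # setdefault keeps the first version of a row, so retries are no-ops.
        target.setdefault(row[key], row)
    return target


if __name__ == "__main__":
    batch = [{"event_id": "a", "v": 1}, {"event_id": "b", "v": 2}]
    table: Dict[str, dict] = {}
    apply_idempotent(table, batch)
    apply_idempotent(table, batch)  # simulated retry of the same batch
    print(len(table))
```

Applying the same batch twice leaves the table unchanged, which is the property a retried task or refresh needs.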

What happens to in-flight streaming data if Snowflake has an outage?

In-flight rows not yet committed are held in the Snowpipe Streaming buffer. On recovery, the SDK resumes from the last committed offset. For Kafka Connector, the connector restarts and calls getLatestCommittedOffsetToken() to determine the resume point. Kafka itself retains unconsumed messages, so no data is lost as long as Kafka retention covers the outage window.

When should you use Snowpipe Streaming vs Kafka Direct Ingest vs Spark Streaming?

Snowpipe Streaming: best when Snowflake is the primary destination, low latency needed, and you want minimal infrastructure. Kafka Direct Ingest (Kafka Connector Streaming): best when Kafka is already your event bus and you need fan-out to multiple consumers. Spark Streaming: best when you need complex event processing, aggregations, or enrichment before landing data in Snowflake. Cost and operational complexity increase in the same order.
