Data Engineering Interview Questions & Answers

Advanced • Last updated: 2026-04-09 • 4 sections

Comprehensive guide to data engineering interview questions covering ETL pipelines, data modeling, orchestration, cloud platforms, and system design.

ETL & Pipeline Design

Q: What is the difference between ETL and ELT?

ETL (Extract, Transform, Load) transforms data before loading into the warehouse — common with on-prem systems where compute was limited. ELT (Extract, Load, Transform) loads raw data first, then transforms using warehouse compute (Snowflake, BigQuery). ELT is the modern standard because cloud warehouses offer cheap, elastic compute. ELT also preserves raw data for reprocessing and simplifies pipeline architecture.
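The ELT flow above can be sketched in a few lines. This is a toy illustration using Python's `sqlite3` as a stand-in for a cloud warehouse (which would normally have its JSON functions and elastic compute); table and column names here are hypothetical, and it assumes a SQLite build with the JSON1 functions, which ships with modern Python:

```python
import sqlite3
import json

# ELT sketch: load raw JSON untouched first, transform inside the "warehouse".
# sqlite3 stands in for Snowflake/BigQuery; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")  # Extract + Load: raw, as-is

events = [
    {"order_id": 1, "amount": "19.99", "status": "paid"},
    {"order_id": 2, "amount": "5.00", "status": "refunded"},
]
conn.executemany("INSERT INTO raw_events VALUES (?)",
                 [(json.dumps(e),) for e in events])

# The "T" happens last, using warehouse SQL over the raw layer:
conn.execute("""
    CREATE TABLE orders AS
    SELECT json_extract(payload, '$.order_id') AS order_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount,
           json_extract(payload, '$.status') AS status
    FROM raw_events
""")
rows = conn.execute(
    "SELECT order_id, amount FROM orders WHERE status = 'paid'").fetchall()
```

Because `raw_events` is never mutated, the `orders` transform can be rerun or rewritten at any time, which is the reprocessing benefit ELT provides.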

Q: How do you handle schema evolution in a data pipeline?

Strategies: (1) Schema-on-read with VARIANT/JSON columns (Snowflake handles this well). (2) dbt with on_schema_change='sync_all_columns' for incremental models. (3) Schema registries (Confluent for Kafka) to enforce compatibility. (4) Landing raw data as-is, then applying schema validation in a staging layer. The key principle: never let a schema change break downstream consumers. Use additive changes (add columns) and avoid destructive changes (rename/drop columns).
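Strategy (4) can be sketched as a tolerant staging validator: required fields are enforced, while unknown new columns pass through untouched so additive producer changes never break the pipeline. Field names are hypothetical:

```python
import json

# Staging-layer validation sketch: enforce the contract's required fields,
# but tolerate additive schema evolution (new fields survive for downstream use).
REQUIRED = {"event_id": str, "ts": str}  # illustrative contract

def stage_record(raw: str) -> dict:
    rec = json.loads(raw)
    for field, typ in REQUIRED.items():
        if not isinstance(rec.get(field), typ):
            raise ValueError(f"bad or missing required field: {field}")
    return rec  # extra/new columns pass through untouched

row = stage_record('{"event_id": "e1", "ts": "2024-01-01", "new_col": 42}')
```

Rejecting only on missing required fields, rather than on any unexpected field, is what makes additive changes safe here.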

Q: Design a real-time data pipeline for an e-commerce platform.

Architecture: (1) Event capture with Kafka/Kinesis from application events (clicks, orders, payments). (2) Stream processing with Flink or Kafka Streams for real-time transformations (dedup, enrichment, sessionization). (3) Dual sink: real-time layer (Redis/DynamoDB for live dashboards) + batch layer (S3/GCS → Snowpipe → Snowflake for analytics). (4) dbt for batch transformations on the warehouse side. (5) Orchestration with Airflow for batch ETL; event-driven for streaming. Key considerations: idempotency, exactly-once semantics, dead-letter queues for failures.
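The idempotency consideration can be shown with a minimal dedup consumer: because Kafka/Kinesis typically deliver at-least-once, the sink must produce the same result when a batch is replayed. A real pipeline would keep `seen` in a persistent state store (e.g. Flink state or Redis); this in-memory sketch uses illustrative names:

```python
# Idempotent consumer sketch: dedup by event_id so redelivered batches
# don't double-count. "seen" stands in for a persistent state store.
def process(events, seen=None):
    seen = set() if seen is None else seen
    total = 0
    for e in events:
        if e["event_id"] in seen:
            continue  # duplicate delivery: skip to keep the sink idempotent
        seen.add(e["event_id"])
        total += e["amount"]
    return total, seen

batch = [{"event_id": "a", "amount": 10}, {"event_id": "b", "amount": 5}]
total, seen = process(batch)
replay_total, _ = process(batch, seen)  # a replayed batch adds nothing
```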

Q: How do you ensure data quality in production pipelines?

Multi-layered approach: (1) Schema validation at ingestion (reject malformed records). (2) dbt tests on every model: not_null, unique, accepted_values, relationships. (3) Great Expectations or dbt-expectations for statistical tests (row count anomalies, distribution shifts). (4) Freshness checks (dbt source freshness, Airflow sensors). (5) Alerting on failures via Slack/PagerDuty. (6) Data contracts between producers and consumers defining expected schemas and SLAs.
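Layers (2) and (3) can be hand-rolled to show what the dbt and Great Expectations checks actually compute: `not_null`, `unique`, `accepted_values`, plus a crude row-count anomaly test. Column names and the tolerance threshold are illustrative:

```python
# Hand-rolled equivalents of common dbt-style tests plus a row-count
# anomaly check. Column names and thresholds are illustrative.
def run_checks(rows, expected_count, tolerance=0.5):
    failures = []
    ids = [r.get("id") for r in rows]
    if any(i is None for i in ids):
        failures.append("not_null: id")
    if len(ids) != len(set(ids)):
        failures.append("unique: id")
    if any(r.get("status") not in {"paid", "refunded"} for r in rows):
        failures.append("accepted_values: status")
    if abs(len(rows) - expected_count) > tolerance * expected_count:
        failures.append("row_count_anomaly")  # e.g. upstream load half-failed
    return failures

clean = [{"id": 1, "status": "paid"}, {"id": 2, "status": "refunded"}]
result = run_checks(clean, expected_count=2)
```

In production the failure list would feed the alerting layer (Slack/PagerDuty) rather than be returned to the caller.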

Data Modeling

Q: Explain the difference between star schema and snowflake schema.

Star schema: a central fact table (transactions, events) surrounded by denormalized dimension tables (customers, products, dates). Simple, fast for queries, easy to understand. Snowflake schema: dimensions are normalized into sub-dimensions (country → region → city). Saves storage but adds join complexity. Modern recommendation: use star schema for analytics warehouses. Storage is cheap; query performance and simplicity matter more.
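A star-schema query typically needs exactly one join per dimension before aggregating, which is why it stays simple and fast. A tiny example, again using `sqlite3` as a stand-in warehouse with illustrative table names:

```python
import sqlite3

# Minimal star schema: one fact table, one denormalized dimension.
# sqlite3 stands in for the warehouse; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Widget', 'Tools'), (2, 'Gadget', 'Toys');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Typical star-schema shape: join fact to dimension, then aggregate.
rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

In a snowflake schema, `category` would live in its own normalized table, adding a second join to the same query for a modest storage saving.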

Q: What is a slowly changing dimension (SCD)? Explain Type 1 vs Type 2.

SCD tracks how dimension attributes change over time. Type 1: overwrite the old value (no history). Simple but loses historical context. Type 2: insert a new row with effective dates (valid_from, valid_to) and a current flag. Preserves full history for auditing and point-in-time analysis. Type 3: add a "previous value" column — tracks one level of history. In practice, use Type 2 for important dimensions (customers, products) and Type 1 for non-critical attributes.
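The Type 2 mechanics can be sketched as an upsert: when an attribute changes, close out the current row (set `valid_to` and clear the current flag) and insert a new current row. The dimension here is a plain list of dicts with hypothetical field names; a warehouse would do this with a `MERGE` or a dbt snapshot:

```python
from datetime import date

# SCD Type 2 upsert sketch: never overwrite history, only close and append.
def scd2_upsert(dim, key, attrs, today):
    current = next((r for r in dim if r["key"] == key and r["is_current"]), None)
    if current and current["attrs"] == attrs:
        return dim  # no change: nothing to record
    if current:
        current["is_current"] = False
        current["valid_to"] = today  # close out the old version
    dim.append({"key": key, "attrs": attrs, "valid_from": today,
                "valid_to": None, "is_current": True})
    return dim

dim = []
scd2_upsert(dim, "c1", {"tier": "free"}, date(2024, 1, 1))
scd2_upsert(dim, "c1", {"tier": "pro"}, date(2024, 6, 1))  # history preserved
```

A point-in-time query then filters on `valid_from <= d AND (valid_to IS NULL OR valid_to > d)`, which is exactly what Type 1 (overwrite) cannot support.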

Q: What are the layers in a modern data warehouse (medallion architecture)?

Bronze (raw/staging): raw data as-is from sources, minimal transformations. Silver (intermediate/cleaned): cleaned, deduplicated, typed data with business logic applied. Gold (marts/presentation): aggregated, business-ready models optimized for consumers (BI tools, ML features). This maps to the dbt convention: staging/ → intermediate/ → marts/. Benefits: reprocessable from raw data, clear data lineage, separation of concerns.
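The three layers can be shown in miniature: raw records land untouched in bronze, silver types/trims/dedupes them, and gold aggregates them into a business-ready shape. Field names are illustrative:

```python
# Medallion architecture in miniature: bronze -> silver -> gold.
bronze = [
    {"user": " Alice ", "amount": "10"},
    {"user": " Alice ", "amount": "10"},   # duplicate delivery lands as-is
    {"user": "Bob", "amount": "5"},
]

# Silver: typed, trimmed, deduplicated (set of tuples drops the duplicate).
silver = list({(r["user"].strip(), float(r["amount"])) for r in bronze})

# Gold: aggregated mart, e.g. total spend per user.
gold = {}
for user, amount in silver:
    gold[user] = gold.get(user, 0.0) + amount
```

Because bronze is never modified, both downstream layers can be dropped and rebuilt from it, which is the reprocessability benefit the text describes.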

System Design Essentials

Key Insights for Interviews

Frequently Asked Questions

What tools should I know for a data engineering interview?

Core stack: SQL (Snowflake/BigQuery/Redshift), Python, dbt, Airflow/Dagster, and one cloud platform (AWS/GCP/Azure). Nice to have: Kafka, Spark, Terraform, Docker. Know the "why" behind each tool — interviewers value architectural thinking over tool-name-dropping.

How should I prepare for a system design round?

Practice designing: (1) a real-time analytics pipeline, (2) a batch ETL for a data warehouse, (3) a CDC pipeline, (4) a data quality monitoring system. Use the framework: Requirements → Data flow → Storage → Compute → Orchestration → Monitoring. Draw diagrams and discuss trade-offs at each decision point.

Related Cheat Sheets

Top 30 Snowflake Interview Questions & Answers
Top 25 SQL Interview Questions for Data Engineers
dbt Best Practices for Production Projects