Data Modeling Cheat Sheet — Dimensional, SCD, Data Vault

Intermediate • Last updated: 2026-04-16 • 6 sections

Dimensional modeling, star vs snowflake schema, SCD Types 1/2/3/4, Data Vault 2.0, One Big Table, and choosing the right model for your warehouse.

Modeling Approaches Compared

Approach                        | Best For                                                      | Key Downside
Star Schema (Kimball)           | BI / reporting, dashboards                                    | Slower to evolve when sources change
Snowflake Schema                | Storage-sensitive, highly normalized dims                     | More joins, slower queries
Data Vault 2.0                  | Enterprise DW with many sources, auditability                 | Verbose; hard for BI without a mart on top
3NF / Inmon                     | Atomic enterprise DW, then build marts                        | Heavy upfront design, slow to deliver
One Big Table (OBT)             | Columnar warehouses (Snowflake, BigQuery), simple dashboards  | Wide tables, duplicated dim data
Activity Schema                 | Product analytics, event-driven orgs                          | Few BI tools natively support it
Medallion (Bronze/Silver/Gold)  | Databricks lakehouse layering                                 | Orthogonal to modeling, not a replacement

Star Schema Example (Kimball)

-- Fact table: grain = one row per order line item
CREATE TABLE fct_order_lines (
  order_line_id   BIGINT PRIMARY KEY,
  order_id        BIGINT,
  order_date_key  INT        REFERENCES dim_date(date_key),
  customer_key    BIGINT     REFERENCES dim_customer(customer_key),
  product_key     BIGINT     REFERENCES dim_product(product_key),
  store_key       BIGINT     REFERENCES dim_store(store_key),
  quantity        INT,
  unit_price      DECIMAL(10,2),
  line_amount     DECIMAL(12,2),
  discount_amount DECIMAL(12,2)
);

-- Conformed dimension (SCD Type 2)
CREATE TABLE dim_customer (
  customer_key    BIGINT PRIMARY KEY,        -- surrogate
  customer_id     BIGINT,                    -- natural / business
  full_name       VARCHAR,
  email           VARCHAR,
  country         VARCHAR,
  tier            VARCHAR,
  effective_from  DATE,
  effective_to    DATE,                      -- 9999-12-31 for the current row
  is_current      BOOLEAN
);

-- Date dimension (build once, reuse across all facts)
CREATE TABLE dim_date (
  date_key    INT PRIMARY KEY,  -- YYYYMMDD
  full_date   DATE,
  day_of_week INT,
  month_num   INT,
  month_name  VARCHAR,
  quarter     INT,
  year        INT,
  is_weekend  BOOLEAN,
  is_holiday  BOOLEAN
);
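A typical star-join rollup over the tables above — the fact joins to conformed dimensions, and all filtering and grouping happens on dim attributes. (The revenue-by-tier cut is an illustrative query, not part of the schema.)

```sql
-- Monthly net revenue by customer tier, sliced via dim_date and dim_customer.
SELECT
  dd.year,
  dd.month_num,
  dc.tier,
  SUM(f.line_amount - f.discount_amount) AS net_revenue,
  SUM(f.quantity)                        AS units_sold
FROM fct_order_lines f
JOIN dim_date     dd ON dd.date_key     = f.order_date_key
JOIN dim_customer dc ON dc.customer_key = f.customer_key
WHERE dd.year = 2026
GROUP BY dd.year, dd.month_num, dc.tier
ORDER BY dd.year, dd.month_num, dc.tier;
```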

Slowly Changing Dimensions (SCD) Types

Type           | Behavior                                                        | When to Use
Type 0         | Never change (insert-only or frozen)                            | Birth date, original signup source
Type 1         | Overwrite — lose history                                        | Typos, corrections, non-analytical attributes
Type 2         | New row per change with effective dates + is_current flag       | History matters (customer tier changes, address changes)
Type 3         | Previous value stored in extra column                           | Track exactly one prior value (rare)
Type 4         | Current row in main table, history in separate table            | Hot/cold split when history is huge
Type 6 (1+2+3) | Hybrid: overwrite for corrections + versioning for real changes | Complex CRM data
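As a quick contrast with the Type 2 pattern below, Types 1 and 3 need only an UPDATE (and, for Type 3, one extra column). A sketch against the dim_customer table above; the specific values are illustrative:

```sql
-- SCD Type 1: overwrite in place -- history is lost.
UPDATE dim_customer
SET email = 'fixed@example.com'   -- e.g. correcting a typo
WHERE customer_id = 42;

-- SCD Type 3: keep exactly one prior value in a dedicated column.
ALTER TABLE dim_customer ADD COLUMN previous_tier VARCHAR;

UPDATE dim_customer
SET previous_tier = tier,         -- remember the old value
    tier          = 'gold'        -- then overwrite with the new one
WHERE customer_id = 42;
```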

SCD Type 2 MERGE (Snowflake / dbt-style)

-- Step 1: close the current row when any tracked attribute changes.
-- IS DISTINCT FROM is NULL-safe; plain <> would miss NULL-to-value changes.
MERGE INTO dim_customer d
USING (
  SELECT
    customer_id, full_name, email, country, tier,
    CURRENT_DATE() AS load_date
  FROM staging.customers
) s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (
    d.full_name IS DISTINCT FROM s.full_name OR
    d.email     IS DISTINCT FROM s.email     OR
    d.country   IS DISTINCT FROM s.country   OR
    d.tier      IS DISTINCT FROM s.tier
) THEN UPDATE SET
    effective_to = s.load_date - 1,
    is_current   = FALSE;

-- Step 2: insert the new current version (and brand-new customers).
-- Run after the MERGE: changed customers then have no current row, so the
-- IS DISTINCT FROM branch only matters if this statement runs standalone.
INSERT INTO dim_customer (
  customer_key, customer_id, full_name, email, country, tier,
  effective_from, effective_to, is_current
)
SELECT
  dim_customer_seq.NEXTVAL,
  s.customer_id, s.full_name, s.email, s.country, s.tier,
  CURRENT_DATE(), DATE '9999-12-31', TRUE
FROM staging.customers s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL
   OR d.full_name IS DISTINCT FROM s.full_name
   OR d.email     IS DISTINCT FROM s.email
   OR d.country   IS DISTINCT FROM s.country
   OR d.tier      IS DISTINCT FROM s.tier;

Modeling Rules That Save You Later

  • Always define the grain of a fact in one sentence before writing DDL — "one row per order line per day"
  • Use surrogate keys (BIGINT sequences) on dimensions — never foreign-key to a natural key that can change
  • Conform dimensions across facts — dim_customer used by fct_orders AND fct_support should be the same table
  • Never join fact to fact — always go through a dimension (or create a bridge table)
  • Degenerate dimensions (order_id on fct_order_lines) are fine — not every attribute needs a dim table
  • Store additive measures in facts (quantity, amount); derive ratios in the BI layer
  • Use dim_date and dim_time_of_day — never compute date parts in ad-hoc queries
  • Junk dimensions group unrelated flags (is_return, payment_type) — keeps fact table narrow
  • For Data Vault, Hubs hold business keys, Links capture relationships, Satellites store attributes with load_date
  • On columnar warehouses (Snowflake/BigQuery), consider OBT (denormalized wide table) for simple dashboards — joins are cheap, but a single scan of a pre-joined wide table is cheaper still

Common Interview Q&A

Q: When would you pick star schema over One Big Table on Snowflake?

Star schema wins when: (1) Dimensions have rich attributes used across many facts — reuse saves storage and ensures consistency. (2) Dimensions change (SCD Type 2) — updating one row in a dim is cheaper than updating millions of fact rows. (3) BI tool expects star (many dashboards auto-join on dims). OBT wins when: (1) Single fact, shallow analytics. (2) Aggressive denormalization for fastest dashboard scans. (3) Team is small and schema evolution is infrequent. Many teams use star for core and OBT for specific high-volume dashboards.
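One way to materialize an OBT from an existing star is a CREATE TABLE AS over the star join. A sketch using the Kimball tables above; the column selection is illustrative, and dim_product's attributes are assumed:

```sql
-- Denormalize the star into one wide table for a specific dashboard.
-- Dim attributes are duplicated per fact row -- cheap on columnar storage.
CREATE OR REPLACE TABLE obt_order_lines AS
SELECT
  f.order_line_id, f.order_id, f.quantity, f.line_amount, f.discount_amount,
  dd.full_date, dd.year, dd.month_name, dd.is_weekend,
  dc.full_name AS customer_name, dc.country, dc.tier,
  dp.product_key  -- plus whatever product attributes the dashboard needs
FROM fct_order_lines f
JOIN dim_date     dd ON dd.date_key     = f.order_date_key
JOIN dim_customer dc ON dc.customer_key = f.customer_key
                    AND dc.is_current   = TRUE
JOIN dim_product  dp ON dp.product_key  = f.product_key;
```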

Q: How do you design for late-arriving facts?

Late-arriving fact scenario: order row arrives after the customer dim has been updated to a newer version. Options: (1) Use is_current + effective dates on dim: look up the dim row whose effective_from/to range covers the fact event_date. (2) Store the snapshot date on the fact and always join on business key + date range (slower but accurate). (3) Accept "as-is at load time" if business is OK with that. In dbt, a macro like scd_join(fact.event_date, dim, business_key) encapsulates the range lookup.
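Option 1 as SQL, using the effective-date columns on dim_customer from the Kimball section. This assumes the fact also carries the natural customer_id (a common choice when facts can arrive late, since the correct surrogate may not be knowable at load time):

```sql
-- Point-in-time lookup: join the fact to the dim version that was
-- active on the day the event happened, not the version current at load.
SELECT
  f.order_line_id,
  f.line_amount,
  d.tier AS tier_at_order_time
FROM fct_order_lines f
JOIN dim_date dd
  ON dd.date_key = f.order_date_key
JOIN dim_customer d
  ON  d.customer_id = f.customer_id            -- business key, not surrogate
  AND dd.full_date BETWEEN d.effective_from AND d.effective_to;
```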

Q: How is Data Vault different from Kimball and when do you use it?

Data Vault 2.0 separates keys (Hubs), relationships (Links), and descriptive data (Satellites). Advantages: append-only (easy audit), agile (add sources without redesign), parallel loading. Disadvantages: 3-5x more tables than Kimball, not query-friendly — you always build a Kimball star on top for BI. Use Data Vault for: enterprise warehouses with 50+ sources, regulated industries needing full auditability, teams with dedicated modelers. Skip for small-to-mid analytics where Kimball ships faster.
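A minimal Hub / Link / Satellite sketch for the customer side of an order. Hash-key and naming conventions vary by team; these names are illustrative, and hub_order is assumed to exist with the same shape as hub_customer:

```sql
-- Hub: one row per business key, append-only.
CREATE TABLE hub_customer (
  customer_hk   BINARY(16) PRIMARY KEY,  -- hash of the business key
  customer_id   BIGINT,                  -- business key
  load_date     TIMESTAMP,
  record_source VARCHAR
);

-- Link: one row per relationship instance between hubs.
CREATE TABLE link_order_customer (
  order_customer_hk BINARY(16) PRIMARY KEY,
  order_hk          BINARY(16) REFERENCES hub_order(order_hk),
  customer_hk       BINARY(16) REFERENCES hub_customer(customer_hk),
  load_date         TIMESTAMP,
  record_source     VARCHAR
);

-- Satellite: descriptive attributes, versioned by load_date.
CREATE TABLE sat_customer_details (
  customer_hk   BINARY(16) REFERENCES hub_customer(customer_hk),
  load_date     TIMESTAMP,
  full_name     VARCHAR,
  email         VARCHAR,
  country       VARCHAR,
  tier          VARCHAR,
  hash_diff     BINARY(16),              -- hash of attributes, for change detection
  PRIMARY KEY (customer_hk, load_date)
);
```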

Q: Explain additive, semi-additive, and non-additive facts.

Additive: can be summed across all dimensions (revenue, quantity). Semi-additive: summable across some dimensions, not time (account balance — sum across customers at a point in time, but not across days). Non-additive: never summable (unit price, ratios, percentages). Implication: semi-additive needs period-ending snapshots (fct_balance_daily), non-additive should be derived at query time, not stored pre-aggregated.
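The balance example as SQL: summing across customers on one day is valid, but summing the same column across days is not — you take a period-ending value instead. fct_balance_daily(balance_date, customer_key, balance) is an assumed snapshot table:

```sql
-- Across customers at a point in time: plain SUM is valid.
SELECT SUM(balance) AS total_balance
FROM fct_balance_daily
WHERE balance_date = DATE '2026-04-15';

-- Across time: take the period-ending snapshot, then sum.
SELECT SUM(balance) AS april_month_end_balance
FROM fct_balance_daily
WHERE balance_date = (SELECT MAX(balance_date)
                      FROM fct_balance_daily
                      WHERE balance_date < DATE '2026-05-01');
```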

Frequently Asked Questions

Is dimensional modeling still relevant in 2026?

Yes — even with columnar warehouses (Snowflake, BigQuery, Databricks SQL) where joins are cheap, dimensional modeling remains the clearest way to structure analytics. What has changed: you can denormalize more aggressively (OBT for simple cases), skip snowflake schemas (storage is cheap), and use column-level lineage tools. The core idea — facts at a clear grain + conformed dimensions — still prevents the BI chaos that plagues ad-hoc schemas.

What is the Medallion architecture and how does it relate to Kimball?

Medallion (Bronze → Silver → Gold) is a lakehouse LAYERING pattern, not a modeling approach. Bronze = raw, Silver = cleaned/joined, Gold = business-ready. You still need a modeling approach inside Gold — typically Kimball star schema or OBT. Think of Medallion as the zones (promotion stages) and Kimball as what you model within the Gold zone.

SCD Type 2 vs snapshots — which should I use?

SCD Type 2: one row per version with effective_from/to. Efficient storage (only writes on change), but needs careful joins on date ranges. Snapshots: one row per entity per day (or week). Simpler equality joins on the snapshot date, but roughly 365x the storage for a daily snapshot. Use SCD Type 2 for large dims (millions of rows) and snapshots for small, high-change-rate dims where simplicity matters. dbt snapshots generate SCD Type 2 automatically.
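A dbt snapshot that generates the SCD Type 2 columns (dbt_valid_from, dbt_valid_to) automatically, using the check strategy. A sketch; the snapshot and source names are illustrative:

```sql
{% snapshot customers_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='check',
    check_cols=['full_name', 'email', 'country', 'tier']
  )
}}

SELECT customer_id, full_name, email, country, tier
FROM {{ source('staging', 'customers') }}

{% endsnapshot %}
```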

How do surrogate keys help and when can I skip them?

Surrogate keys (auto-increment integers) insulate dims from source-system changes. Benefits: (1) Stable across source re-keyings. (2) Smaller joins than long natural keys (UUID strings). (3) Allow SCD Type 2 (multiple rows per natural key need a unique surrogate). Skip when: dim is static (dim_date uses YYYYMMDD), performance is fine on natural key, no SCD Type 2 needed. On Snowflake, sequences or IDENTITY columns generate surrogates cheaply.
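Both Snowflake options for generating surrogates, as a sketch (dim_product's columns are illustrative):

```sql
-- Option 1: an explicit sequence, referenced at insert time.
CREATE SEQUENCE dim_customer_seq START = 1 INCREMENT = 1;
-- then: INSERT ... SELECT dim_customer_seq.NEXTVAL, ...

-- Option 2: an IDENTITY column -- the key is assigned automatically on insert.
CREATE TABLE dim_product (
  product_key  BIGINT IDENTITY(1,1) PRIMARY KEY,
  product_id   VARCHAR,              -- natural key from the source
  product_name VARCHAR
);
```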

Should I build a data mart per team or one centralized warehouse?

Modern answer: one centralized conformed warehouse (shared facts + dims) + team-specific marts on top (team views or small mart schemas). This avoids the silo-per-team anti-pattern where every team rebuilds customer dim differently. dbt project structure supports this: models/core/ (shared) and models/marts/<team>/ (team-specific). Access control via schema-level grants.

How do I handle many-to-many relationships in a star schema?

Use a bridge table between the fact and the dim. Example: an order can have multiple promotions. Create bridge_order_promotion(order_id, promotion_key, weighting_factor) so aggregations can allocate revenue correctly. Bridges can explode row counts, so always document allocation semantics and test against ungrouped totals.
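The promotion example as DDL plus an allocation query. dim_promotion is an assumed dimension, and the weighting factors for each order_id must sum to 1.0 for the allocation to reconcile with ungrouped totals:

```sql
CREATE TABLE bridge_order_promotion (
  order_id         BIGINT,
  promotion_key    BIGINT REFERENCES dim_promotion(promotion_key),
  weighting_factor DECIMAL(5,4)   -- factors per order_id sum to 1.0
);

-- Allocate line revenue across promotions via the bridge.
SELECT
  p.promotion_name,
  SUM(f.line_amount * b.weighting_factor) AS allocated_revenue
FROM fct_order_lines f
JOIN bridge_order_promotion b ON b.order_id      = f.order_id
JOIN dim_promotion p          ON p.promotion_key = b.promotion_key
GROUP BY p.promotion_name;
```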

What is a junk dimension and why use one?

A junk dimension combines several low-cardinality flags (is_return, payment_type, shipping_method) into one dim with all valid combinations. Keeps fact table narrow (one key vs 3 columns), centralizes flag definitions, and allows adding new attributes without altering the fact. Especially useful when facts are billions of rows.
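A sketch of the junk dimension and its one-time population with the cross product of all valid flag values (the specific values are illustrative):

```sql
CREATE TABLE dim_order_flags (
  order_flags_key INT PRIMARY KEY,
  is_return       BOOLEAN,
  payment_type    VARCHAR,   -- 'card', 'paypal', 'invoice'
  shipping_method VARCHAR    -- 'standard', 'express'
);

-- Populate once with every valid combination; the fact then stores
-- a single order_flags_key instead of three separate columns.
INSERT INTO dim_order_flags
SELECT
  ROW_NUMBER() OVER (ORDER BY is_return, payment_type, shipping_method),
  is_return, payment_type, shipping_method
FROM (VALUES (TRUE), (FALSE)) AS r(is_return)
CROSS JOIN (VALUES ('card'), ('paypal'), ('invoice')) AS p(payment_type)
CROSS JOIN (VALUES ('standard'), ('express')) AS s(shipping_method);
```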

Related Cheat Sheets

  • Snowflake SQL Cheat Sheet
  • dbt Commands Cheat Sheet
  • Snowflake Best Practices for Data Engineers
  • Data Engineering Interview Questions & Answers