Intermediate • Last updated: 2026-04-16 • 6 sections
Dimensional modeling, star vs snowflake schema, SCD Types 1/2/3/4, Data Vault 2.0, One Big Table, and choosing the right model for your warehouse.
| Approach | Best For | Key Downside |
|---|---|---|
| Star Schema (Kimball) | BI / reporting, dashboards | Slower to evolve when sources change |
| Snowflake Schema | Storage-sensitive warehouses with highly normalized dims | More joins, slower queries |
| Data Vault 2.0 | Enterprise DW with many sources, auditability | Verbose, hard for BI without a mart on top |
| 3NF / Inmon | Atomic enterprise DW, then build marts | Heavy upfront design, slow to deliver |
| One Big Table (OBT) | Columnar warehouses (Snowflake, BigQuery), simple dashboards | Wide tables, duplicated dim data |
| Activity Schema | Product analytics, event-driven orgs | Few BI tools natively support it |
| Medallion (Bronze/Silver/Gold) | Databricks lakehouse layering | Orthogonal to modeling — not a replacement |
```sql
-- Fact table: grain = one row per order line item
CREATE TABLE fct_order_lines (
  order_line_id   BIGINT PRIMARY KEY,
  order_id        BIGINT,
  order_date_key  INT REFERENCES dim_date(date_key),
  customer_key    BIGINT REFERENCES dim_customer(customer_key),
  product_key     BIGINT REFERENCES dim_product(product_key),
  store_key       BIGINT REFERENCES dim_store(store_key),
  quantity        INT,
  unit_price      DECIMAL(10,2),
  line_amount     DECIMAL(12,2),
  discount_amount DECIMAL(12,2)
);

-- Conformed dimension (SCD Type 2)
CREATE TABLE dim_customer (
  customer_key   BIGINT PRIMARY KEY,  -- surrogate
  customer_id    BIGINT,              -- natural / business
  full_name      VARCHAR,
  email          VARCHAR,
  country        VARCHAR,
  tier           VARCHAR,
  effective_from DATE,
  effective_to   DATE,                -- NULL or 9999-12-31 for current
  is_current     BOOLEAN
);

-- Date dimension (build once, reuse across all facts)
CREATE TABLE dim_date (
  date_key    INT PRIMARY KEY,  -- YYYYMMDD
  full_date   DATE,
  day_of_week INT,
  month_num   INT,
  month_name  VARCHAR,
  quarter     INT,
  year        INT,
  is_weekend  BOOLEAN,
  is_holiday  BOOLEAN
);
```

| Type | Behavior | When to Use |
|---|---|---|
| Type 0 | Never change (insert-only or frozen) | Birth date, original signup source |
| Type 1 | Overwrite — lose history | Typos, corrections, non-analytical attributes |
| Type 2 | New row per change with effective dates + is_current flag | History matters (customer tier changes, address changes) |
| Type 3 | Previous value stored in extra column | Track exactly 1 prior value (rare) |
| Type 4 | Current row in main table, history in separate table | Hot/cold split when history is huge |
| Type 6 (1+2+3) | Hybrid: overwrite for corrections + versioning for real changes | Complex CRM data |
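Type 1 and Type 3 need no extra machinery; a sketch of both against `dim_customer` (the `prev_tier` column is illustrative, added only for the Type 3 case):

```sql
-- Type 1: overwrite in place, history is lost
UPDATE dim_customer
SET email = 'new@example.com'
WHERE customer_id = 42 AND is_current = TRUE;

-- Type 3: keep exactly one prior value in a dedicated column
ALTER TABLE dim_customer ADD COLUMN prev_tier VARCHAR;

UPDATE dim_customer
SET prev_tier = tier,     -- preserve the single previous value
    tier      = 'gold'
WHERE customer_id = 42 AND is_current = TRUE;
```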
```sql
-- Step 1: close the current row when any tracked attribute changes.
-- IS DISTINCT FROM is NULL-safe, unlike <>, so NULL-to-value changes are caught.
MERGE INTO dim_customer d
USING (
  SELECT
    customer_id, full_name, email, country, tier,
    CURRENT_DATE() AS load_date
  FROM staging.customers
) s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (
  d.full_name IS DISTINCT FROM s.full_name OR
  d.email     IS DISTINCT FROM s.email OR
  d.country   IS DISTINCT FROM s.country OR
  d.tier      IS DISTINCT FROM s.tier
) THEN UPDATE SET
  effective_to = DATEADD('day', -1, s.load_date),
  is_current   = FALSE;

-- Step 2: insert the new current version. Must run after the MERGE:
-- changed customers no longer have a current row, so the anti-join
-- picks them up along with brand-new customers.
INSERT INTO dim_customer (
  customer_key, customer_id, full_name, email, country, tier,
  effective_from, effective_to, is_current
)
SELECT
  dim_customer_seq.NEXTVAL,
  s.customer_id, s.full_name, s.email, s.country, s.tier,
  CURRENT_DATE(), DATE '9999-12-31', TRUE
FROM staging.customers s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;  -- no current row: brand-new or just closed above
```

Q: When would you pick star schema over One Big Table on Snowflake?
Star schema wins when: (1) Dimensions have rich attributes used across many facts — reuse saves storage and ensures consistency. (2) Dimensions change (SCD Type 2) — updating one row in a dim is cheaper than updating millions of fact rows. (3) BI tool expects star (many dashboards auto-join on dims). OBT wins when: (1) Single fact, shallow analytics. (2) Aggressive denormalization for fastest dashboard scans. (3) Team is small and schema evolution is infrequent. Many teams use star for core and OBT for specific high-volume dashboards.
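For the OBT path, a common pattern is to materialize the star into one wide table for a specific dashboard. A minimal sketch built from the tables above (`product_name` is an assumed column on `dim_product`):

```sql
-- One Big Table for a revenue dashboard: flatten the star into a wide table
CREATE OR REPLACE TABLE obt_order_lines AS
SELECT
  f.order_line_id,
  d.full_date   AS order_date,
  c.full_name   AS customer_name,
  c.country,
  c.tier,
  p.product_name,              -- assumed column on dim_product
  f.quantity,
  f.line_amount,
  f.discount_amount
FROM fct_order_lines f
JOIN dim_date     d ON d.date_key     = f.order_date_key
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_product  p ON p.product_key  = f.product_key
WHERE c.is_current = TRUE;     -- OBT typically freezes "current" dim attributes
```

The trade-off is visible in the WHERE clause: the OBT bakes in today's customer attributes, so historical tier analysis needs a rebuild or a point-in-time join instead.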
Q: How do you design for late-arriving facts?
Late-arriving fact scenario: order row arrives after the customer dim has been updated to a newer version. Options: (1) Use is_current + effective dates on dim: look up the dim row whose effective_from/to range covers the fact event_date. (2) Store the snapshot date on the fact and always join on business key + date range (slower but accurate). (3) Accept "as-is at load time" if business is OK with that. In dbt, a macro like scd_join(fact.event_date, dim, business_key) encapsulates the range lookup.
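Option (1) sketched against the `dim_customer` table defined earlier, assuming `effective_to` is stored as `9999-12-31` for the current row (a NULL `effective_to` would break BETWEEN) and that the fact carries the business key `customer_id`:

```sql
-- Point-in-time join: pick the dim version valid at the fact's event date
SELECT
  f.order_line_id,
  f.line_amount,
  c.tier            -- tier as it was on the order date, not today
FROM fct_order_lines f
JOIN dim_date d
  ON d.date_key = f.order_date_key
JOIN dim_customer c
  ON c.customer_id = f.customer_id   -- business key, assumed present on the fact
 AND d.full_date BETWEEN c.effective_from AND c.effective_to;
```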
Q: How is Data Vault different from Kimball and when do you use it?
Data Vault 2.0 separates keys (Hubs), relationships (Links), and descriptive data (Satellites). Advantages: append-only (easy audit), agile (add sources without redesign), parallel loading. Disadvantages: 3-5x more tables than Kimball, not query-friendly — you always build a Kimball star on top for BI. Use Data Vault for: enterprise warehouses with 50+ sources, regulated industries needing full auditability, teams with dedicated modelers. Skip for small-to-mid analytics where Kimball ships faster.
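A minimal Hub/Link/Satellite sketch for the customer-order relationship (hash-key and metadata column names are illustrative conventions, not prescribed by the standard):

```sql
-- Hub: business keys only
CREATE TABLE hub_customer (
  customer_hk   BINARY(16) PRIMARY KEY,  -- hash of the business key
  customer_id   BIGINT,                  -- business key
  load_date     TIMESTAMP,
  record_source VARCHAR
);

-- Link: relationship between hubs
CREATE TABLE link_customer_order (
  customer_order_hk BINARY(16) PRIMARY KEY,
  customer_hk       BINARY(16) REFERENCES hub_customer(customer_hk),
  order_hk          BINARY(16),
  load_date         TIMESTAMP,
  record_source     VARCHAR
);

-- Satellite: descriptive attributes, append-only history
CREATE TABLE sat_customer_details (
  customer_hk BINARY(16) REFERENCES hub_customer(customer_hk),
  load_date   TIMESTAMP,
  hash_diff   BINARY(16),  -- hash of all attributes, for change detection
  full_name   VARCHAR,
  email       VARCHAR,
  country     VARCHAR,
  tier        VARCHAR,
  PRIMARY KEY (customer_hk, load_date)
);
```

Note how one Kimball `dim_customer` becomes three tables here, which is why a star schema is rebuilt on top for BI consumption.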
Q: Explain additive, semi-additive, and non-additive facts.
Additive: can be summed across all dimensions (revenue, quantity). Semi-additive: summable across some dimensions, not time (account balance — sum across customers at a point in time, but not across days). Non-additive: never summable (unit price, ratios, percentages). Implication: semi-additive needs period-ending snapshots (fct_balance_daily), non-additive should be derived at query time, not stored pre-aggregated.
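The semi-additive pitfall in SQL: summing a daily balance snapshot across time double-counts the same money. Aggregate at a point in time or take the period-ending value instead (`fct_balance_daily` and its columns are illustrative; QUALIFY is Snowflake syntax):

```sql
-- WRONG for semi-additive: sums the same balance once per day
-- SELECT customer_key, SUM(balance) FROM fct_balance_daily GROUP BY customer_key;

-- Correct: total balance across customers at one point in time
SELECT SUM(balance) AS total_balance
FROM fct_balance_daily
WHERE date_key = 20260131;

-- Correct: period-ending balance per customer
SELECT customer_key, balance AS month_end_balance
FROM fct_balance_daily
WHERE date_key BETWEEN 20260101 AND 20260131
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_key ORDER BY date_key DESC
) = 1;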
Q: Is dimensional modeling still relevant on modern columnar warehouses?
Yes — even with columnar warehouses (Snowflake, BigQuery, Databricks SQL) where joins are cheap, dimensional modeling remains the clearest way to structure analytics. What has changed: you can denormalize more aggressively (OBT for simple cases), skip snowflake schemas (storage is cheap), and use column-level lineage tools. The core idea — facts at a clear grain + conformed dimensions — still prevents the BI chaos that plagues ad-hoc schemas.
Q: How does Medallion architecture relate to dimensional modeling?
Medallion (Bronze → Silver → Gold) is a lakehouse *layering* pattern, not a modeling approach. Bronze = raw, Silver = cleaned/joined, Gold = business-ready. You still need a modeling approach inside Gold — typically Kimball star schema or OBT. Think of Medallion as the zones (promotion stages) and Kimball as what you model within the Gold zone.
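A sketch of the promotion path, with a Kimball model living inside Gold (schema and table names are illustrative):

```sql
-- Silver: cleaned and deduplicated from raw Bronze landing tables
CREATE OR REPLACE TABLE silver.orders AS
SELECT DISTINCT
  order_id,
  customer_id,
  CAST(order_ts AS DATE) AS order_date,
  amount
FROM bronze.raw_orders
WHERE order_id IS NOT NULL;

-- Gold: business-ready and dimensionally modeled (Kimball lives here)
CREATE OR REPLACE TABLE gold.fct_orders AS
SELECT
  o.order_id,
  TO_NUMBER(TO_CHAR(o.order_date, 'YYYYMMDD')) AS order_date_key,
  c.customer_key,        -- surrogate from gold.dim_customer
  o.amount
FROM silver.orders o
JOIN gold.dim_customer c
  ON c.customer_id = o.customer_id AND c.is_current = TRUE;
```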
Q: SCD Type 2 or daily snapshots for dimension history?
SCD Type 2: one row per version with effective_from/to. Efficient storage (only writes on change), needs careful joins on date ranges. Snapshots: one row per entity per day (or week). Simpler joins (dim_customer_20260415), but 365x storage for a daily snapshot. Use SCD Type 2 for large dims (millions of rows) and snapshots for small, high-change-rate dims where simplicity matters. dbt snapshots generate SCD Type 2 automatically.
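A dbt snapshot that generates SCD Type 2 rows automatically — dbt adds `dbt_valid_from`/`dbt_valid_to` columns on each change (the source name and tracked columns here are assumptions matching the customer example above):

```sql
{% snapshot customers_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['full_name', 'email', 'country', 'tier']
    )
}}
SELECT customer_id, full_name, email, country, tier
FROM {{ source('crm', 'customers') }}
{% endsnapshot %}
```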
Q: Why use surrogate keys instead of natural keys?
Surrogate keys (auto-increment integers) insulate dims from source-system changes. Benefits: (1) Stable across source re-keyings. (2) Smaller joins than long natural keys (UUID strings). (3) Allow SCD Type 2 (multiple rows per natural key need a unique surrogate). Skip when: dim is static (dim_date uses YYYYMMDD), performance is fine on natural key, no SCD Type 2 needed. On Snowflake, sequences or IDENTITY columns generate surrogates cheaply.
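Both Snowflake options, sketched (`dim_supplier` is an illustrative table):

```sql
-- Option 1: explicit sequence, referenced as <seq>.NEXTVAL in inserts
CREATE SEQUENCE IF NOT EXISTS dim_customer_seq;

-- Option 2: IDENTITY column generates the surrogate on insert
CREATE TABLE dim_supplier (
  supplier_key  BIGINT IDENTITY PRIMARY KEY,  -- surrogate, auto-generated
  supplier_id   VARCHAR,                      -- natural key from the source
  supplier_name VARCHAR
);
```

A sequence is the better fit for SCD Type 2 loads, since the key can be generated inside an INSERT … SELECT like the one earlier in this section.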
Q: One centralized warehouse or a mart per team?
Modern answer: one centralized conformed warehouse (shared facts + dims) + team-specific marts on top (team views or small mart schemas). This avoids the silo-per-team anti-pattern where every team rebuilds customer dim differently. dbt project structure supports this: models/core/ (shared) and models/marts/<team>/ (team-specific). Access control via schema-level grants.
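Schema-level grants for the shared-core-plus-marts layout might look like this (role and schema names are illustrative):

```sql
-- Everyone reads the shared core; each team reads only its own mart
GRANT USAGE  ON SCHEMA core TO ROLE analyst_all;
GRANT SELECT ON ALL TABLES IN SCHEMA core TO ROLE analyst_all;

GRANT USAGE  ON SCHEMA marts_finance TO ROLE analyst_finance;
GRANT SELECT ON ALL TABLES IN SCHEMA marts_finance TO ROLE analyst_finance;
```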
Q: How do you model a many-to-many relationship between a fact and a dimension?
Use a bridge table between the fact and the dim. Example: an order can have multiple promotions. Create bridge_order_promotion(order_id, promotion_key, weighting_factor) so aggregations can allocate revenue correctly. Bridges can explode row counts, so always document allocation semantics and test against ungrouped totals.
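Allocation through the bridge, using the bridge_order_promotion table described above (`promotion_name` is an assumed column on a `dim_promotion`):

```sql
-- Allocate line revenue across promotions via the weighting factor
SELECT
  p.promotion_name,
  SUM(f.line_amount * b.weighting_factor) AS allocated_revenue
FROM fct_order_lines f
JOIN bridge_order_promotion b ON b.order_id      = f.order_id
JOIN dim_promotion          p ON p.promotion_key = b.promotion_key
GROUP BY p.promotion_name;

-- Sanity check: SUM(weighting_factor) per order should equal 1.0,
-- so allocated totals reconcile with the ungrouped SUM(line_amount).
```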
Q: What is a junk dimension?
A junk dimension combines several low-cardinality flags (is_return, payment_type, shipping_method) into one dim with all valid combinations. Keeps fact table narrow (one key vs 3 columns), centralizes flag definitions, and allows adding new attributes without altering the fact. Especially useful when facts are billions of rows.
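A sketch: build the junk dim as the cross product of valid flag values, then reference a single key from the fact (the specific flag values are illustrative; `FROM VALUES` is Snowflake syntax):

```sql
CREATE TABLE dim_order_flags (
  order_flags_key INT PRIMARY KEY,
  is_return       BOOLEAN,
  payment_type    VARCHAR,   -- e.g. 'card', 'paypal', 'invoice'
  shipping_method VARCHAR    -- e.g. 'standard', 'express'
);

-- Populate with every valid combination (2 x 3 x 2 = 12 rows)
INSERT INTO dim_order_flags
SELECT
  ROW_NUMBER() OVER (ORDER BY r.v, p.v, s.v),
  r.v, p.v, s.v
FROM       (SELECT column1 AS v FROM VALUES (TRUE), (FALSE))                    r
CROSS JOIN (SELECT column1 AS v FROM VALUES ('card'), ('paypal'), ('invoice'))  p
CROSS JOIN (SELECT column1 AS v FROM VALUES ('standard'), ('express'))          s;
```

The fact then carries one `order_flags_key` column instead of three flag columns.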