Intermediate • Last updated: 2026-04-16 • 6 sections
Databricks essentials: Delta Lake, Unity Catalog, SQL warehouses, workflows, Auto Loader, and performance tuning. Includes comparisons to Snowflake.
-- Create managed Delta table (Unity Catalog)
CREATE TABLE main.sales.orders (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(12,2),
order_date DATE
)
USING DELTA
PARTITIONED BY (order_date);
-- MERGE (upsert)
MERGE INTO main.sales.orders t
USING staging.orders_new s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Time travel — query data as of a timestamp or version
SELECT * FROM main.sales.orders TIMESTAMP AS OF '2026-04-01T00:00:00';
SELECT * FROM main.sales.orders VERSION AS OF 42;
-- Compact small files
OPTIMIZE main.sales.orders;
-- Z-ORDER to co-locate data by frequently filtered columns
-- (columns must exist on the table and cannot be partition columns)
OPTIMIZE main.sales.orders ZORDER BY (customer_id);
-- Vacuum old files (default 7-day retention)
VACUUM main.sales.orders RETAIN 168 HOURS;
-- Change Data Feed (CDC output)
ALTER TABLE main.sales.orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
SELECT * FROM table_changes('main.sales.orders', 42);

| Level | Grant Example | Use Case |
|---|---|---|
| Metastore | GRANT CREATE CATALOG ON METASTORE TO `admins` | Account-level admin |
| Catalog | GRANT USE CATALOG ON CATALOG main TO `analysts` | Business domain isolation |
| Schema | GRANT USE SCHEMA ON SCHEMA main.sales TO `sales_team` | Team boundary |
| Table | GRANT SELECT ON TABLE main.sales.orders TO `analysts` | Standard table access |
| Column | Dynamic views or column masks | Column-level PII protection |
| Row | Row filter function (similar to RLS) | Multi-tenant data isolation |
| Volume | GRANT READ VOLUME ON VOLUME main.raw.landing TO `etl_role` | Access to files in ADLS/S3 |
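The row-filter entry above maps to a simple predicate. Unity Catalog row filters are SQL functions attached to a table, but the logic they express is easy to model; a pure-Python sketch (group and column names here are hypothetical, not the actual API):

```python
def row_filter(user_groups, row):
    """Toy model of a Unity Catalog row filter: admins see every row,
    other users see only rows for tenants whose group they belong to."""
    if "admins" in user_groups:
        return True
    return row["tenant_id"] in user_groups

rows = [{"tenant_id": "acme"}, {"tenant_id": "globex"}]
# A user in the hypothetical "acme" group sees only acme's rows.
visible = [r for r in rows if row_filter({"acme"}, r)]
```

In Unity Catalog the equivalent is a SQL function set via ALTER TABLE ... SET ROW FILTER, evaluated per row at query time.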
from pyspark.sql import functions as F
from pyspark.sql.functions import col
stream = (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/_schemas/orders")
.option("cloudFiles.inferColumnTypes", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.load("abfss://[email protected]/orders/")
)
clean = (
stream
.filter(col("amount") > 0)
.withColumn("ingest_ts", F.current_timestamp())
)
query = (
clean.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/_checkpoints/orders")
.option("mergeSchema", "true")
.trigger(availableNow=True) # process all new files, then stop
.toTable("main.sales.orders")
)
query.awaitTermination()

{
"name": "orders_daily_etl",
"tasks": [
{
"task_key": "bronze_ingest",
"notebook_task": { "notebook_path": "/etl/bronze_orders" },
"job_cluster_key": "main_cluster"
},
{
"task_key": "silver_clean",
"depends_on": [{ "task_key": "bronze_ingest" }],
"notebook_task": { "notebook_path": "/etl/silver_orders" },
"job_cluster_key": "main_cluster"
},
{
"task_key": "gold_aggregate",
"depends_on": [{ "task_key": "silver_clean" }],
"sql_task": {
"query": { "query_id": "abc-123" },
"warehouse_id": "xyz-456"
}
}
],
"job_clusters": [{
"job_cluster_key": "main_cluster",
"new_cluster": {
"spark_version": "15.4.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 4,
"runtime_engine": "PHOTON"
}
}],
"schedule": { "quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC" },
"email_notifications": { "on_failure": ["[email protected]"] }
}

Q: Databricks vs Snowflake for a new analytics platform?
Snowflake wins when: workload is SQL-heavy (BI, dashboards, ELT), team is SQL-first, governance/security is top priority, data sharing is important, you want zero infrastructure to manage. Databricks wins when: workload is Spark/ML-heavy, you need open table formats across engines (Delta + Iceberg), you have large unstructured data (images, logs), data science and ML are core. Many teams use both: Databricks for ML + ingestion, Snowflake for analytics + BI, sharing Iceberg tables between them.
Q: Delta Lake vs Iceberg vs Snowflake native tables?
Delta Lake: best on Databricks, strong Spark integration, CDF, Liquid Clustering. Iceberg: open standard, multi-engine (Snowflake, Spark, Trino, Flink), best for cross-tool portability. Snowflake native: auto-clustering, time travel, fastest in Snowflake. Databricks now also reads/writes Iceberg via UniForm. Pick Iceberg for maximum portability, Delta for Databricks-first, native Snowflake for Snowflake-first.
Q: Photon vs classic Databricks runtime?
Photon is a C++ vectorized query engine that replaces parts of Spark SQL execution. 2-8x faster on SQL/DataFrame workloads, same API. Costs 2x DBU (not 2x dollars — typically ~20% net cheaper due to speedup). Enable on SQL warehouses (default) and job clusters running heavy SQL. Skip on pure ML/UDF workloads where Photon does not apply.
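The "net cheaper" claim is simple arithmetic: Photon doubles DBU charges but not VM charges, so there is a break-even speedup above which it saves money. A sketch with hypothetical hourly rates (actual DBU prices vary by cloud and tier):

```python
def hourly_cost(dbu_per_hour, dbu_price, vm_per_hour, photon=False):
    # Photon doubles the DBU emission rate; the underlying VM price is unchanged.
    dbu = dbu_per_hour * (2 if photon else 1) * dbu_price
    return dbu + vm_per_hour

# Hypothetical rates: a cluster emitting 6 DBU/h at $0.15/DBU, $1.00/h of VMs.
classic = hourly_cost(6, 0.15, 1.00)               # 1.90 per hour of work
photon = hourly_cost(6, 0.15, 1.00, photon=True)   # 2.80 per wall-clock hour
breakeven = photon / classic  # ~1.47x: any speedup above this is net cheaper
```

Since Photon typically delivers 2-8x on SQL workloads, it clears this break-even comfortably; on UDF-heavy code that falls back to classic execution, it does not.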
Q: Serverless SQL warehouse vs classic SQL warehouse vs Jobs cluster?
Serverless SQL: sub-10s startup, fully managed by Databricks, best for BI. Classic SQL: runs in your cloud account VPC, longer startup, needed for VNet injection or private workloads. Jobs cluster: general Spark cluster for ETL notebooks/jars — cheapest for long batch runs. Use Serverless SQL for dashboards, Jobs for ETL, Classic only when compliance requires your VPC.
Q: How do I fix the small-file problem on a Delta table?
Run OPTIMIZE table_name to combine files into ~1GB targets. Add ZORDER BY (col1, col2) to co-locate by frequent filter columns. Enable auto-compaction at write time: ALTER TABLE t SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true, delta.autoOptimize.optimizeWrite = true). In 2026 most teams use Liquid Clustering (CLUSTER BY), which removes the need for manual OPTIMIZE/ZORDER.
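To see why the ~1GB target matters, compare file counts before and after compaction; the sizes below are illustrative, not measured:

```python
import math

def file_count(total_bytes, file_size_bytes):
    """Number of files needed to hold total_bytes at a given file size."""
    return math.ceil(total_bytes / file_size_bytes)

MB, GB = 1024**2, 1024**3
total = 250 * GB  # e.g. one day of streaming ingest (hypothetical volume)

before = file_count(total, 5 * MB)  # tiny streaming files: 51,200 file opens
after = file_count(total, 1 * GB)   # post-OPTIMIZE ~1GB targets: 250 files
```

Every query pays per-file overhead (listing, footer reads, task scheduling), so collapsing tens of thousands of files into a few hundred is usually the single biggest scan-time win.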
Q: Managed vs external tables in Unity Catalog?
Managed tables: Databricks owns the storage location, and DROP TABLE deletes the data. External tables: point to a user-provided path (abfss://, s3://), and DROP TABLE keeps the data. Managed is simpler and recommended unless you share the same files with other engines (Synapse, Snowflake, Trino); in that case use external. External tables require an external location plus a storage credential in Unity Catalog.
Q: When should I use Delta Live Tables (DLT)?
DLT is a declarative pipeline framework: you define transformations, and DLT manages dependency order, retries, expectations (data quality), and incremental refresh. Use DLT when you have Spark streaming or incremental pipelines, you want built-in data quality tests (EXPECT clauses), or you value less boilerplate. Skip DLT when your pipelines are pure SQL on Delta (plain Workflows are cheaper) or your team already uses dbt (DLT and dbt overlap).
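An EXPECT clause boils down to counting violations and optionally dropping the offending rows. A toy Python model of that semantics (this is not the DLT API, just the logic it applies):

```python
def apply_expectation(rows, predicate, on_violation="warn"):
    """Mimic a DLT expectation: 'warn' keeps bad rows but counts them;
    'drop' removes them, like EXPECT ... ON VIOLATION DROP ROW."""
    violations = sum(1 for r in rows if not predicate(r))
    kept = rows if on_violation == "warn" else [r for r in rows if predicate(r)]
    return kept, violations

orders = [{"amount": 10}, {"amount": -3}]
kept, bad = apply_expectation(orders, lambda r: r["amount"] > 0, "drop")
```

In real DLT the same rule is declared, not coded: `@dlt.expect_or_drop("positive_amount", "amount > 0")` in Python, or `CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW` in SQL, and violation counts surface in the pipeline event log.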
Q: How does Databricks pricing work?
You pay for two things: (1) DBUs (Databricks Units), billed per second based on cluster type, size, and features (Photon doubles the DBU rate); (2) cloud compute, the underlying VM cost from AWS/Azure/GCP. Serverless flips this: Databricks bundles DBUs and compute into one rate. Typical: a DS3_v2 worker with Photon on Jobs compute runs ~$1-1.50/hour all-in. Rule of thumb: use Jobs compute for ETL (cheapest), Serverless SQL for BI (most productive), and avoid all-purpose clusters for production (most expensive).
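The two-part bill is easy to model. A sketch with hypothetical rates (actual DBU prices vary by cloud, tier, and compute type):

```python
def job_cost(hours, dbu_per_hour, dbu_price, vm_per_hour):
    """Total job cost = DBU charges (paid to Databricks)
    + VM charges (paid to the cloud provider)."""
    return hours * (dbu_per_hour * dbu_price + vm_per_hour)

# Hypothetical nightly ETL: 3h run, cluster emitting 6 DBU/h at $0.15/DBU
# on Jobs compute, plus $1.00/h of underlying VMs.
nightly = job_cost(3, dbu_per_hour=6, dbu_price=0.15, vm_per_hour=1.00)
monthly = 30 * nightly
```

Running the same pipeline on an all-purpose cluster would roughly double the DBU rate for no benefit, which is why the rule above steers production ETL onto Jobs compute.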
Q: How do I read Snowflake data from Databricks?
Use the Snowflake Spark connector: spark.read.format("snowflake").options(**sf_options).option("query", "SELECT * FROM t").load(). Authenticate via OAuth or key-pair (never plain passwords). For bulk reads, unload with Snowflake's COPY INTO <location> to S3/ADLS and read the Parquet files directly, which is faster than pulling rows through the connector. For bidirectional sharing, use Iceberg catalog-linked databases so both engines see the same tables.
Q: Databricks Workflows vs Airflow?
Workflows: built-in, native integration, cheaper for Databricks-only pipelines; supports task values, retries, and conditional logic. Airflow: best when orchestrating across systems (dbt Cloud + Databricks + Snowflake + external APIs), mature alerting, rich operators. Rule: start with Workflows for Databricks-native pipelines; move to Airflow (or MWAA/Astronomer) once you orchestrate 3+ systems.
Q: How do I debug a slow query?
Checklist: (1) Query Profile: find the slowest stage and check for full file scans. (2) OPTIMIZE + ZORDER on the largest filtered columns. (3) Enable Photon if not already. (4) Check predicate pushdown: column stats may be stale, so run ANALYZE TABLE t COMPUTE STATISTICS. (5) Use broadcast hints for small dimension tables. (6) Enable the result cache. (7) For repeated scans, enable the disk cache (spark.databricks.io.cache.enabled = true).