Databricks SQL & Workflows Cheat Sheet

Intermediate • Last updated: 2026-04-16 • 6 sections

Databricks essentials: Delta Lake, Unity Catalog, SQL warehouses, workflows, Auto Loader, and performance tuning. Includes comparisons to Snowflake.

Delta Lake Essentials

-- Create managed Delta table (Unity Catalog)
CREATE TABLE main.sales.orders (
  order_id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(12,2),
  order_date DATE
)
USING DELTA
PARTITIONED BY (order_date);

-- MERGE (upsert)
MERGE INTO main.sales.orders t
USING staging.orders_new s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel — query data as of a timestamp or version
SELECT * FROM main.sales.orders TIMESTAMP AS OF '2026-04-01T00:00:00';
SELECT * FROM main.sales.orders VERSION AS OF 42;

-- Compact small files
OPTIMIZE main.sales.orders;

-- Z-ORDER for multi-column filter pruning
OPTIMIZE main.sales.orders ZORDER BY (customer_id, amount);

-- Vacuum old files (default 7-day retention)
VACUUM main.sales.orders RETAIN 168 HOURS;

-- Change Data Feed (CDC output)
ALTER TABLE main.sales.orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
SELECT * FROM table_changes('main.sales.orders', 42);  -- changes since version 42

Unity Catalog Access Model

Level | Grant Example | Use Case
Metastore | GRANT CREATE CATALOG ON METASTORE TO `admins` | Account-level admin
Catalog | GRANT USE CATALOG ON CATALOG main TO `analysts` | Business domain isolation
Schema | GRANT USE SCHEMA ON SCHEMA main.sales TO `sales_team` | Team boundary
Table | GRANT SELECT ON TABLE main.sales.orders TO `analysts` | Standard table access
Column | Dynamic views or column masks | Column-level PII protection
Row | Row filter function (similar to RLS) | Multi-tenant data isolation
Volume | GRANT READ VOLUME ON VOLUME main.raw.landing TO `etl_role` | Access to files in ADLS/S3

Auto Loader — Incremental File Ingestion

from pyspark.sql import functions as F
from pyspark.sql.functions import col

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/_schemas/orders")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("abfss://<container>@<account>.dfs.core.windows.net/orders/")
)

clean = (
    stream
    .filter(col("amount") > 0)
    .withColumn("ingest_ts", F.current_timestamp())
)

query = (
    clean.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/_checkpoints/orders")
    .option("mergeSchema", "true")
    .trigger(availableNow=True)  # process all new files, then stop
    .toTable("main.sales.orders")
)
query.awaitTermination()

SQL Warehouse Sizing & Tuning

  • Use Serverless SQL Warehouses for BI dashboards — sub-10s startup, auto-scale, no cluster management
  • Size by concurrent query count: 2XS handles ~10 concurrent light queries; M handles ~25; L handles ~50+
  • Enable Photon — vectorized query engine, 2-8x faster on SQL workloads
  • Enable multi-cluster load balancing: min=1, max=3 so spikes scale without over-provisioning
  • Auto-stop after 10-15 min idle for BI; 5 min for ad-hoc warehouses
  • Use Query History tab to find slow queries; Query Profile shows which stage is the bottleneck
  • Result cache is per-warehouse; disk cache (SSD) speeds repeated scans of the same Delta files
  • For ETL, prefer Jobs compute (Spark clusters) over SQL Warehouses — cheaper for long batch work
  • OPTIMIZE + ZORDER weekly on large tables improves pruning 5-20x
  • Monitor Lakeview dashboards or system.query.history table for cost attribution
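The sizing rules of thumb above can be sketched as a small helper. The thresholds mirror the approximate figures in this list; they are illustrative, not an official Databricks sizing formula:

```python
def pick_warehouse_size(concurrent_queries: int) -> str:
    """Map expected concurrent light queries to a warehouse T-shirt size,
    using the rough thresholds above (~10 for 2XS, ~25 for M, ~50 for L)."""
    if concurrent_queries <= 10:
        return "2X-Small"
    if concurrent_queries <= 25:
        return "Medium"
    if concurrent_queries <= 50:
        return "Large"
    # Past ~50, keep Large and let multi-cluster load balancing absorb the rest
    clusters = -(-concurrent_queries // 50)  # ceiling division
    return f"Large x{clusters} clusters"

print(pick_warehouse_size(8))    # 2X-Small
print(pick_warehouse_size(40))   # Large
print(pick_warehouse_size(120))  # Large x3 clusters
```

In practice you would also weigh query weight (heavy joins count for more than one "light" query) and let auto-scaling handle spikes rather than sizing for the peak.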

Databricks Workflows (Jobs)

{
  "name": "orders_daily_etl",
  "tasks": [
    {
      "task_key": "bronze_ingest",
      "notebook_task": { "notebook_path": "/etl/bronze_orders" },
      "job_cluster_key": "main_cluster"
    },
    {
      "task_key": "silver_clean",
      "depends_on": [{ "task_key": "bronze_ingest" }],
      "notebook_task": { "notebook_path": "/etl/silver_orders" },
      "job_cluster_key": "main_cluster"
    },
    {
      "task_key": "gold_aggregate",
      "depends_on": [{ "task_key": "silver_clean" }],
      "sql_task": {
        "query": { "query_id": "abc-123" },
        "warehouse_id": "xyz-456"
      }
    }
  ],
  "job_clusters": [{
    "job_cluster_key": "main_cluster",
    "new_cluster": {
      "spark_version": "15.4.x-scala2.12",
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 4,
      "runtime_engine": "PHOTON"
    }
  }],
  "schedule": { "quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC" },
  "email_notifications": { "on_failure": ["data-team@example.com"] }
}
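Workflows resolves depends_on into a run order before executing. The dependency resolution for the three tasks above can be sketched with a plain topological sort (standard-library Python, not Databricks code):

```python
from graphlib import TopologicalSorter

# Each task mapped to its depends_on list, as in the job spec above
tasks = {
    "bronze_ingest": [],
    "silver_clean": ["bronze_ingest"],
    "gold_aggregate": ["silver_clean"],
}

# static_order() yields a valid execution order respecting dependencies
order = list(TopologicalSorter(tasks).static_order())
print(order)  # ['bronze_ingest', 'silver_clean', 'gold_aggregate']
```

Workflows also runs independent tasks in parallel; a topological sort only fixes the relative order of dependent tasks.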

Databricks vs Snowflake — When to Pick Each

Q: Databricks vs Snowflake for a new analytics platform?

Snowflake wins when: workload is SQL-heavy (BI, dashboards, ELT), team is SQL-first, governance/security is top priority, data sharing is important, you want zero infrastructure to manage. Databricks wins when: workload is Spark/ML-heavy, you need open table formats across engines (Delta + Iceberg), you have large unstructured data (images, logs), data science and ML are core. Many teams use both: Databricks for ML + ingestion, Snowflake for analytics + BI, sharing Iceberg tables between them.

Q: Delta Lake vs Iceberg vs Snowflake native tables?

Delta Lake: best on Databricks, strong Spark integration, CDF, Liquid Clustering. Iceberg: open standard, multi-engine (Snowflake, Spark, Trino, Flink), best for cross-tool portability. Snowflake native: auto-clustering, time travel, fastest in Snowflake. Databricks can also expose Delta tables in Iceberg format via UniForm. Pick Iceberg for maximum portability, Delta for Databricks-first, native Snowflake for Snowflake-first.

Q: Photon vs classic Databricks runtime?

Photon is a C++ vectorized query engine that replaces parts of Spark SQL execution. 2-8x faster on SQL/DataFrame workloads, same API. Costs 2x DBU (not 2x dollars — typically ~20% net cheaper due to speedup). Enable on SQL warehouses (default) and job clusters running heavy SQL. Skip on pure ML/UDF workloads where Photon does not apply.
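The 2x-DBU-but-net-cheaper claim is simple arithmetic. A sketch with hypothetical numbers (the DBU rate, VM rate, and speedup below are illustrative assumptions, not published pricing):

```python
def job_cost(runtime_hours, dbu_per_hour, dbu_rate, vm_rate):
    """Total job cost = DBU charges + cloud VM charges (two line items)."""
    return runtime_hours * (dbu_per_hour * dbu_rate + vm_rate)

# Hypothetical baseline: 4-hour job, 3 DBU/hr at $0.15/DBU, $0.50/hr VM
classic = job_cost(4.0, 3.0, 0.15, 0.50)
# Photon: DBU rate doubles, but assume the job finishes 2x faster
photon = job_cost(4.0 / 2.0, 3.0, 0.30, 0.50)
print(f"classic ${classic:.2f}, photon ${photon:.2f}")
# classic ≈ $3.80, photon ≈ $2.80: net cheaper despite the 2x DBU rate
```

The VM cost is where the savings come from: it scales with runtime only, so any speedup above 2x makes Photon strictly cheaper, and even a 2x speedup saves money on the compute line item.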

Q: Serverless SQL warehouse vs classic SQL warehouse vs Jobs cluster?

Serverless SQL: sub-10s startup, fully managed by Databricks, best for BI. Classic SQL: runs in your cloud account VPC, longer startup, needed for VNet injection or private workloads. Jobs cluster: general Spark cluster for ETL notebooks/jars — cheapest for long batch runs. Use Serverless SQL for dashboards, Jobs for ETL, Classic only when compliance requires your VPC.

Frequently Asked Questions

How do I compact small files in Delta Lake?

Run OPTIMIZE table_name to combine files into ~1GB targets. Add ZORDER BY (col1, col2) to co-locate by frequent filter columns. Enable auto-compaction at write time: ALTER TABLE t SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true, delta.autoOptimize.optimizeWrite = true). In 2026 most teams use Liquid Clustering (CLUSTER BY) which removes the need for manual OPTIMIZE/ZORDER.
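A back-of-the-envelope for what compaction buys you, assuming the ~1GB target file size mentioned above (pure arithmetic, not Databricks code):

```python
import math

def files_after_optimize(total_size_gb: float, target_file_gb: float = 1.0) -> int:
    """OPTIMIZE rewrites a table's data into files near the target size,
    so the post-compaction count is roughly total size / target size."""
    return max(1, math.ceil(total_size_gb / target_file_gb))

# Hypothetical: 5,000 small streaming-ingest files totalling 50 GB
print(files_after_optimize(50))  # 50 files instead of 5,000
```

Fewer, larger files mean fewer file-open round trips per scan; Z-ordering (or Liquid Clustering) then determines which of those files a filtered query can skip entirely.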

What is the difference between managed and external tables in Unity Catalog?

Managed tables: Databricks owns the storage location, and DROP TABLE deletes the data. External tables: point to a user-provided path (abfss://, s3://), and DROP TABLE keeps the data. Managed is simpler and recommended unless other engines (Synapse, Snowflake, Trino) read the same files — then use external. External tables require an EXTERNAL LOCATION plus a storage credential in Unity Catalog.

Databricks Delta Live Tables (DLT) — when should I use it?

DLT is a declarative pipeline framework: you define transformations, and DLT manages dependency order, retries, expectations (data quality rules), and incremental refresh. Use DLT when: you have Spark streaming or incremental pipelines, you want built-in data quality tests (EXPECT clauses), or you value less boilerplate. Skip DLT when: your pipelines are pure SQL on Delta (plain Workflows are cheaper), or your team already uses dbt (DLT and dbt overlap).

How does Databricks cost work — DBUs vs compute?

You pay two things: (1) DBUs (Databricks Units) — billed per second based on cluster type + size + features (Photon doubles DBU rate). (2) Cloud compute — underlying VM cost from AWS/Azure/GCP. Serverless flips this: Databricks bundles DBUs + compute into one rate. Typical: a DS3_v2 worker with Photon on Jobs compute runs ~$1-1.50/hour all-in. Rule: use Jobs compute for ETL (cheapest), Serverless SQL for BI (most productive), avoid all-purpose clusters for production (most expensive).

How do I connect Databricks to Snowflake?

Use the Snowflake Spark connector: spark.read.format("snowflake").option("query", "SELECT * FROM t").load(), plus the connection options (account, warehouse, role). Authenticate via OAuth or key-pair (never plain passwords). For bulk reads, unload with COPY INTO an external stage on S3/ADLS and read the Parquet files, which is faster than pulling rows through the connector. For bidirectional sharing, use Iceberg catalog-linked databases so both engines see the same tables.

Should I use Databricks Workflows or Airflow?

Workflows: built-in, native integration, cheaper for Databricks-only pipelines, supports task values, retries, conditional logic. Airflow: best when orchestrating across systems (dbt Cloud + Databricks + Snowflake + external APIs), mature alerting, rich operators. Rule: start with Workflows for Databricks-native pipelines; move to Airflow (or MWAA/Astronomer) once you orchestrate 3+ systems.

How do I optimize slow Databricks SQL queries?

Checklist: (1) Query Profile: find the slowest stage and check for full file scans. (2) OPTIMIZE + ZORDER on the most-filtered columns of the largest tables. (3) Enable Photon if not already on. (4) Check predicate pushdown; column stats may be stale, so run ANALYZE TABLE t COMPUTE STATISTICS. (5) Use broadcast hints for small dimension tables. (6) Enable the result cache. (7) For repeated scans on classic compute, enable the disk cache (spark.databricks.io.cache.enabled = true).

Related Cheat Sheets

  • PySpark & Spark SQL Cheat Sheet
  • AWS for Data Engineers Cheat Sheet
  • Azure for Data Engineers Cheat Sheet
  • Snowflake Best Practices for Data Engineers