Azure for Data Engineers Cheat Sheet

Intermediate • Last updated: 2026-04-16 • 6 sections

Key Azure services for data engineering: ADLS Gen2, Data Factory, Synapse, Databricks, Functions, Purview, and common architecture patterns.

Core Azure Services for Data

| Service | Purpose | Closest AWS Equivalent |
| --- | --- | --- |
| ADLS Gen2 | Data lake storage with hierarchical namespace | S3 |
| Azure Data Factory (ADF) | Managed ETL/ELT orchestration | Glue + Step Functions |
| Synapse Analytics | MPP data warehouse + Spark pools + serverless SQL | Redshift + EMR + Athena |
| Azure Databricks | Managed Spark / Delta Lake | EMR / Databricks on AWS |
| Microsoft Fabric | Unified SaaS data platform (OneLake + Spark + Warehouse) | No direct equivalent |
| Event Hubs | Managed Kafka-compatible streaming | Kinesis / MSK |
| Functions | Serverless code triggers | Lambda |
| Cosmos DB | Multi-model NoSQL with global distribution | DynamoDB |
| Purview | Data governance, lineage, catalog | Lake Formation + Glue Catalog |
| Key Vault | Secrets / keys / certs | Secrets Manager + KMS |
| Entra ID (AAD) | Identity provider | IAM Identity Center |

ADLS Gen2 Zones & Partitioning

  • Enable hierarchical namespace at storage-account creation — cannot be added later; required for ACLs and Spark perf
  • Zone layout: landing/ → raw/ → curated/ → consumption/ — immutable promotion between zones
  • Partition Parquet by date (year=/month=/day=) first; low-cardinality filters (country) second
  • Use the ABFS driver (abfss://<container>@<account>.dfs.core.windows.net/<path>) in Spark — not the legacy WASB driver
  • Set lifecycle policies: move to Cool at 30 days, Archive at 180 days, delete raw at 365 days
  • Use ADLS ACLs (POSIX-style) for fine-grained access; RBAC for container-level access
  • Enable soft delete + versioning on critical containers for accidental-delete protection
  • Enable diagnostic logs to Log Analytics for access auditing; integrate with Defender for Storage
  • Use private endpoints + firewall rules — never leave storage accounts publicly accessible
  • For cross-region DR, use RA-GZRS replication + geo-redundant backup jobs
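
The tiering bullets above map directly to a lifecycle management policy on the storage account. A minimal sketch, assuming the raw zone lives under a `raw/` prefix (the rule name and prefix are illustrative):

```json
{
  "rules": [
    {
      "enabled": true,
      "name": "raw-zone-tiering",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["raw/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 180 },
            "delete": { "daysAfterModificationGreaterThan": 365 }
          }
        }
      }
    }
  ]
}
```

Apply via `az storage account management-policy create` or the portal; prefixes are matched as `container/path`, so scope each rule to the zone it should govern.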

ADF Pipeline — Copy + Transform Pattern

{
  "name": "pl_orders_daily",
  "properties": {
    "parameters": { "runDate": { "type": "String" } },
    "activities": [
      {
        "name": "Copy_OnPrem_to_Raw",
        "type": "Copy",
        "inputs": [{ "referenceName": "ds_onprem_orders", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "ds_raw_orders", "type": "DatasetReference",
          "parameters": { "dt": "@pipeline().parameters.runDate" } }],
        "typeProperties": {
          "source": { "type": "SqlServerSource",
            "sqlReaderQuery": "SELECT * FROM dbo.orders WHERE updated_at >= '@{pipeline().parameters.runDate}'" },
          "sink": { "type": "ParquetSink" },
          "enableStaging": true
        },
        "policy": { "retry": 3, "retryIntervalInSeconds": 30, "timeout": "02:00:00" }
      },
      {
        "name": "Transform_with_Databricks",
        "type": "DatabricksNotebook",
        "dependsOn": [{ "activity": "Copy_OnPrem_to_Raw", "dependencyConditions": ["Succeeded"] }],
        "typeProperties": {
          "notebookPath": "/etl/transform_orders",
          "baseParameters": { "run_date": "@pipeline().parameters.runDate" }
        },
        "linkedServiceName": { "referenceName": "ls_databricks", "type": "LinkedServiceReference" }
      }
    ]
  }
}
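
To run this pipeline daily, attach a trigger that feeds the window time into `runDate`. A sketch of a schedule trigger definition — the name and start time are illustrative:

```json
{
  "name": "tr_orders_daily",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2026-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "pl_orders_daily", "type": "PipelineReference" },
        "parameters": { "runDate": "@trigger().scheduledTime" }
      }
    ]
  }
}
```

`@trigger().scheduledTime` passes the full timestamp; wrap it in `formatDateTime()` if the pipeline expects a plain date.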

Synapse Serverless SQL Patterns

-- Query Parquet directly from ADLS (no ingestion)
SELECT
    country,
    SUM(amount) AS revenue,
    COUNT(*) AS orders
FROM OPENROWSET(
    BULK 'https://acct.dfs.core.windows.net/curated/orders/year=2026/**',
    FORMAT = 'PARQUET'
) AS r
GROUP BY country;

-- Create an external table over curated ADLS data (metadata only, no ingestion)
CREATE EXTERNAL TABLE dbo.orders (
    order_id BIGINT,
    customer_id BIGINT,
    country VARCHAR(2),
    amount DECIMAL(12,2),
    order_date DATE
)
WITH (
    LOCATION = 'curated/orders/',
    DATA_SOURCE = ds_curated,
    FILE_FORMAT = ff_parquet
);

-- Use CETAS to materialize aggregates (fast future reads)
CREATE EXTERNAL TABLE dbo.orders_monthly
WITH (LOCATION='curated/summary/orders_monthly/', FILE_FORMAT=ff_parquet, DATA_SOURCE=ds_curated)
AS
SELECT
    DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) AS order_month,
    country,
    SUM(amount) AS revenue
FROM dbo.orders
GROUP BY DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1), country;
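
Serverless SQL can also prune partitions using wildcards plus the `filepath()` function — only folders matching the predicate are scanned. A sketch against the same `year=/month=` layout (account and container names are illustrative):

```sql
SELECT
    r.filepath(1) AS [year],
    r.filepath(2) AS [month],
    SUM(r.amount) AS revenue
FROM OPENROWSET(
    BULK 'https://acct.dfs.core.windows.net/curated/orders/year=*/month=*/*.parquet',
    FORMAT = 'PARQUET'
) AS r
WHERE r.filepath(1) = '2026' AND r.filepath(2) IN ('01', '02')
GROUP BY r.filepath(1), r.filepath(2);
```

`filepath(n)` returns the value matched by the n-th wildcard, so predicates on it are pushed down to folder selection rather than applied after reading every file.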

Azure vs Snowflake on Azure

Q: Should I use Synapse dedicated pools or Snowflake on Azure?

Snowflake on Azure typically wins on: (1) Per-second billing with auto-suspend/auto-resume — Synapse dedicated pools bill per DWU-hour whenever they are running, regardless of utilization, and pause/resume is a manual or scheduled operation that takes minutes, not seconds. (2) Instant warehouse resize, zero-copy clones, time travel. (3) Broader BI ecosystem. Synapse wins on: (1) Tight out-of-the-box integration with ADF, Power BI Premium, and Purview. (2) Reserved-capacity discounts when fully utilized. (3) No cross-cloud egress if your estate is 100% Azure. In 2026, most teams pick Snowflake on Azure for analytics and use Synapse Pipelines/Serverless for orchestration and ad-hoc queries.
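
The auto-suspend point is a single line of warehouse DDL on the Snowflake side — a sketch, with an illustrative warehouse name and size:

```sql
CREATE WAREHOUSE IF NOT EXISTS wh_analytics
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60           -- suspend after 60 s idle; compute billing stops
  AUTO_RESUME = TRUE          -- the next query resumes it transparently
  INITIALLY_SUSPENDED = TRUE;
```

There is no Synapse dedicated-pool equivalent of `AUTO_RESUME`; resuming a paused pool is an explicit management operation.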

Q: Fabric OneLake vs Databricks Unity Catalog vs Snowflake Horizon?

OneLake: SaaS data lake tied to Fabric, auto-creates shortcut views across workspaces, good for Microsoft-centric orgs. Unity Catalog: Databricks governance across clouds, strong on Delta tables, lineage, fine-grained access. Snowflake Horizon: built into Snowflake, unifies governance + quality + discovery, plus data sharing marketplace. Pick by where your primary workloads run: Fabric if all-Microsoft, UC if Databricks-primary, Horizon if Snowflake-primary.

Q: When use Data Factory vs Synapse Pipelines vs Fabric Data Pipelines?

They are the same engine under different brandings. ADF: standalone, most mature, wider connector set. Synapse Pipelines: ADF embedded in Synapse workspace. Fabric Data Pipelines: ADF embedded in Fabric. Pick by where the rest of your stack lives — avoid mixing. For pure pipeline features (triggers, copy, data flows, orchestration) they are equivalent.

Security & Governance Baseline

  • Use Managed Identity everywhere — never store connection strings or keys in pipelines
  • Enable private endpoints for ADLS, Synapse, Databricks — no public internet traffic
  • Store secrets in Key Vault, reference via Linked Services in ADF/Synapse
  • Enable Defender for Cloud (Storage + SQL + Open-source databases)
  • Configure Purview scans on ADLS + Synapse + Databricks for lineage and classification
  • Enforce RBAC via Entra ID groups, not individual users
  • Enable diagnostic logs to Log Analytics + Sentinel for security monitoring
  • For cross-region DR use geo-redundant storage + paired regions + tested runbooks
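
The Key Vault bullet in practice: an ADF linked service can pull its connection string from Key Vault at runtime, so no secret ever lands in the pipeline definition. A sketch — the linked service and secret names are illustrative:

```json
{
  "name": "ls_sqlserver",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "ls_keyvault", "type": "LinkedServiceReference" },
        "secretName": "onprem-sql-connstring"
      }
    }
  }
}
```

The `ls_keyvault` linked service itself should authenticate with the factory's managed identity, granted get/list secret permissions in Key Vault.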

Frequently Asked Questions

How is Microsoft Fabric different from Synapse and ADF?

Fabric is the unified SaaS successor: OneLake (data lake) + Data Factory (pipelines) + Synapse (warehouse + Spark + real-time) + Power BI, all on a shared capacity-based SKU. Synapse and ADF continue to exist for customers who need PaaS control (VNet injection, specific SKUs). Rule of thumb: Fabric for greenfield Microsoft-centric analytics, Synapse/ADF when you need fine-grained networking/compute control or integration with non-Microsoft tools.

ADLS Gen2 vs Gen1 vs Blob Storage?

Gen1 was retired in February 2024 — any remaining workloads must be migrated. Blob Storage is the underlying object store — cheap, flat namespace, no ACLs. ADLS Gen2 is Blob Storage with hierarchical namespace enabled — it adds directories, ACLs, and much better Spark/Hadoop performance. Always use Gen2 for data lakes. Hierarchical namespace must be enabled at account creation (it cannot be added later).

Databricks on Azure vs Azure Synapse Spark Pools?

Databricks wins on: Delta Lake, MLflow, Unity Catalog, auto-scaling, broader Spark feature set, Photon engine. Synapse Spark wins on: integration with Synapse SQL (single workspace), no separate billing, included in Synapse capacity. Most serious Spark teams pick Databricks; Synapse Spark is fine for light integration scenarios already in Synapse.

How do I ingest streaming data into Azure?

Options: (1) Event Hubs + Stream Analytics for SQL-like stream processing with an ADLS sink. (2) Event Hubs + Databricks Structured Streaming for complex streams into Delta Lake. (3) Event Hubs' Kafka-compatible endpoint + the Snowflake Kafka connector (Snowpipe Streaming) for sub-10s ingestion into Snowflake. (4) IoT Hub for device telemetry, routed on to Event Hubs. Pick based on destination and processing complexity.
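
Options (2) and (3) both rely on Event Hubs' Kafka-compatible endpoint (port 9093, SASL PLAIN with the connection string as the password). A hedged sketch of the Kafka options you would hand to Spark Structured Streaming on Databricks — namespace, hub name, and connection string are placeholders, and in real code the connection string comes from Key Vault:

```python
# Kafka client options for Azure Event Hubs' Kafka-compatible endpoint.
# All names below are illustrative placeholders.
EH_NAMESPACE = "myns"            # Event Hubs namespace
EH_NAME = "orders"               # event hub = Kafka topic
CONN_STR = "Endpoint=sb://..."   # fetch from Key Vault in real code

kafka_options = {
    "kafka.bootstrap.servers": f"{EH_NAMESPACE}.servicebus.windows.net:9093",
    "subscribe": EH_NAME,
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    # Databricks ships a shaded Kafka client; on vanilla Spark drop the
    # "kafkashaded." prefix from the login-module class name.
    "kafka.sasl.jaas.config": (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f'required username="$ConnectionString" password="{CONN_STR}";'
    ),
}

# On a Databricks cluster you would then stream into Delta, e.g.:
# df = spark.readStream.format("kafka").options(**kafka_options).load()
# (df.writeStream.format("delta")
#    .option("checkpointLocation", "abfss://chk@acct.dfs.core.windows.net/orders")
#    .toTable("raw.orders"))
```

The literal username `$ConnectionString` is how Event Hubs signals connection-string auth over SASL PLAIN; only the password carries the actual secret.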

What is the right orchestration tool on Azure?

ADF / Synapse Pipelines for most data ETL — cheap, no infra, rich connectors. Azure Data Factory managed Airflow for teams needing Airflow DAGs (dbt Cloud, cross-cloud, complex Python). Logic Apps for business process flows (approvals, emails, SaaS connectors). Azure Functions Durable for lightweight workflows. Use ADF as the default and only move to Airflow when you have existing DAGs or need Python-heavy orchestration.

How do I set up least-privilege for an Azure data lake?

Layer: (1) RBAC at the storage account / container: Storage Blob Data Reader / Contributor. (2) ACLs on specific paths for finer control. (3) Entra ID groups (never individual users). (4) Managed Identity for services (ADF, Databricks, Synapse) — never shared keys. (5) Azure Policy to prevent creation of public containers. (6) Defender for Storage to detect anomalies. Purview for auto-classification.
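
Layer (5) can be a small custom Azure Policy (the built-in "Storage accounts should prevent public blob access" policy covers the same ground). A sketch of the policy rule, assuming the standard `allowBlobPublicAccess` alias:

```json
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
      { "field": "Microsoft.Storage/storageAccounts/allowBlobPublicAccess", "notEquals": false }
    ]
  },
  "then": { "effect": "deny" }
}
```

Assign it at the subscription or management-group scope so no team can create a publicly readable account, regardless of their RBAC rights.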

Can I use Snowflake with Azure efficiently?

Yes — Snowflake runs natively in Azure regions. Pattern: ADLS as landing zone → Snowflake external stage (via SAS token or, preferably, a storage integration) → Snowpipe for auto-ingest → dbt or Streams/Tasks for transformation. Use Entra ID federation so Snowflake logins use the same identities as Azure. Azure Private Link between your VNet and Snowflake keeps the traffic off the public internet.
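
That pattern sketched in Snowflake SQL — the integration, stage, pipe, and table names are illustrative, and the storage and notification integrations must already exist:

```sql
-- External stage over the ADLS landing zone (auth via a storage integration)
CREATE STAGE IF NOT EXISTS stg_landing_orders
  URL = 'azure://acct.blob.core.windows.net/landing/orders/'
  STORAGE_INTEGRATION = azure_adls_int
  FILE_FORMAT = (TYPE = PARQUET);

-- Auto-ingest pipe fed by Event Grid notifications
CREATE PIPE IF NOT EXISTS pipe_orders
  AUTO_INGEST = TRUE
  INTEGRATION = 'AZURE_EVENTGRID_INT'
AS
  COPY INTO raw.orders
  FROM @stg_landing_orders
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

Note that Snowflake's Azure stage URLs use the `blob.core.windows.net` endpoint even when the account is ADLS Gen2, and `MATCH_BY_COLUMN_NAME` maps Parquet columns to table columns by name.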

Related Cheat Sheets

  • AWS for Data Engineers Cheat Sheet
  • Databricks SQL & Workflows Cheat Sheet
  • Python for Data Engineers Cheat Sheet
  • Apache Airflow Cheat Sheet