Intermediate • Last updated: 2026-04-16 • 6 sections
Key Azure services for data engineering: ADLS Gen2, Data Factory, Synapse, Databricks, Functions, Purview, and common architecture patterns.
| Service | Purpose | Closest AWS Equivalent |
|---|---|---|
| ADLS Gen2 | Data lake storage with hierarchical namespace | S3 |
| Azure Data Factory (ADF) | Managed ETL/ELT orchestration | Glue + Step Functions |
| Synapse Analytics | MPP data warehouse + Spark pools + serverless SQL | Redshift + EMR + Athena |
| Azure Databricks | Managed Spark / Delta Lake | EMR / Databricks on AWS |
| Microsoft Fabric | Unified SaaS data platform (OneLake + Spark + Warehouse) | no direct equivalent |
| Event Hubs | Managed Kafka-compatible streaming | Kinesis / MSK |
| Functions | Serverless code triggers | Lambda |
| Cosmos DB | Multi-model NoSQL with global distribution | DynamoDB |
| Purview | Data governance, lineage, catalog | Lake Formation + Glue Catalog |
| Key Vault | Secrets / keys / certs | Secrets Manager + KMS |
| Entra ID (AAD) | Identity provider | IAM Identity Center |
```json
{
  "name": "pl_orders_daily",
  "properties": {
    "parameters": { "runDate": { "type": "String" } },
    "activities": [
      {
        "name": "Copy_OnPrem_to_Raw",
        "type": "Copy",
        "inputs": [{ "referenceName": "ds_onprem_orders", "type": "DatasetReference" }],
        "outputs": [{
          "referenceName": "ds_raw_orders",
          "type": "DatasetReference",
          "parameters": { "dt": "@pipeline().parameters.runDate" }
        }],
        "typeProperties": {
          "source": {
            "type": "SqlServerSource",
            "sqlReaderQuery": "SELECT * FROM dbo.orders WHERE updated_at >= '@{pipeline().parameters.runDate}'"
          },
          "sink": { "type": "ParquetSink" },
          "enableStaging": true
        },
        "policy": { "retry": 3, "retryIntervalInSeconds": 30, "timeout": "02:00:00" }
      },
      {
        "name": "Transform_with_Databricks",
        "type": "DatabricksNotebook",
        "dependsOn": [{ "activity": "Copy_OnPrem_to_Raw", "dependencyConditions": ["Succeeded"] }],
        "typeProperties": {
          "notebookPath": "/etl/transform_orders",
          "baseParameters": { "run_date": "@pipeline().parameters.runDate" }
        },
        "linkedServiceName": { "referenceName": "ls_databricks", "type": "LinkedServiceReference" }
      }
    ]
  }
}
```

Query Parquet directly from ADLS with serverless SQL (no ingestion):
```sql
SELECT
    country,
    SUM(amount) AS revenue,
    COUNT(*)    AS orders
FROM OPENROWSET(
    BULK 'https://acct.dfs.core.windows.net/curated/orders/year=2026/**',
    FORMAT = 'PARQUET'
) AS r
GROUP BY country;

-- Create an external table over the curated files in ADLS
CREATE EXTERNAL TABLE dbo.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2),
    country     VARCHAR(2),   -- assumed present in the Parquet files (queried above)
    order_date  DATE
)
WITH (
    LOCATION = 'curated/orders/',
    DATA_SOURCE = ds_curated,
    FILE_FORMAT = ff_parquet
);

-- Use CETAS to materialize aggregates (fast future reads)
CREATE EXTERNAL TABLE dbo.orders_monthly
WITH (LOCATION = 'curated/summary/orders_monthly/', FILE_FORMAT = ff_parquet, DATA_SOURCE = ds_curated)
AS
SELECT DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) AS order_month,
       country,
       SUM(amount) AS revenue
FROM dbo.orders
GROUP BY DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1), country;
```

Q: Should I use Synapse dedicated pools or Snowflake on Azure?
Snowflake on Azure typically wins on: (1) Per-second billing and auto-suspend (Synapse dedicated pools bill by the hour whenever they are running, have no auto-suspend, and pause/resume must be scripted and takes minutes rather than seconds). (2) Instant warehouse resize, zero-copy clones, time travel. (3) Broader BI ecosystem. Synapse wins on: (1) Tight integration with ADF, Power BI Premium, and Purview out of the box. (2) Reserved-capacity discounts when fully utilized. (3) No cross-cloud egress if your estate is 100% Azure. In 2026, most teams pick Snowflake on Azure for analytics and use Synapse Pipelines/serverless SQL for orchestration and ad-hoc queries.
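The billing difference is easiest to see with toy numbers. A sketch in plain Python (all rates, durations, and utilization figures below are hypothetical, chosen only to illustrate the per-second-plus-auto-suspend vs always-on-pool contrast):

```python
# Toy comparison of warehouse billing models (all prices hypothetical).
# Snowflake-style: per-second billing, auto-suspend when idle.
# Synapse-dedicated-style: billed for every hour the pool is running;
# pause/resume is a scripted operation, not automatic.

def per_second_cost(busy_seconds_per_day: int, rate_per_hour: float) -> float:
    """Pay only for seconds actually running (auto-suspend when idle)."""
    return busy_seconds_per_day / 3600 * rate_per_hour

def always_on_cost(hours_running_per_day: int, rate_per_hour: float) -> float:
    """Pay for every hour the pool is up, busy or not."""
    return hours_running_per_day * rate_per_hour

RATE = 3.0  # $/hour, hypothetical
# 2 hours of actual query time spread across a business day:
snowflake_style = per_second_cost(busy_seconds_per_day=2 * 3600, rate_per_hour=RATE)
# Pool left running 12 h/day because pause/resume is manual:
synapse_style = always_on_cost(hours_running_per_day=12, rate_per_hour=RATE)

print(f"per-second + auto-suspend: ${snowflake_style:.2f}/day")  # $6.00/day
print(f"always-on pool:            ${synapse_style:.2f}/day")    # $36.00/day
```

The gap closes as utilization rises, which is exactly why reserved capacity favors Synapse only when the pool is genuinely busy around the clock.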
Q: Fabric OneLake vs Databricks Unity Catalog vs Snowflake Horizon?
OneLake: SaaS data lake tied to Fabric, auto-creates shortcut views across workspaces, good for Microsoft-centric orgs. Unity Catalog: Databricks governance across clouds, strong on Delta tables, lineage, fine-grained access. Snowflake Horizon: built into Snowflake, unifies governance + quality + discovery, plus data sharing marketplace. Pick by where your primary workloads run: Fabric if all-Microsoft, UC if Databricks-primary, Horizon if Snowflake-primary.
Q: When use Data Factory vs Synapse Pipelines vs Fabric Data Pipelines?
They are the same engine under different branding. ADF: standalone, most mature, widest connector set. Synapse Pipelines: ADF embedded in a Synapse workspace. Fabric Data Pipelines: ADF embedded in Fabric. Pick by where the rest of your stack lives, and avoid mixing them. For core pipeline features (triggers, copy activity, data flows, orchestration) they are equivalent.
Q: How does Microsoft Fabric relate to Synapse and ADF?
Fabric is the unified SaaS successor: OneLake (data lake) + Data Factory (pipelines) + Synapse (warehouse + Spark + real-time) + Power BI, all on a shared capacity-based SKU. Synapse and ADF continue to exist for customers who need PaaS control (VNet injection, specific SKUs). Rule of thumb: Fabric for greenfield Microsoft-centric analytics, Synapse/ADF when you need fine-grained networking/compute control or integration with non-Microsoft tools.
Q: ADLS Gen1 vs Gen2 vs Blob Storage?
Gen1 is retired (the February 2024 migration deadline has passed). Blob Storage is the underlying object store — cheap, flat namespace, no POSIX ACLs. ADLS Gen2 is Blob Storage with the hierarchical namespace enabled — it adds real directories, POSIX-style ACLs, and much better Spark/Hadoop performance. Always use Gen2 for data lakes. Enable the hierarchical namespace at account creation; an in-place upgrade from a flat-namespace account exists but is one-way and has limitations, so plan for Gen2 up front.
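The Spark/Hadoop performance point comes down largely to directory operations. A pure-Python sketch (no Azure involved; the paths are made up): on a flat namespace, "renaming a directory" means rewriting every blob key under a prefix, while a hierarchical namespace makes it one metadata operation.

```python
# Why hierarchical namespace matters: simulate renaming "raw/" to "staged/".

# Flat namespace (plain Blob Storage): directories are just key prefixes,
# so a rename touches every object under the prefix -- one op per blob.
flat_store = {
    "raw/orders/part-0001.parquet": b"...",
    "raw/orders/part-0002.parquet": b"...",
    "raw/customers/part-0001.parquet": b"...",
}

def flat_rename_dir(store: dict, old: str, new: str) -> int:
    """Rename a 'directory' on a flat store: copy + delete each blob."""
    moved = 0
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)
        moved += 1
    return moved  # scales with file count

ops = flat_rename_dir(flat_store, "raw/", "staged/")
print(ops)  # 3

# Hierarchical namespace (ADLS Gen2): a real directory tree, so the same
# rename is a single atomic pointer update regardless of file count.
hns_store = {"raw": {"orders": ["part-0001.parquet", "part-0002.parquet"],
                     "customers": ["part-0001.parquet"]}}
hns_store["staged"] = hns_store.pop("raw")  # one O(1) operation
print(sorted(hns_store))  # ['staged']
```

Spark job commits rename temporary output directories constantly, which is why this difference dominates at scale.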
Q: Databricks or Synapse Spark?
Databricks wins on: Delta Lake, MLflow, Unity Catalog, auto-scaling, broader Spark feature set, Photon engine. Synapse Spark wins on: integration with Synapse SQL (single workspace), no separate billing, included in Synapse capacity. Most serious Spark teams pick Databricks; Synapse Spark is fine for light integration scenarios already in Synapse.
Q: What are the streaming ingestion options on Azure?
Options: (1) Event Hubs + Stream Analytics for SQL-like stream processing with an ADLS sink. (2) Event Hubs + Databricks Structured Streaming for complex streams into Delta Lake. (3) Event Hubs' Kafka-compatible endpoint + the Snowflake Kafka connector (Snowpipe Streaming) for sub-10s ingestion into Snowflake. (4) IoT Hub for device telemetry, which routes to Event Hubs. Pick based on destination and processing complexity.
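Option (3) works because Event Hubs exposes a Kafka-compatible endpoint on port 9093 using SASL/PLAIN with the literal username `$ConnectionString`. A sketch of the client configuration as a Python dict in librdkafka style (as used by e.g. `confluent-kafka`; Java clients use `sasl.jaas.config` instead; the namespace, policy name, and key below are placeholders):

```python
# Kafka client config for Azure Event Hubs' Kafka-compatible endpoint.
# Microsoft's documented pattern: SASL/PLAIN, username is the literal
# string "$ConnectionString", password is the Event Hubs connection string.
# "mynamespace", "send-policy", and "<key>" are placeholders.
eventhubs_kafka_config = {
    "bootstrap.servers": "mynamespace.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://mynamespace.servicebus.windows.net/;"
                     "SharedAccessKeyName=send-policy;SharedAccessKey=<key>",
}
# Usage (requires the confluent-kafka package and a real namespace):
#   producer = confluent_kafka.Producer(eventhubs_kafka_config)
# The Kafka topic name maps 1:1 to the event hub name.
```

This is what lets Kafka-native tools (the Snowflake connector, Kafka Connect, kcat) talk to Event Hubs with no code changes.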
Q: What should I use for orchestration?
ADF / Synapse Pipelines for most data ETL — cheap, no infra, rich connectors. ADF's managed Airflow for teams needing Airflow DAGs (dbt Cloud, cross-cloud, complex Python). Logic Apps for business process flows (approvals, emails, SaaS connectors). Azure Durable Functions for lightweight workflows. Use ADF as the default and only move to Airflow when you have existing DAGs or need Python-heavy orchestration.
Q: How do I secure an ADLS Gen2 data lake?
Layer your controls: (1) RBAC at the storage account / container level: Storage Blob Data Reader / Contributor. (2) ACLs on specific paths for finer control. (3) Entra ID groups (never individual users). (4) Managed identities for services (ADF, Databricks, Synapse) — never shared keys. (5) Azure Policy to prevent creation of public containers. (6) Defender for Storage to detect anomalies; Purview for auto-classification.
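Layers (1) and (2) interact in a documented order: Azure evaluates RBAC role assignments first, and only falls back to path ACLs when no role grants the access. A toy model of that precedence (the group names, roles-as-strings, and paths are hypothetical; real evaluation also involves deny assignments and per-segment execute bits, omitted here):

```python
# Toy model of ADLS Gen2 authorization order: RBAC first, then ACLs.
# Groups, role sets, and paths below are hypothetical examples.

RBAC = {  # group -> data-plane roles held at container scope
    "grp-data-engineers": {"Storage Blob Data Contributor"},
    "grp-analysts": set(),  # no container-wide role; rely on path ACLs
}
ACLS = {  # path prefix -> groups granted read via POSIX-style ACLs
    "curated/orders/": {"grp-analysts"},
}
READ_ROLES = {"Storage Blob Data Reader", "Storage Blob Data Contributor"}

def can_read(group: str, path: str) -> bool:
    # 1. RBAC is evaluated first: a data-plane role grants access to the
    #    whole container, and ACLs are never consulted.
    if RBAC.get(group, set()) & READ_ROLES:
        return True
    # 2. Otherwise fall back to path ACLs.
    return any(path.startswith(p) and group in who for p, who in ACLS.items())

print(can_read("grp-data-engineers", "raw/secrets/"))    # True, via RBAC
print(can_read("grp-analysts", "curated/orders/2026/"))  # True, via ACL
print(can_read("grp-analysts", "raw/secrets/"))          # False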
Q: Can I use Snowflake with an otherwise-Azure stack?
Yes — Snowflake runs natively in Azure regions. Pattern: ADLS as landing zone → Snowflake external stage (SAS token or a storage integration) → Snowpipe for auto-ingest → dbt or Streams/Tasks for transformation. Use Entra ID federation so Snowflake logins use the same identities as Azure. Private Link between ADF and Snowflake keeps the traffic off the public internet.
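That landing-zone pattern can be sketched as Snowflake DDL. All names below (account, container, integration, target table) are placeholders, and Azure auto-ingest additionally requires an Event Grid notification integration, which is only referenced here, not created:

```sql
-- Hypothetical names throughout. The Event Grid notification integration
-- needed for AUTO_INGEST on Azure is assumed to exist as NOTIF_INT.
CREATE STORAGE INTEGRATION azure_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<tenant-id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://acct.blob.core.windows.net/landing/');

CREATE STAGE landing_stage
  STORAGE_INTEGRATION = azure_int
  URL = 'azure://acct.blob.core.windows.net/landing/orders/'
  FILE_FORMAT = (TYPE = PARQUET);

CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE
  INTEGRATION = 'NOTIF_INT'
  AS COPY INTO raw.orders
     FROM @landing_stage
     MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

A storage integration (service-principal based) is generally preferable to SAS tokens because nothing secret lands in DDL and access is revocable from the Azure side.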