AWS for Data Engineers Cheat Sheet

Intermediate • Last updated: 2026-04-16 • 6 sections

Essential AWS services for data engineering: S3, Glue, Lambda, Step Functions, Athena, IAM, and common architecture patterns with cost tips.

Core Services Map

| Service | Purpose | When to Use |
| --- | --- | --- |
| S3 | Object storage (the data lake) | All raw data, Parquet/Iceberg tables, intermediate files |
| Glue Data Catalog | Hive-compatible metastore | Central schema store for Athena, EMR, Redshift Spectrum |
| Glue ETL | Serverless Spark jobs | PySpark ETL when you want AWS to manage the clusters |
| Glue Crawlers | Auto-discover schemas | New sources where schema is unknown / evolving |
| Lambda | Short-lived (<15 min) serverless functions | S3 triggers, API handlers, lightweight ETL |
| Step Functions | Serverless workflow orchestration | ETL DAGs without managing Airflow |
| Athena | Serverless SQL on S3 | Ad-hoc queries, $5/TB scanned |
| Redshift / Redshift Serverless | Cloud data warehouse | BI, large analytics workloads |
| EMR / EMR Serverless | Managed Hadoop/Spark | Heavy Spark jobs, full cluster control |
| MSK / Kinesis | Streaming (Kafka / proprietary) | Real-time ingestion pipelines |
| IAM | Identity & access control | Everything — least-privilege roles, service roles |
| Secrets Manager | Credential storage with rotation | DB passwords, API keys |

S3 Cost & Performance Patterns

  • Use Parquet + Snappy compression — typically ~10x smaller than CSV and far faster to query
  • Partition keys: date first (year=/month=/day=) then low-cardinality filters — enables partition pruning in Athena/Spark
  • Avoid thousands of tiny files (< 64MB each) — compaction is mandatory; target 128MB-1GB per file
  • Enable S3 Intelligent-Tiering for unknown access patterns — auto-moves cold data to cheaper tiers
  • S3 lifecycle rules: move to Glacier after 90 days, delete raw logs after 365 days
  • Enable S3 Access Logging + GuardDuty for exfiltration detection
  • Use S3 Select / Athena for predicate pushdown instead of scanning full files
  • Multi-part upload for files > 100MB; single PUT limit is 5GB
  • Cross-region replication for DR; S3 Batch Operations for mass reprocessing
  • Request rate: 3500 PUT and 5500 GET per prefix per second — shard prefixes for higher throughput
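The date-first partition layout from the list above can be sketched as a small helper (the function name and bucket path are illustrative, not an AWS API):

```python
from datetime import date

def partition_prefix(base: str, d: date) -> str:
    # Hive-style, date-first layout (year=/month=/day=) so that Athena and
    # Spark can prune partitions when queries filter on the date columns.
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print(partition_prefix("s3://raw-zone/orders", date(2026, 4, 16)))
# s3://raw-zone/orders/year=2026/month=04/day=16/
```

Writers and readers must agree on this layout: the job that lands data builds keys with it, and downstream queries filter on `year`/`month`/`day` to trigger pruning.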

Glue ETL Job Skeleton (PySpark)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'run_date'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Data Catalog (pushdown predicate for partition pruning)
src = glueContext.create_dynamic_frame.from_catalog(
    database="raw", table_name="orders",
    push_down_predicate=f"dt = '{args['run_date']}'"
)

df = src.toDF()
clean = (
    df.filter(F.col("amount") > 0)
      .withColumn("dt", F.lit(args['run_date']))
)

# Write to curated zone as Parquet
clean.write.mode("append").partitionBy("dt").parquet(
    "s3://analytics-curated/orders/"
)

job.commit()

Step Functions ETL Pattern

{
  "Comment": "Daily orders ETL",
  "StartAt": "Extract",
  "States": {
    "Extract": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "extract-orders", "Arguments": { "--run_date.$": "$.run_date" } },
      "Retry": [{ "ErrorEquals": ["States.ALL"], "IntervalSeconds": 30, "MaxAttempts": 3, "BackoffRate": 2.0 }],
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }],
      "Next": "Transform"
    },
    "Transform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "transform-orders" },
      "Next": "Load"
    },
    "Load": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "load-to-snowflake", "Payload.$": "$" },
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": { "TopicArn": "arn:aws:sns:us-east-1:123:etl-alerts", "Message.$": "$" },
      "End": true
    }
  }
}
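The Retry block above backs off multiplicatively: each wait is IntervalSeconds times BackoffRate raised to the number of prior attempts. A quick sketch of the waits it produces (helper name is illustrative):

```python
def retry_waits(interval_seconds: int, backoff_rate: float, max_attempts: int) -> list[float]:
    # Step Functions waits IntervalSeconds * BackoffRate**(attempt - 1)
    # seconds before retry attempt 1, 2, ..., MaxAttempts.
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]

print(retry_waits(30, 2.0, 3))  # [30.0, 60.0, 120.0]
```

So the Extract state above waits 30s, 60s, then 120s before giving up and falling through to the Catch → NotifyFailure path.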

IAM Least-Privilege Example

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListRawOrdersPrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::raw-zone",
      "Condition": { "StringLike": { "s3:prefix": ["orders/*"] } }
    },
    {
      "Sid": "ReadRawOrdersOnly",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::raw-zone/orders/*"
    },
    {
      "Sid": "WriteCuratedOrdersOnly",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::curated-zone/orders/*"
    },
    {
      "Sid": "GlueCatalogRead",
      "Effect": "Allow",
      "Action": ["glue:GetTable", "glue:GetPartitions", "glue:GetDatabase"],
      "Resource": "*"
    }
  ]
}

Cost Traps to Avoid

Q: Why is my Athena bill huge?

Athena charges $5 per TB scanned. Huge bills almost always mean scanning unpartitioned data, or SELECT * queries that defeat Parquet's column pruning. Fix: (1) Partition by date and filter on partition columns. (2) SELECT only the columns you need. (3) Convert CSV to Parquet. (4) Use CTAS to create partitioned summary tables. (5) Set a per-query data-scan limit (BytesScannedCutoffPerQuery) on the workgroup. Typical reduction: 10-100x.
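A rough back-of-envelope for the $5/TB pricing (decimal TB, ignoring the small per-query scan minimum; function name is illustrative):

```python
def athena_cost_usd(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    # Athena bills per TB of data scanned; this uses decimal TB (10^12 bytes)
    # as a rough estimate and ignores the per-query minimum.
    return bytes_scanned / 1e12 * price_per_tb

print(athena_cost_usd(2_000_000_000_000))  # scanning 2 TB ≈ $10
```

Plugging in a query's "Data scanned" figure from the Athena console makes the partitioning/column-pruning savings concrete: cutting a 2 TB scan to 20 GB drops the query from ~$10 to ~$0.10.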

Q: S3 costs exploding on a data lake?

Usually four causes: (1) Too many small files — S3 charges per request, and per-file Parquet metadata overhead adds up. Compact to 128MB+ files. (2) Storage class not tuned — use Intelligent-Tiering for ambiguous access patterns. (3) Old raw data never deleted — add lifecycle rules: raw → Glacier at 90d, delete at 365d. (4) Cross-region data transfer at ~$0.02/GB — keep compute in the same region as the data.
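Sizing a compaction job's output from cause (1) can be sketched as below (the 512 MB default is an assumed target inside the 128MB-1GB range recommended earlier; the helper name is illustrative):

```python
import math

def compacted_file_count(total_bytes: int, target_bytes: int = 512 * 1024**2) -> int:
    # Number of output files a compaction job should aim for, given the
    # dataset size and a target file size (default 512 MB, an assumption).
    return max(1, math.ceil(total_bytes / target_bytes))

print(compacted_file_count(100 * 1024**3))  # 100 GB -> 200 files at 512 MB each
```

In Spark you would typically feed this number to `df.repartition(n)` (or `coalesce(n)`) before writing, so the job emits a few well-sized Parquet files instead of thousands of tiny ones.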

Q: Lambda + S3 event storm?

S3 event notifications can fan out to Lambda at huge volume when many files are uploaded at once. Mitigate: (1) Put SQS between S3 and Lambda for batching and backpressure. (2) Set Lambda reserved concurrency to cap parallelism. (3) Tune the SQS event source mapping's batch size and batching window. (4) For bulk ETL, prefer EventBridge Scheduler + a Glue job over per-file Lambda invocations.
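A minimal sketch of pattern (1): a Lambda handler consuming an SQS batch of S3 event notifications, returning partial failures so only bad messages are retried. This assumes ReportBatchItemFailures is enabled on the event source mapping; `process` is a hypothetical placeholder for your per-object work:

```python
import json

def process(key: str) -> None:
    # Hypothetical placeholder for per-object work (parse, transform, load).
    print(f"processing {key}")

def handler(event, context=None):
    # Each SQS record's body is an S3 event notification payload.
    # Returning batchItemFailures makes SQS redeliver only the failed
    # messages instead of the whole batch.
    failures = []
    for record in event["Records"]:
        try:
            body = json.loads(record["body"])
            for s3_event in body.get("Records", []):
                process(s3_event["s3"]["object"]["key"])
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

With a batch size of, say, 10 and a batching window of a few seconds, a burst of thousands of uploads becomes a bounded stream of batched invocations instead of one Lambda per object.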

Frequently Asked Questions

When should I use Glue vs EMR vs Lambda for ETL?

Lambda: < 15 min runtime, < 10GB memory, event-driven (S3 trigger, API). Glue ETL: serverless Spark, no cluster management, best for scheduled ETL with moderate data. EMR Serverless: larger/longer Spark jobs, more tuning control, needs custom libraries or specific Spark versions. EMR on EC2: full control, long-running clusters, best for continuous streaming or Hadoop ecosystem (HBase, Presto).

Athena vs Redshift Spectrum vs Snowflake on AWS?

Athena: serverless, $5/TB scanned, best for ad-hoc queries on S3. Redshift Spectrum: query S3 from Redshift, same pricing, integrates with Redshift tables. Snowflake on AWS: separate compute/storage, auto-clustering, result cache, best performance and governance but higher per-query cost. Pick Athena for infrequent ad-hoc, Snowflake for production analytics with BI tools, Spectrum if already on Redshift.

How do I secure data in an S3 data lake?

Layer defenses: (1) Bucket policies + IAM least-privilege roles. (2) Block all public access at account level. (3) Default encryption SSE-KMS with customer-managed keys. (4) Enable CloudTrail + S3 Access Logs + GuardDuty S3 protection. (5) Use Lake Formation for column/row-level access control on Glue Catalog. (6) VPC endpoints so traffic never leaves AWS network. (7) Rotate IAM keys, prefer IAM Roles over keys.
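A common companion to layers (1)-(3) is a bucket-policy statement that denies any request not made over TLS (the bucket name my-lake is illustrative):

```json
{
  "Sid": "DenyInsecureTransport",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::my-lake",
    "arn:aws:s3:::my-lake/*"
  ],
  "Condition": { "Bool": { "aws:SecureTransport": "false" } }
}
```

Because it is an explicit Deny, it overrides any Allow elsewhere, so even a misconfigured role cannot read the lake over plain HTTP.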

How should I orchestrate AWS ETL — Step Functions, Airflow, or Glue Workflows?

Step Functions: best for pure AWS-service orchestration (Glue + Lambda + SNS), cheapest, no infra to manage. MWAA (managed Airflow): best for complex DAGs, cross-cloud, rich operator ecosystem, but costs $300-1000/mo baseline. Glue Workflows: tight integration with Glue jobs but limited outside Glue. Use Step Functions for simple-to-medium pipelines, MWAA when you need cross-cloud or dbt Cloud integration.

Kinesis Data Streams vs MSK vs SQS?

Kinesis Data Streams: AWS-native, shard-based, auto-scaling variant available, simpler ops. MSK: managed Kafka, open protocol, best if you have existing Kafka ecosystem (Kafka Connect, ksqlDB, Schema Registry). SQS: queue not stream, no ordering across partitions, no replay, best for task queues not analytics. Pick Kinesis for new AWS-native streaming, MSK for Kafka compatibility, SQS only for work queues.

What is the minimum IAM setup for an AWS data engineer?

Use IAM Identity Center (SSO), no long-lived access keys. Federate via your IdP (Okta, Azure AD). Create permission sets per role: DataEngineer (read all raw, write curated, Glue/Athena), DataScientist (read curated only), Admin (separate MFA-required role). Attach service-linked roles to Glue/Lambda/Step Functions. Use AWS Config + IAM Access Analyzer to flag overly-permissive policies.

How do I move data from AWS to Snowflake efficiently?

Best pattern: land data in S3 as Parquet → Snowflake external stage → Snowpipe auto-ingest on S3 event. For streaming: use Snowflake Kafka Connector in Snowpipe Streaming mode (sub-10s latency). Avoid pushing via Python loops on a single machine — slow and expensive. For migrations of over 1TB, use Snowflake COPY INTO with MAX_FILE_SIZE tuned and a well-sized warehouse.

Related Cheat Sheets

  • Azure for Data Engineers Cheat Sheet
  • Databricks SQL & Workflows Cheat Sheet
  • Python for Data Engineers Cheat Sheet
  • Apache Airflow Cheat Sheet