Essential AWS services for data engineering: S3, Glue, Lambda, Step Functions, Athena, IAM, and common architecture patterns with cost tips.
| Service | Purpose | When to Use |
|---|---|---|
| S3 | Object storage (the data lake) | All raw data, Parquet/Iceberg tables, intermediate files |
| Glue Data Catalog | Hive-compatible metastore | Central schema store for Athena, EMR, Redshift Spectrum |
| Glue ETL | Serverless Spark jobs | PySpark ETL when you want managed clusters |
| Glue Crawlers | Auto-discover schemas | New sources where schema is unknown / evolving |
| Lambda | Short-lived (<15 min) serverless functions | S3 triggers, API handlers, lightweight ETL |
| Step Functions | Serverless workflow orchestration | ETL DAGs without managing Airflow |
| Athena | Serverless SQL on S3 | Ad-hoc queries, $5/TB scanned |
| Redshift / Redshift Serverless | Cloud data warehouse | BI, large analytics workloads |
| EMR / EMR Serverless | Managed Hadoop/Spark | Heavy Spark jobs, full cluster control |
| MSK / Kinesis | Streaming (managed Kafka / AWS-native) | Real-time ingestion pipelines |
| IAM | Identity & access control | Everything — least-privilege roles, service roles |
| Secrets Manager | Credential storage with rotation | DB passwords, API keys |
A minimal Glue ETL job (PySpark) that reads a single date partition from the Data Catalog and appends it to the curated zone as Parquet:

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import functions as F
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'run_date'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read from Data Catalog (pushdown predicate for partition pruning)
src = glueContext.create_dynamic_frame.from_catalog(
database="raw", table_name="orders",
push_down_predicate=f"dt = '{args['run_date']}'"
)
df = src.toDF()
clean = (
df.filter(F.col("amount") > 0)
.withColumn("dt", F.lit(args['run_date']))
)
# Write to curated zone as Parquet
clean.write.mode("append").partitionBy("dt").parquet(
"s3://analytics-curated/orders/"
)
job.commit()
```

The Step Functions state machine below ("Daily orders ETL") chains two Glue jobs and a Lambda loader, with exponential-backoff retries and an SNS alert on failure:

```json
{
"Comment": "Daily orders ETL",
"StartAt": "Extract",
"States": {
"Extract": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": { "JobName": "extract-orders", "Arguments": { "--run_date.$": "$.run_date" } },
"Retry": [{ "ErrorEquals": ["States.ALL"], "IntervalSeconds": 30, "MaxAttempts": 3, "BackoffRate": 2.0 }],
"Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }],
"Next": "Transform"
},
"Transform": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": { "JobName": "transform-orders" },
"Next": "Load"
},
"Load": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "load-to-snowflake", "Payload.$": "$" },
"End": true
},
"NotifyFailure": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": { "TopicArn": "arn:aws:sns:us-east-1:123:etl-alerts", "Message.$": "$" },
"End": true
}
}
}
```

A least-privilege IAM policy for the job's role: read-only on the raw zone's orders prefix, write-only on the curated orders prefix, read on the Glue Catalog. Note that s3:ListBucket must target the bucket ARN with an s3:prefix condition, while s3:GetObject targets the object ARNs; combining both actions in one conditioned statement blocks the reads, because s3:prefix is never present on GetObject requests.

```json
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadRawOrdersOnly",
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::raw-zone/orders/*",
"arn:aws:s3:::raw-zone"
],
"Condition": { "StringLike": { "s3:prefix": "orders/*" } }
},
{
"Sid": "WriteCuratedOrdersOnly",
"Effect": "Allow",
"Action": ["s3:PutObject", "s3:DeleteObject"],
"Resource": "arn:aws:s3:::curated-zone/orders/*"
},
{
"Sid": "GlueCatalogRead",
"Effect": "Allow",
"Action": ["glue:GetTable", "glue:GetPartitions", "glue:GetDatabase"],
"Resource": "*"
}
]
}
```

Q: Why is my Athena bill huge?
Athena charges $5 per TB scanned. Huge bills almost always mean scanning unpartitioned data or running SELECT *, which defeats Parquet's column pruning. Fix: (1) Partition by date and filter on partition columns. (2) SELECT only the columns you need. (3) Convert CSV to Parquet. (4) Use CTAS to create partitioned summary tables. (5) Set a per-query scan limit on the workgroup (BytesScannedCutoffPerQuery). Typical reduction: 10-100x.
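A boto3 sketch for (5); the workgroup name is a placeholder:

```python
import boto3

athena = boto3.client("athena")

# Cap each query at ~100 GB scanned; Athena cancels queries that exceed
# the cutoff instead of letting them run up the bill.
athena.update_work_group(
    WorkGroup="analytics",  # hypothetical workgroup name
    ConfigurationUpdates={"BytesScannedCutoffPerQuery": 100 * 1024**3},
)
```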
Q: S3 costs exploding on a data lake?
Usually four causes: (1) Too many small files: S3 charges per request, and per-file Parquet metadata overhead adds up; compact to 128MB+ files. (2) Storage class not tuned: use Intelligent-Tiering for ambiguous access patterns. (3) Old raw data never deleted: add lifecycle rules (raw → Glacier at 90d, delete at 365d), as sketched below. (4) Cross-region reads cost $0.02/GB: keep compute in the same region as the data.
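A boto3 sketch of the lifecycle rule in (3), reusing the raw-zone bucket name from the IAM example:

```python
import boto3

s3 = boto3.client("s3")

# Transition raw data to Glacier at 90 days, delete at 365.
s3.put_bucket_lifecycle_configuration(
    Bucket="raw-zone",  # bucket name assumed from the policy above
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-raw",
            "Filter": {"Prefix": ""},  # whole bucket; narrow to a prefix if needed
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```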
Q: Lambda + S3 event storm?
S3 events can fan out to Lambda at huge volume when many files land at once. Mitigate: (1) Put SQS between S3 and Lambda for batching and backpressure. (2) Set Lambda reserved concurrency to cap parallelism. (3) Tune batch size and batching window on the SQS event source mapping. (4) For ETL, prefer EventBridge Scheduler + a Glue job over per-file Lambda invocations.
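A sketch of the handler behind the SQS queue from (1), reporting partial batch failures so only failed messages are retried (requires ReportBatchItemFailures on the event source mapping; process_object is a hypothetical per-file step):

```python
import json


def process_object(bucket: str, key: str) -> None:
    # Placeholder for the real per-file transform.
    print(f"processing s3://{bucket}/{key}")


def handler(event, context):
    # Each SQS record wraps one S3 event notification in its body.
    failures = []
    for record in event["Records"]:
        try:
            s3_event = json.loads(record["body"])
            for s3_record in s3_event.get("Records", []):  # absent on s3:TestEvent
                bucket = s3_record["s3"]["bucket"]["name"]
                key = s3_record["s3"]["object"]["key"]
                process_object(bucket, key)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    # SQS re-delivers only these messages; the rest are deleted.
    return {"batchItemFailures": failures}
```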
Lambda: < 15 min runtime, up to 10GB memory, event-driven (S3 trigger, API). Glue ETL: serverless Spark, no cluster management, best for scheduled ETL on moderate data volumes. EMR Serverless: larger or longer Spark jobs, more tuning control, and the option to bring custom libraries or pin specific Spark versions. EMR on EC2: full control, long-running clusters, best for continuous streaming or the wider Hadoop ecosystem (HBase, Presto).
Athena: serverless, $5/TB scanned, best for ad-hoc queries on S3. Redshift Spectrum: query S3 from Redshift at the same $5/TB-scanned rate, integrates with Redshift tables. Snowflake on AWS: separate compute/storage, auto-clustering, result cache; best performance and governance, but higher per-query cost. Pick Athena for infrequent ad-hoc work, Snowflake for production analytics with BI tools, Spectrum if you are already on Redshift.
Layer defenses: (1) Bucket policies + IAM least-privilege roles. (2) Block all public access at account level. (3) Default encryption SSE-KMS with customer-managed keys. (4) Enable CloudTrail + S3 Access Logs + GuardDuty S3 protection. (5) Use Lake Formation for column/row-level access control on Glue Catalog. (6) VPC endpoints so traffic never leaves AWS network. (7) Rotate IAM keys, prefer IAM Roles over keys.
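A boto3 sketch for (2) and (3); the account ID, bucket name, and KMS alias are placeholders:

```python
import boto3

# (2) Block public access account-wide, not bucket by bucket.
boto3.client("s3control").put_public_access_block(
    AccountId="123456789012",  # hypothetical account ID
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# (3) Default SSE-KMS with a customer-managed key on the data lake bucket.
boto3.client("s3").put_bucket_encryption(
    Bucket="raw-zone",
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/data-lake",  # hypothetical key alias
            },
            "BucketKeyEnabled": True,  # reduces KMS request costs
        }]
    },
)
```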
Step Functions: best for pure AWS-service orchestration (Glue + Lambda + SNS), cheapest, no infra to manage. MWAA (managed Airflow): best for complex DAGs, cross-cloud, rich operator ecosystem, but costs $300-1000/mo baseline. Glue Workflows: tight integration with Glue jobs but limited outside Glue. Use Step Functions for simple-to-medium pipelines, MWAA when you need cross-cloud or dbt Cloud integration.
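For reference, starting the daily-orders state machine above from code; the state machine ARN is a placeholder:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Pass run_date so the Extract step can forward it to Glue as --run_date.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:daily-orders",  # hypothetical
    input=json.dumps({"run_date": "2024-01-15"}),
)
print(response["executionArn"])
```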
Kinesis Data Streams: AWS-native, shard-based, with an on-demand capacity mode that scales automatically, simpler ops. MSK: managed Kafka, open protocol, best if you have an existing Kafka ecosystem (Kafka Connect, ksqlDB, Schema Registry). SQS: a queue, not a stream; no replay, and standard queues do not guarantee ordering, so it suits task queues rather than analytics. Pick Kinesis for new AWS-native streaming, MSK for Kafka compatibility, SQS only for work queues.
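A minimal Kinesis producer sketch (hypothetical stream name) showing what shard-based means in practice:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Records sharing a PartitionKey land on the same shard in order,
# so key by the entity whose events must stay ordered.
kinesis.put_record(
    StreamName="orders-stream",  # hypothetical stream
    Data=json.dumps({"order_id": "o-123", "amount": 42.5}).encode("utf-8"),
    PartitionKey="o-123",
)
```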
Use IAM Identity Center (SSO), no long-lived access keys. Federate via your IdP (Okta, Azure AD). Create permission sets per role: DataEngineer (read all raw, write curated, Glue/Athena), DataScientist (read curated only), Admin (separate MFA-required role). Attach service-linked roles to Glue/Lambda/Step Functions. Use AWS Config + IAM Access Analyzer to flag overly-permissive policies.
Best pattern: land data in S3 as Parquet → Snowflake external stage → Snowpipe auto-ingest on S3 event notifications. For streaming: use the Snowflake Kafka Connector in Snowpipe Streaming mode (sub-10s latency). Avoid pushing rows via Python loops on a single machine; it is slow and expensive. For migrations over 1TB, split staged files into roughly 100-250MB chunks and run COPY INTO on a well-sized warehouse.
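A bulk-load sketch with snowflake-connector-python; the connection details and the external stage @raw_stage are placeholders:

```python
import snowflake.connector

# Assumes @raw_stage is an external stage already pointing at the S3 prefix.
conn = snowflake.connector.connect(
    account="myaccount",  # hypothetical credentials; prefer key-pair auth in production
    user="etl_user",
    password="...",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
try:
    conn.cursor().execute("""
        COPY INTO orders
        FROM @raw_stage/orders/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
finally:
    conn.close()
```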