AWS June 21, 2026 11 min read

AWS ETL Tools Cheat Sheet: Glue, DMS, Kinesis, EMR & Data Pipeline (2026)

Maintained by the ExamCert editorial team

AWS has at least five overlapping ways to move and transform data: Glue, DMS, Kinesis, EMR and the legacy Data Pipeline. This scannable cheat sheet breaks down what each one does, when to reach for it, and how to pick the right tool fast - the exact judgment the AWS Data Engineer (DEA-C01) exam tests.

1. The AWS ETL Landscape
2. Core ETL Services Cheat Sheet
3. Supporting Data Services
4. Which Tool When? Decision Guide
5. Quick Comparison
6. How This Maps to DEA-C01
7. FAQ

The AWS ETL Landscape

AWS does not have one ETL tool - it has a portfolio of overlapping data-integration services, and knowing which one to reach for is half the battle. Whether you are designing a real production pipeline or sitting the AWS Certified Data Engineer - Associate (DEA-C01) exam, the recurring question is the same: given these requirements, which service is the right fit? This cheat sheet covers the five services most people mean by 'AWS ETL tools' - Glue, DMS, Kinesis, EMR and the legacy Data Pipeline - plus the supporting cast (Lake Formation, Athena, Step Functions and Redshift) you need to know to put them in context.

A quick framing before the cards. ETL splits into three jobs: extract (pull data out of a source), transform (clean, join, reshape) and load (write to a target). AWS services specialise: DMS is built for extract-and-load between databases, Glue and EMR own the transform layer, and Kinesis handles the streaming-ingestion edge. Match the service to the shape of your data - batch vs streaming, database vs files, serverless vs cluster - and the choice usually makes itself.
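The three jobs can be sketched in a few lines of plain Python. This is only an illustration of the extract/transform/load split, not an AWS API: the sample CSV, the field names and the in-memory `target` list are all invented for the example.

```python
import csv
import io

# Hypothetical source data; in practice this would be a database or an S3 object.
RAW_CSV = "order_id,amount,country\n1,19.99,us\n2,,de\n3,5.00,US\n"


def extract(raw: str) -> list[dict]:
    """Extract: pull rows out of the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list[dict]) -> list[dict]:
    """Transform: drop rows with no amount, fix types, normalise country codes."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            {
                "order_id": int(row["order_id"]),
                "amount": float(row["amount"]),
                "country": row["country"].upper(),
            }
        )
    return cleaned


def load(rows: list[dict], target: list) -> int:
    """Load: write the cleaned rows to the target and report how many landed."""
    target.extend(rows)
    return len(rows)


target: list[dict] = []
loaded = load(transform(extract(RAW_CSV)), target)
print(loaded, target)
```

Keeping the stages as separate functions mirrors how the services divide the work: a different tool can own each stage without the others changing.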

Core ETL Services Cheat Sheet

These are the five services you will reach for most often. Read each card as: what it is, when to use it, and the key features worth remembering.

AWS Glue Serverless ETL

What it is: A fully serverless data-integration service built on Apache Spark - no clusters to provision, you pay per second of compute. It is the default AWS answer to 'I need to transform data'.

When to use it: Batch or micro-batch ETL over data in S3, databases or SaaS apps; building and querying a central metadata catalog; schema discovery; visual no-code pipelines.

Key features: the Glue Data Catalog (a Hive-compatible metastore shared by Athena, EMR and Redshift Spectrum); Crawlers that auto-infer schema and partitions; Glue Studio for visual job authoring plus generated PySpark/Scala; Glue Flex execution for cheaper non-urgent batch jobs; 100+ built-in connectors; and native support for open table formats - Apache Iceberg, Hudi and Delta Lake.
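As a rough sketch of how the Flex option shows up in practice, the helper below builds the keyword arguments for the Glue CreateJob API and switches the execution class based on urgency. The job name, role ARN, S3 path, Glue version and worker settings are placeholders chosen for illustration, and the helper itself is hypothetical; nothing is sent to AWS.

```python
def build_glue_job_params(name: str, role_arn: str, script_s3_path: str,
                          urgent: bool = False) -> dict:
    """Return keyword arguments for the Glue CreateJob API.

    Non-urgent jobs use the FLEX execution class; urgent ones use STANDARD.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",      # assumed version for this sketch
        "WorkerType": "G.1X",      # assumed worker size
        "NumberOfWorkers": 2,
        "ExecutionClass": "STANDARD" if urgent else "FLEX",
    }


params = build_glue_job_params(
    name="nightly-orders-etl",                                # hypothetical job name
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",    # placeholder ARN
    script_s3_path="s3://example-bucket/scripts/orders.py",   # placeholder path
)
print(params["ExecutionClass"])

# With credentials configured, the dictionary could be submitted as:
#   import boto3
#   boto3.client("glue").create_job(**params)
```

Building the parameters separately from the API call keeps the cost decision (Flex vs standard) easy to review and test without touching an AWS account.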

AWS DMS Database Migration & CDC

What it is: Database Migration Service - moves data into AWS or between databases with minimal downtime. Despite the name, it is a workhorse for ongoing replication, not just one-time migrations.

When to use it: Lifting a database to RDS/Aurora; continuously replicating an operational DB into a data lake or warehouse; keeping a target in near-real-time sync.

Key features: Change Data Capture (CDC) reads the source's native transaction logs to stream ongoing inserts/updates/deletes with low impact; homogeneous migrations (MySQL to MySQL) need no conversion, while heterogeneous ones (Oracle to PostgreSQL) pair with the AWS Schema Conversion Tool (SCT); DMS Serverless auto-provisions and scales replication capacity. Targets include S3, Redshift, Kinesis and more.
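To make the 'full load plus ongoing CDC' idea concrete, here is a minimal sketch that assembles the arguments for the DMS CreateReplicationTask API, including a table-mapping rule that selects every table in one schema. The ARNs, task identifier and schema name are placeholders, and the helper function is hypothetical; it only builds a dictionary.

```python
import json


def build_dms_task_params(task_id: str, source_arn: str, target_arn: str,
                          instance_arn: str, schema: str) -> dict:
    """Return keyword arguments for the DMS CreateReplicationTask API."""
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-schema",
                # "%" is the wildcard: every table in the chosen schema.
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # Initial bulk copy followed by ongoing change data capture.
        "MigrationType": "full-load-and-cdc",
        # The API expects the mapping rules as a JSON string.
        "TableMappings": json.dumps(table_mappings),
    }


task = build_dms_task_params(
    task_id="sales-to-lake",                                   # hypothetical name
    source_arn="arn:aws:dms:us-east-1:123456789012:endpoint:SRC",   # placeholder
    target_arn="arn:aws:dms:us-east-1:123456789012:endpoint:TGT",   # placeholder
    instance_arn="arn:aws:dms:us-east-1:123456789012:rep:INST",     # placeholder
    schema="sales",
)
print(task["MigrationType"])

# With credentials configured:
#   import boto3
#   boto3.client("dms").create_replication_task(**task)
```

Setting the migration type to `full-load` or `cdc` instead would give a one-time copy or changes-only replication respectively.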

Amazon Kinesis Real-Time Streaming

What it is: A family of services for ingesting and processing streaming data in real time. The two you must distinguish are Data Streams and Data Firehose.

When to use it: Clickstreams, IoT telemetry, logs, change events - anything that arrives continuously and needs sub-minute handling.

Key features: Kinesis Data Streams is a durable, replayable stream with sub-second latency, configurable retention from 24 hours up to 365 days, and many consumers reading the same data; use On-Demand mode to skip manual shard management. Kinesis Data Firehose (renamed Amazon Data Firehose in 2024, though the older name still appears widely) is the zero-admin delivery pipe - it batches, compresses and loads straight into S3, Redshift, OpenSearch or Splunk, and it does not retain data for replay.

Rule of thumb: if you need replay, ordering or multiple consumers, pick Data Streams; if you just need to land the data somewhere, pick Firehose.
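The retention window and On-Demand mode map onto two Data Streams API calls. The sketch below builds the arguments for CreateStream and IncreaseStreamRetentionPeriod and rejects retention values outside the 24-hour to 365-day range described above. The stream name and the helper function are illustrative assumptions; no AWS call is made.

```python
MIN_RETENTION_HOURS = 24          # 24 hours, the lower bound
MAX_RETENTION_HOURS = 365 * 24    # 365 days = 8760 hours


def build_stream_params(name: str, retention_days: int = 1) -> tuple[dict, dict]:
    """Return (CreateStream kwargs, IncreaseStreamRetentionPeriod kwargs)."""
    hours = retention_days * 24
    if not MIN_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS:
        raise ValueError(
            f"retention must be between 1 and 365 days, got {retention_days}"
        )
    create = {
        "StreamName": name,
        # On-Demand mode: no ShardCount to choose or manage.
        "StreamModeDetails": {"StreamMode": "ON_DEMAND"},
    }
    retention = {"StreamName": name, "RetentionPeriodHours": hours}
    return create, retention


create_kwargs, retention_kwargs = build_stream_params("clickstream", retention_days=7)
print(create_kwargs, retention_kwargs)

# With credentials configured:
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.create_stream(**create_kwargs)
#   kinesis.increase_stream_retention_period(**retention_kwargs)
```

Longer retention is what makes replay possible, so it is worth setting deliberately rather than leaving at the default.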

Amazon EMR Big Data Frameworks

What it is: Managed clusters for open-source big-data engines - Apache Spark, Hadoop, Hive, Presto, Flink and HBase. EMR is the heavy-lifting option when Glue's opinionated model is too constraining.

When to use it: Petabyte-scale processing; existing Spark/Hadoop jobs you want to lift-and-shift; fine-grained control over cluster config and tuning; ML pipelines on huge datasets.

Key features: four deployment flavours - EMR on EC2 (full cluster control), EMR Serverless (no clusters, auto-scales per job), EMR on EKS (Spark in containers on Kubernetes), and EMR on Outposts (on-prem). Integrates with S3 (via EMRFS), DynamoDB and the Glue Data Catalog.

Choose EMR over Glue when you need framework flexibility; choose Glue when you want a managed, catalog-driven pipeline with less to operate.
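For the lift-and-shift case, an existing Spark script can be submitted to EMR Serverless largely as-is. The following sketch builds the arguments for the EMR Serverless StartJobRun API. The application ID, role ARN, script path and Spark settings are placeholders, and the helper is hypothetical; it only returns a dictionary.

```python
def build_emr_serverless_job(application_id: str, role_arn: str,
                             script_s3_path: str, args: list[str],
                             executor_memory: str = "4g") -> dict:
    """Return keyword arguments for the EMR Serverless StartJobRun API."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "name": "spark-lift-and-shift",  # hypothetical run name
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_s3_path,
                "entryPointArguments": args,
                # Ordinary spark-submit flags pass through as one string,
                # which is where the tuning control comes from.
                "sparkSubmitParameters": f"--conf spark.executor.memory={executor_memory}",
            }
        },
    }


job = build_emr_serverless_job(
    application_id="00example123",                                 # placeholder ID
    role_arn="arn:aws:iam::123456789012:role/EmrServerlessRole",   # placeholder ARN
    script_s3_path="s3://example-bucket/jobs/aggregate.py",        # placeholder path
    args=["--date", "2026-06-21"],
)
print(job["jobDriver"]["sparkSubmit"]["sparkSubmitParameters"])

# With credentials configured:
#   import boto3
#   boto3.client("emr-serverless").start_job_run(**job)
```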

AWS Data Pipeline Legacy - Avoid for New Work

What it is: The original AWS orchestration service for scheduling and moving data between AWS compute and storage.

Status: AWS closed Data Pipeline to new customers in July 2024; existing customers can keep using it, but it receives no meaningful investment.

When to use it: Essentially never for greenfield builds - know it for the exam and for recognising it in legacy estates.

What to use instead: AWS officially recommends migrating to AWS Glue (Spark-based ETL), AWS Step Functions (serverless workflow orchestration), or Amazon MWAA - Managed Workflows for Apache Airflow (complex Python-defined DAGs). If a scenario mentions Data Pipeline and asks for a modern replacement, map it to one of those three.

Supporting Data Services

These are not ETL engines themselves, but they show up constantly in AWS data architectures and on the DEA-C01 exam. Know where each one sits.

AWS Lake Formation - sits on top of the Glue Data Catalog and S3 to add fine-grained, governed access control (table, column, row and cell-level permissions). It is the security and governance layer for your data lake, not a transform tool.
Amazon Athena - serverless, pay-per-query SQL directly over S3 using the Glue Data Catalog. Ideal for ad-hoc analysis and for validating ETL output without standing up infrastructure.
AWS Step Functions - a serverless orchestrator that strings Glue jobs, Lambda functions and EMR steps into a coordinated state machine. The modern, low-overhead replacement for Data Pipeline's scheduling role.
Amazon Redshift - the cloud data warehouse and the most common load target. Redshift features (zero-ETL ingestion from Aurora, federated queries, Spectrum over S3) increasingly blur the line between warehouse and ETL.

Mental model: Kinesis and DMS extract, Glue and EMR transform, Redshift and S3 are the load targets, Step Functions / MWAA orchestrate, and Lake Formation governs the whole thing. Most real pipelines chain several of these together rather than relying on one.
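One way to picture that chaining is a small Step Functions state machine that runs a Glue job and then validates the output with an Athena query. The sketch below builds an Amazon States Language definition as a Python dictionary. The job name, SQL, workgroup and retry settings are assumptions made for illustration, and a real pipeline would likely add error handling and notifications.

```python
import json


def build_pipeline_definition(glue_job_name: str, validation_sql: str,
                              athena_workgroup: str = "primary") -> dict:
    """Build a state machine definition: Glue transform, then Athena check."""
    return {
        "Comment": "Transform with Glue, then validate the output with Athena",
        "StartAt": "TransformWithGlue",
        "States": {
            "TransformWithGlue": {
                "Type": "Task",
                # The .sync suffix makes Step Functions wait for the run to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 60,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "ValidateWithAthena",
            },
            "ValidateWithAthena": {
                "Type": "Task",
                "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
                "Parameters": {
                    "QueryString": validation_sql,
                    "WorkGroup": athena_workgroup,
                },
                "End": True,
            },
        },
    }


definition = build_pipeline_definition(
    glue_job_name="nightly-orders-etl",                      # hypothetical job
    validation_sql="SELECT COUNT(*) FROM analytics.orders",  # hypothetical table
)
print(json.dumps(definition, indent=2))
```

The JSON string produced by `json.dumps(definition)` is what would be supplied as the state machine definition when creating it.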

Which Tool When? Decision Guide

Use this as a fast lookup when a scenario lands on your desk - or on your exam screen.

'Move a database into AWS or replicate it with low downtime' - AWS DMS (add SCT if the engines differ).
'Serverless batch ETL with a metadata catalog and minimal ops' - AWS Glue.
'Ingest a continuous real-time stream that needs replay or many consumers' - Kinesis Data Streams.
'Just deliver streaming data straight to S3/Redshift/OpenSearch, no code' - Kinesis Data Firehose.
'Run existing Spark/Hadoop jobs or process petabytes with full control' - Amazon EMR (Serverless if you want no clusters).
'Orchestrate a multi-step pipeline across AWS services' - Step Functions (simple) or MWAA / Airflow (complex DAGs).
'Ad-hoc SQL over data already in S3' - Athena.
'Govern who can see which rows/columns in the lake' - Lake Formation.
'A legacy job still on AWS Data Pipeline' - migrate to Glue, Step Functions or MWAA.
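The lookup above can also be written down as a small function, which is a handy way to check that the rules are unambiguous. This is a study aid, not an official AWS decision tree: the workload labels and flag names are invented for the example.

```python
def pick_service(workload: str, *, needs_replay: bool = False,
                 existing_spark: bool = False, complex_dag: bool = False,
                 engines_differ: bool = False) -> str:
    """Map a scenario to the service suggested by the decision guide."""
    if workload == "database_migration":
        return "AWS DMS + SCT" if engines_differ else "AWS DMS"
    if workload == "batch_etl":
        # Existing Spark/Hadoop jobs or a need for full control point to EMR.
        return "Amazon EMR" if existing_spark else "AWS Glue"
    if workload == "streaming":
        return "Kinesis Data Streams" if needs_replay else "Kinesis Data Firehose"
    if workload == "orchestration":
        return "Amazon MWAA" if complex_dag else "AWS Step Functions"
    if workload == "adhoc_sql":
        return "Amazon Athena"
    if workload == "governance":
        return "AWS Lake Formation"
    if workload == "legacy_data_pipeline":
        return "Migrate to Glue, Step Functions or MWAA"
    raise ValueError(f"unknown workload: {workload!r}")


print(pick_service("streaming", needs_replay=True))
print(pick_service("database_migration", engines_differ=True))
print(pick_service("batch_etl"))
```

Raising on an unknown workload is deliberate: a scenario that fits none of the branches deserves a closer read rather than a default answer.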

The most-tested distinctions are Glue vs EMR (managed/serverless ETL vs flexible big-data clusters), Data Streams vs Firehose (replayable buffer vs fire-and-forget delivery) and DMS vs Glue (database replication/CDC vs general transformation). If you can articulate those three trade-offs cleanly, you can answer most ETL questions. The same architectural-judgment muscle is rewarded across associate-level AWS exams, including the Solutions Architect - Associate (SAA-C03).

Quick Comparison

One screen, all five core services side by side:

Service | What it is | Processing mode | Best for
AWS Glue | Serverless Spark ETL | Batch / micro-batch | Catalog-driven transforms, schema discovery, no-code pipelines
AWS DMS | Database replication | Batch + CDC | Migrations and ongoing DB-to-lake/warehouse sync
Kinesis | Streaming ingestion | Real-time | Clickstream, IoT, logs; Streams for replay, Firehose for delivery
Amazon EMR | Managed big-data clusters | Batch + streaming | Large-scale or custom Spark/Hadoop/Flink workloads
Data Pipeline | Legacy orchestration | Batch | Nothing new - migrate to Glue / Step Functions / MWAA

Master AWS data engineering

Cards are great for recall, but the DEA-C01 tests scenario judgment. Drill realistic questions on Glue, DMS, Kinesis and EMR until picking the right service is automatic.

Free DEA-C01 Practice

How This Maps to DEA-C01

The AWS Certified Data Engineer - Associate (DEA-C01) exam is built almost entirely around these services. Its four domains are Data Ingestion and Transformation (34%), Data Store Management (26%), Data Operations and Support (22%) and Data Security and Governance (18%). The cheat sheet above maps straight onto the heaviest-weighted domain: ingestion and transformation is where Glue, DMS, Kinesis and EMR live, and it alone is over a third of your score.

You will see questions that hand you a set of constraints - latency, data volume, source type, cost ceiling, ops budget - and ask which service or combination fits. Cost optimisation runs through the whole exam, so know the cheaper modes: Glue Flex for non-urgent batch, Kinesis On-Demand vs provisioned shards, EMR Serverless vs always-on clusters, and DMS Serverless. Treat every 'which tool when' card here as a question template and you are studying the exam directly. Use the practice questions linked above to pressure-test that recall under exam conditions.

Frequently Asked Questions

What is the difference between AWS Glue and AWS DMS?

AWS Glue is a general-purpose serverless ETL service for transforming data - cleaning, joining and reshaping it - with a built-in Data Catalog. AWS DMS is purpose-built to extract and load data between databases, including continuous Change Data Capture (CDC) replication. Use DMS to get a database into AWS or keep a target in sync; use Glue to transform that data once it lands. They are often used together: DMS replicates into S3, Glue transforms it.
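The 'DMS replicates, Glue transforms' hand-off often comes down to applying change records to a snapshot. The sketch below assumes change rows carry an `Op` field with `I`, `U` or `D` (insert, update, delete), as DMS writes for CDC files on an S3 target. Plain Python stands in here for what a Glue Spark job would do at scale, and the `id` key and sample rows are invented.

```python
def apply_cdc(snapshot: dict, changes: list[dict], key: str = "id") -> dict:
    """Apply ordered CDC rows to a snapshot keyed by primary key."""
    state = dict(snapshot)  # leave the caller's snapshot untouched
    for change in changes:
        op = change["Op"]
        row = {k: v for k, v in change.items() if k != "Op"}
        if op in ("I", "U"):
            state[row[key]] = row        # insert or overwrite with latest values
        elif op == "D":
            state.pop(row[key], None)    # delete; ignore if already absent
        else:
            raise ValueError(f"unexpected Op value: {op!r}")
    return state


full_load = {1: {"id": 1, "status": "new"}, 2: {"id": 2, "status": "new"}}
cdc_rows = [
    {"Op": "U", "id": 1, "status": "shipped"},
    {"Op": "I", "id": 3, "status": "new"},
    {"Op": "D", "id": 2, "status": "new"},
]
current = apply_cdc(full_load, cdc_rows)
print(current)
```

Order matters: the changes must be applied in the sequence they were captured, otherwise an older update can overwrite a newer one.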

Is AWS Data Pipeline still available in 2026?

Existing customers can still use it, but AWS closed it to new customers in July 2024 and it is effectively in legacy maintenance. Do not build new workloads on it. AWS recommends migrating to AWS Glue (Spark ETL), AWS Step Functions (serverless orchestration) or Amazon MWAA (managed Apache Airflow) depending on the use case. The exam may still reference it, usually to test whether you know the modern replacement.

When should I use EMR instead of Glue?

Choose Amazon EMR when you need framework flexibility or control - existing Spark/Hadoop/Flink jobs, custom libraries, petabyte-scale tuning, or workloads that do not fit Glue's opinionated, catalog-driven model. Choose Glue when you want a managed, mostly serverless ETL pipeline with minimal operational overhead and tight Data Catalog integration. EMR Serverless narrows the gap by removing cluster management, so the real decision is opinionated-managed (Glue) vs flexible-engine (EMR).

What is the difference between Kinesis Data Streams and Firehose?

Kinesis Data Streams is a durable, replayable real-time stream with sub-second latency, configurable retention (24 hours to 365 days) and support for multiple independent consumers - use it when you need replay, ordering or several apps reading the same data. Kinesis Data Firehose (renamed Amazon Data Firehose in 2024) is a fully managed delivery service that batches and loads straight into S3, Redshift, OpenSearch or Splunk and does not retain data for replay - use it when you simply need to land streaming data somewhere and can tolerate a delivery delay that depends on the configured buffer size and buffer interval.
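The buffering trade-off is easier to see with a toy model. The function below is a simplified simulation, not the service's actual algorithm: it flushes a batch when the buffered bytes reach a size threshold, or when a record arrives after the buffer has been open longer than a time threshold. The thresholds and sample records are invented for illustration.

```python
def simulate_buffering(records: list[tuple[float, int]], max_bytes: int,
                       max_seconds: float) -> list[list[int]]:
    """Group (arrival_time_seconds, size_bytes) records into delivery batches.

    Records must be sorted by arrival time. Returns the record sizes per batch.
    """
    batches: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    opened_at = 0.0
    for arrival, size in records:
        # Time-based flush: the open buffer has aged past the interval.
        if current and arrival - opened_at >= max_seconds:
            batches.append(current)
            current, current_bytes = [], 0
        if not current:
            opened_at = arrival  # a new buffer opens with this record
        current.append(size)
        current_bytes += size
        # Size-based flush: the buffer is full.
        if current_bytes >= max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        batches.append(current)  # deliver whatever is left at the end
    return batches


events = [(0, 400), (10, 400), (20, 400), (100, 100), (170, 100)]
print(simulate_buffering(events, max_bytes=1000, max_seconds=60))
```

With these inputs the first three records flush together on size, while the two later records are delivered separately because the time threshold elapses between them. Busy streams flush on size and quiet streams flush on time, which is why the delay varies with traffic.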

ExamCert Team

Certified IT professionals creating honest, up-to-date exam preparation content. Updated regularly to match current exam objectives.
