50 Data Engineer Interview Questions for 2026 (with Real Answers)
The exact questions data hiring managers ask in 2026 — SQL, Python, Spark, Airflow, dbt, Snowflake, Databricks, streaming, modeling, and the production scenarios that close the offer.

Table of Contents
Data engineering interviews in 2026 lean on three layers: SQL fluency, distributed-systems thinking, and the modern data stack (Snowflake / Databricks / BigQuery + dbt + Airflow). Expect technical screens with live SQL coding, a system-design round on a hypothetical pipeline, and a "tell me about a pipeline that broke" behavioural round. This list covers all three.
Tip: The biggest junior-vs-senior gap in data interviews is idempotency. Senior engineers always ask "what happens if this job runs twice?" Practice phrasing every answer in those terms.
How to Prepare
- Build a real end-to-end pipeline. S3/ADLS → Spark or dbt → warehouse table → dashboard. Orchestrated by Airflow or Dagster. Even a small project is interview gold.
- SQL is everything. Window functions, CTEs, recursive queries, anti-joins, dedup patterns. Practice on StrataScratch / LeetCode SQL section.
- Read the AWS DEA-C01 syllabus. Our DEA-C01 guide is the AWS-flavour version of the modern data stack.
- Practice with our free DEA-C01 questions — scenario style overlaps heavily with data interviews.
- Bring a "broken pipeline" story. Detection, mitigation, root cause, long-term fix.
SQL Fundamentals (Q1-Q9)
Q1. Difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN?
INNER returns rows present in both tables. LEFT returns all left rows + matching right. RIGHT returns all right rows + matching left. FULL OUTER returns rows from both, with NULLs where unmatched. Most pipelines use LEFT JOIN by default to preserve the fact table even when dimensions are missing.
Q2. Write a query to find duplicates.
SELECT col, COUNT(*) FROM t GROUP BY col HAVING COUNT(*) > 1. To return the actual duplicate rows: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) rn FROM t) WHERE rn > 1.
Q3. Difference between WHERE and HAVING?
WHERE filters rows before aggregation. HAVING filters groups after aggregation. WHERE status = 'paid' drops unpaid rows; HAVING SUM(amount) > 1000 drops groups whose total is too small.
Q4. What are window functions?
Functions that operate on a window of rows relative to the current row without collapsing the result set. ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER (), AVG() OVER (). Critical for deduplication, sessionization, running totals, and SCD Type 2 implementation.
Q5. RANK vs DENSE_RANK vs ROW_NUMBER?
ROW_NUMBER: 1,2,3,4 — always unique. RANK: 1,2,2,4 — gaps after ties. DENSE_RANK: 1,2,2,3 — no gaps. ROW_NUMBER is the dedup workhorse.
Q6. What is a CTE (WITH clause)?
Named subquery scoped to a single SQL statement. Improves readability vs nested subqueries. Recursive CTEs traverse hierarchies (org charts, BOM, graph paths). Modern warehouses compile CTEs to the same plan as inlined subqueries — use them for clarity, not performance.
Q7. How would you find the 2nd highest salary per department?
SELECT * FROM (SELECT *, DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) r FROM emp) WHERE r = 2. Mention edge cases: ties, NULL salaries, departments with only one employee.
Q8. EXISTS vs IN vs JOIN for filtering?
All three can work. EXISTS is often most efficient for "is there at least one match" semantics. IN with a subquery returning many rows can blow up. JOIN risks fan-out if the right side has duplicates. For data engineering, prefer EXISTS or LEFT JOIN + IS NOT NULL.
Q9. How do you optimize a slow query?
EXPLAIN ANALYZE first — never optimize without seeing the plan. Look for: full table scans on large tables (add an index or partition prune), nested loops on huge data (rewrite as hash join), missing partition filters in a partitioned warehouse, redundant subqueries, broadcast joins not happening for small dim tables. Cluster / sort the table on the most filtered column.
Data Modeling & Warehousing (Q10-Q16)
Q10. Star schema vs Snowflake schema?
Star: one central fact table with denormalized dimension tables. Snowflake: dimensions further normalized into related tables. Star is simpler, faster, the default for analytics. Snowflake is normalized but slower due to extra joins. Most modern warehouses use star.
Q11. Fact tables — what types?
Transactional fact (one row per event), Periodic snapshot fact (one row per dimension combination per time period), Accumulating snapshot fact (one row per workflow with timestamps for each milestone). Pick based on the business question.
Q12. What is a slowly changing dimension (SCD)?
Pattern for tracking changes to dimension attributes over time. Type 1: overwrite (no history). Type 2: insert new row with effective_from / effective_to dates (full history). Type 3: prior value in a separate column (limited). SCD Type 2 is the most common in production.
Q13. What is dimensional modeling (Kimball) vs Data Vault?
Kimball: business-facing star schemas optimized for query. Data Vault: hub-link-satellite normalized model optimized for ingestion and auditability, with star schemas built downstream. Kimball wins for analytical simplicity; Data Vault for highly regulated enterprises with many source systems.
Q14. Data warehouse vs data lake vs lakehouse?
Warehouse: cleaned, typed, SQL-optimized (Snowflake, Redshift, BigQuery). Lake: raw data on object storage. Lakehouse: warehouse-grade ACID, schema, and indexing on top of the lake (Apache Iceberg + Athena/Trino, Databricks Delta Lake, Snowflake on Iceberg). Lakehouse is the 2026 default direction.
Q15. What is Apache Iceberg and why does it matter?
Open table format for huge analytic datasets. Adds ACID, schema evolution, partition evolution, and time travel to Parquet on object storage. Reads work across Spark, Trino, Athena, Snowflake, Databricks, BigQuery. Eliminates vendor lock-in for warehouse-grade data on a lake. See our Apache Iceberg skills guide.
Q16. Partitioning vs clustering?
Partitioning physically splits a table into chunks based on a column (typically a date) — queries that filter on that column skip whole partitions. Clustering / sort keys reorder data within a partition (or table) to improve compression and selective reads. Most warehouses recommend partitioning by date + clustering by the most common filter column.
ETL / ELT & Orchestration (Q17-Q23)
Q17. ETL vs ELT?
ETL transforms in a separate engine then loads. ELT loads raw data first, transforms in the warehouse. ELT now dominates because cloud warehouse compute is cheap and storing raw data preserves the ability to re-derive transformations later.
Q18. What is dbt and how does it work?
dbt (data build tool) compiles SQL files into a DAG of materialized models (views, tables, incremental tables, ephemeral CTEs) inside a warehouse. Adds tests, documentation, lineage, source freshness checks. The 2026 default tool for the "T" in ELT.
Q19. What is idempotency in a pipeline?
Property that re-running the pipeline produces the same result. Critical for retries, backfills, and recovery. Common patterns: MERGE / UPSERT instead of INSERT-only; deterministic primary keys; partition-overwrite instead of append; full-refresh weekly even with incremental daily. Every senior data engineer answer should mention idempotency.
Q20. Airflow — what is a DAG?
Directed Acyclic Graph. Defines a workflow of tasks and their dependencies. Each task is an Operator (BashOperator, PythonOperator, SparkSubmitOperator, etc). Scheduled by Airflow Scheduler; executed by Workers via Executors (Celery, Kubernetes, Local).
Q21. Airflow vs Dagster vs Prefect — how to pick?
Airflow: oldest, biggest community, declarative DAGs in Python. Dagster: newer, opinionated software-engineering approach, strong asset-based model. Prefect: modern, dynamic workflows, hybrid execution. Pick Airflow for existing ecosystems, Dagster or Prefect for greenfield projects starting in 2026.
Q22. What is backfilling?
Re-running a pipeline for historical dates — usually to recover from a bug, schema change, or new source. Backfills are where pipeline correctness pays off: idempotent jobs backfill cleanly; non-idempotent ones double-count or corrupt data. Airflow has built-in backfill support; Dagster has partition-based backfills.
Q23. CDC (Change Data Capture) — what is it?
Streaming changes (insert / update / delete) from a source database to a downstream system without full table reloads. Implementations: log-based (Debezium reads the database WAL / binlog), trigger-based (sketchy), query-based (poll incremental column). Log-based is the production standard.
Spark & Distributed Compute (Q24-Q29)
Q24. What is Apache Spark?
Distributed compute engine for big data, with APIs in Scala, Python (PySpark), R, and SQL. Built around the DataFrame abstraction. Replaced Hadoop MapReduce for most workloads. Runs on YARN, Kubernetes, Databricks, EMR, Synapse, and standalone.
Q25. RDD vs DataFrame vs Dataset?
RDD: low-level, untyped, no Catalyst optimizer. DataFrame: tabular, optimized by Catalyst, the standard. Dataset: typed DataFrame (Scala only). For most data engineering, use DataFrames.
Q26. Spark transformations vs actions?
Transformations (map, filter, join, groupBy) are lazy — they define the plan but do not execute. Actions (count, collect, show, write) trigger execution. Spark optimizes the entire plan when an action is called.
Q27. What is shuffling and how do you minimize it?
Shuffling = redistributing data across the cluster, usually triggered by wide transformations (groupBy, join, distinct). It is expensive (network + disk). Minimize by: broadcasting small dim tables (broadcast joins), pre-partitioning data on the join key, using bucketing for repeated joins, avoiding unnecessary groupBys.
Q28. What is the difference between cache and persist?
cache() = persist(MEMORY_ONLY). persist() lets you specify storage level (memory, disk, both, serialized). Cache the result of an expensive computation that is reused. Always unpersist when done to release resources.
Q29. Skewed joins — how do you handle them?
A skewed key (one value with vastly more rows than others) makes one executor work much harder than the rest. Mitigations: salt the skewed key (add a random suffix, join on multiple, then re-aggregate); use broadcast join if one side is small; in Spark 3+ enable Adaptive Query Execution (AQE) which detects and handles skew automatically.
Streaming (Q30-Q34)
Q30. Batch vs streaming?
Batch: process bounded data periodically (hourly, daily). Streaming: process unbounded data continuously. Batch is cheaper, simpler, fine for most analytics. Streaming is needed for real-time use cases: fraud detection, alerting, personalization with sub-minute freshness.
Q31. What is Apache Kafka?
Distributed event streaming platform. Topics are partitioned and replicated logs. Producers write events, consumers read them. Durable (events persist), replayable, scales horizontally. Default backbone for streaming pipelines and event-driven architectures.
Q32. Exactly-once vs at-least-once vs at-most-once?
At-most-once: events may be lost but never duplicated. At-least-once: events never lost but may be duplicated — idempotent consumers required. Exactly-once: events delivered once exactly — achievable in modern Kafka with transactions and idempotent producers. Default to at-least-once + idempotent processing for most pipelines.
Q33. Event time vs processing time?
Event time: when the event actually happened. Processing time: when the streaming engine processes it. They differ due to network delays and late-arriving data. Stream processors (Flink, Spark Structured Streaming) handle event time via watermarks — an estimate of how late data can still arrive.
Q34. Apache Flink vs Spark Structured Streaming?
Flink: true streaming (event-at-a-time), state-of-the-art for low-latency stateful processing. Spark Structured Streaming: micro-batching by default, simpler if you already use Spark for batch. Pick Flink for sub-second latency requirements; Spark Structured Streaming if you want unified batch + streaming on one engine.
Cloud Data Stacks (Q35-Q40)
Q35. Snowflake vs BigQuery vs Redshift vs Databricks?
Snowflake: cloud-only, separates compute from storage, multi-cluster auto-scaling, easy multi-cloud. BigQuery: serverless, per-query pricing, Google ecosystem. Redshift: AWS-native, integrates well with the rest of AWS, RA3 + Serverless options. Databricks: lakehouse on Delta Lake / Iceberg, strong for ML + ETL on the same data. Choose based on existing cloud, latency, and team skills.
Q36. Snowflake virtual warehouse — what is it?
Compute cluster you assign to queries / workloads. Independent of storage. Different warehouses can run on the same data without contention. Suspend when idle to save cost; auto-resume on next query. Scale horizontally (multi-cluster) for concurrency, vertically (size) for query speed.
Q37. AWS Glue — what does it do?
Managed ETL service with three main components: Glue Catalog (metadata store, like Hive Metastore), Glue Crawlers (auto-discover schema), Glue Jobs (run Spark or Python ETL). Pairs well with Athena, Redshift Spectrum, and EMR. Note: Glue Studio is the visual builder layer.
Q38. Athena — when do you use it?
Serverless SQL on data in S3. Pay per query (bytes scanned). Good for ad-hoc analysis on data lake files, occasional reports, and quick exploration without provisioning a warehouse. Use partitioning + columnar formats (Parquet) to keep cost low.
Q39. Databricks Photon — what is it?
Vectorized C++ query engine that replaces JVM execution for SQL and DataFrame workloads on Databricks. 2-12x faster for many workloads, especially scans and aggregations. Use it for SQL warehouse workloads; less benefit for pure Python UDF-heavy jobs.
Q40. What is Unity Catalog?
Databricks' unified governance layer across workspaces, clouds, and data assets. Manages access control, lineage, and discovery for tables, files, ML models, and notebooks. The 2026 successor to Hive Metastore inside Databricks.
Data Quality & Observability (Q41-Q44)
Q41. How do you ensure data quality in a pipeline?
Define expectations as code (Great Expectations, dbt tests, Soda). Run on every load. Alert on violations. Specific checks: row counts within range, no nulls in required columns, primary key uniqueness, referential integrity, business metric within expected bounds (e.g. revenue today ± 30% of 7-day average). Reject or quarantine bad data rather than silently passing it on.
Q42. What is data observability?
Treat data pipelines the same as production services: monitor freshness, volume, schema, distribution, and lineage. Tools: Monte Carlo, Datafold, Bigeye, OpenLineage. Catches silent data issues that traditional tests miss (e.g. upstream API started returning timestamps in a different timezone).
Q43. What is data lineage and why does it matter?
Lineage is the dependency graph from source to consumed dataset. Critical for impact analysis ("if I change this column, what dashboards break?") and incident response. OpenLineage is the open standard; tools include Marquez, DataHub, Atlan.
Q44. Data SLAs — what are they?
Service-level agreements on data freshness, completeness, and quality. Example: "Daily fact_orders table will be available by 06:00 UTC, with row count within 80-120% of the 7-day average." Document these; alert when violated; review with stakeholders quarterly.
Production Scenarios (Q45-Q50)
Q45. The daily ETL job is missing its SLA. Walk me through.
Check the orchestrator (Airflow / Dagster) for the failure point. If a job failed: retry, read the logs. If jobs succeeded but slow: profile the slowest job (Spark UI, warehouse query history). Common causes: data volume spike, broken partition pruning, expensive new join, dimension table grew, cluster downsized. Mitigate with short-term scale-out; fix root cause.
Q46. Design a real-time pipeline for clickstream events.
App / SDK → Kafka / Kinesis → Stream processor (Flink / Spark Structured Streaming) for sessionization and aggregation → OLAP store (ClickHouse / Druid / Pinot) for sub-second analytical queries + S3 / lake for raw long-term. Schema registry for evolution. Idempotent consumers. Watermarks for late events. Monitoring on per-topic lag.
Q47. You discover a bug that under-reported revenue for 3 months. What do you do?
(1) Communicate immediately to data leadership and finance — do not hide. (2) Quantify impact: rows affected, magnitude, downstream dashboards / reports / decisions that used the wrong numbers. (3) Fix the code with a test that would have caught it. (4) Backfill the corrected data with idempotent rerun. (5) Write a post-mortem; add the missing data quality check.
Q48. How do you handle PII in a data lake?
Classify at ingestion (column-level metadata). Encrypt at rest with separate keys for PII tables (KMS / Cloud KMS). Restrict access via column-level security (Snowflake, BigQuery, Unity Catalog, Lake Formation). Hash or tokenize where joins are still needed without raw values. Audit access via query logs. GDPR / CCPA: implement deletion workflows.
Q49. The warehouse bill is up 80% MoM. Where do you look?
Query history grouped by warehouse, user, and tag. Look for: a new pipeline doing full scans, a long-running notebook left open, a dashboard refresh hitting expensive joins, a poorly-clustered table, unmaterialized expensive views. Set query timeouts, add resource monitors, materialize hot models, partition / cluster the largest tables.
Q50. Tell me about a pipeline that broke.
Have one ready. Use STAR. Cover: detection (alert that fired or escalation that came in), early hypothesis vs actual root cause, decision to mitigate (rollback, manual fix, skip a day), communication with stakeholders, the fix you shipped, the test that would now catch it, what you would change going forward.
Practice DEA-C01 with Free AI Questions
AWS Data Engineer Associate — scenario style questions that mirror real interviews.
Try DEA-C01 Practice ExamFrequently Asked Questions
What is the difference between ETL and ELT?
ETL transforms in a separate engine before loading. ELT loads raw data first, transforms in the warehouse. ELT is now dominant for cloud warehouses.
What is the difference between a data warehouse and a data lake?
Warehouse: cleaned, structured, SQL-optimized. Lake: raw data on object storage. Lakehouses (Iceberg, Delta) combine both.
What is a slowly changing dimension (SCD)?
Pattern for tracking dimension changes. Type 1 overwrites; Type 2 inserts new row with effective dates (full history); Type 3 keeps prior value in a separate column.
What is Apache Iceberg?
Open table format for analytic data on object storage. Adds ACID, schema evolution, partition evolution, time travel. The 2026 default for lakehouses.
How do you handle data quality in a pipeline?
Define expectations as code (Great Expectations, dbt tests, Soda). Run on every load. Alert on violations. Treat data quality like unit tests.
Land the Data Engineer Role
Free tools to plan your interview prep and certification roadmap.
