50 DevOps Engineer Interview Questions for 2026 (with Real Answers)
The exact questions DevOps hiring managers ask in 2026 — CI/CD, Docker, Kubernetes, Terraform, observability, and the production scenarios that separate seniors from juniors.

DevOps interviews in 2026 lean heavily on real production thinking, not just memorized definitions. Expect a mix of concept questions, "design this pipeline" whiteboarding, and post-incident scenarios where the interviewer wants to see how you actually debug. This list covers all three.
Tip: If you have an AWS DOP-C02, AZ-400, or HashiCorp Terraform Associate certification on your CV, your interviewer will assume you can answer the technical questions. The differentiator becomes the scenario questions (Q45-Q50).
How to Prepare
- Build a real pipeline. A GitHub repo with a Terraform-deployed EKS cluster, a Helm-deployed app, GitHub Actions CI, ArgoCD CD, Prometheus + Grafana monitoring — that one project covers 80% of interview ground.
- Know the tools you list on your CV cold. If your CV says Terraform, you must be able to explain state, locking, modules, workspaces, and how to handle a corrupted state file.
- Practice with our free DOP-C02 questions — scenario-style is exactly what interviews ask.
- Read recent post-mortems. Public incident write-ups (GitHub status page, Cloudflare blog, Slack engineering blog) are the best interview prep material money cannot buy.
DevOps Culture & CI/CD (Q1-Q10)
Q1. What is DevOps?
A culture and set of practices that shorten the feedback loop between writing code and learning whether it works in production. The DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore service, and, more recently, reliability) are the measurable outcomes.
Q2. Continuous Integration vs Continuous Delivery vs Continuous Deployment?
CI = every commit is merged to mainline and verified by automated tests. CDelivery = every passing commit produces a release artifact ready to deploy. CDeployment = every passing commit is automatically deployed to production. Most teams are at CDelivery; full CDeployment requires high test coverage and progressive delivery (feature flags, canaries).
Q3. What is a typical CI/CD pipeline?
Pull request → lint → unit tests → build (Docker image) → scan (Snyk, Trivy) → push to registry → deploy to staging → integration tests → manual approval → deploy to production with progressive rollout → post-deploy smoke tests + dashboards. Failures at any stage halt the pipeline.
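The "failures at any stage halt the pipeline" rule can be sketched in a few lines. This is only an illustration: the stage names and the lambda steps are hypothetical stand-ins for real linters, test runners, and scanners.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails."""
    log: List[str] = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # halt: later stages never run
    return log


# Hypothetical stages; real ones would shell out to actual tools.
stages: List[Stage] = [
    ("lint", lambda: True),
    ("unit-tests", lambda: True),
    ("image-scan", lambda: False),  # simulate a critical CVE finding
    ("deploy-staging", lambda: True),
]

for line in run_pipeline(stages):
    print(line)
```

Because the scan stage fails, the deploy stage is never attempted, which is the behavior an interviewer usually wants you to call out.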
Q4. What is GitOps?
Git is the single source of truth for both application code and infrastructure declarations. An agent (ArgoCD, Flux) continuously reconciles the cluster state with the git repo. Rollbacks are reverts in git. Pull-based deployment improves security: the cluster pulls changes; CI does not need cluster credentials.
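The reconcile step that a GitOps agent performs can be pictured as a diff between desired and actual state. The sketch below is a simplified model, not how ArgoCD or Flux are implemented; the resource names and the string "specs" are made up for illustration.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, str], actual: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the actions needed to make `actual` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))  # pruning of resources removed from git
    return actions


# Hypothetical state: git declares web v2 and a worker; the cluster runs web v1 and a stray cron.
desired = {"web": "image:v2", "worker": "image:v1"}
actual = {"web": "image:v1", "cron": "image:v1"}
print(reconcile(desired, actual))
```

A rollback in this model is simply a git revert: `desired` changes back, and the next reconcile emits the matching update.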
Q5. Trunk-based vs GitFlow branching?
Trunk-based: short-lived feature branches merged to main multiple times per day, behind feature flags. Pairs with CD. GitFlow: long-lived develop / release / hotfix branches with formal releases. Older pattern, falls apart at high commit volumes. Most modern teams use trunk-based.
Q6. What is a blue/green deployment?
Two identical production environments. Traffic shifts from blue (current) to green (new) via a load balancer switch. Rollback is reversing the switch. Higher infra cost (2x compute during the cutover) but instant rollback.
Q7. What is a canary deployment?
Roll out the new version to a small percentage of traffic and widen it in steps (for example 1% → 5% → 25% → 50% → 100%) while monitoring error rate and latency. Halt and roll back if metrics regress. Lower infra cost than blue/green and safer than a big-bang release.
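A canary gate boils down to comparing canary metrics against the baseline at each step. The following is a minimal sketch; the thresholds (0.5 percentage points of extra errors, 20% extra p99 latency) are assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def canary_decision(
    baseline: Metrics,
    canary: Metrics,
    max_error_delta: float = 0.005,   # assumed threshold
    max_latency_ratio: float = 1.2,   # assumed threshold
) -> str:
    """Return "promote" or "rollback" for one canary step."""
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


baseline = Metrics(error_rate=0.001, p99_latency_ms=200.0)
print(canary_decision(baseline, Metrics(0.002, 210.0)))
print(canary_decision(baseline, Metrics(0.020, 210.0)))
```

In practice a progressive-delivery controller would run a check like this at every traffic step before widening the rollout.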
Q8. What is a feature flag?
A runtime toggle that enables / disables code paths without redeploying. Decouples deploy from release: ship dark code to production, enable for 1% of users, observe, expand. Tools: LaunchDarkly, Unleash, GrowthBook, Flagsmith.
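The "enable for 1% of users" step is commonly done by hashing a stable user ID into a bucket. This is a generic sketch of that idea and does not reflect the internals of any of the tools named above; the flag name and user IDs are hypothetical.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically place a user in or out of a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < rollout_percent * 100


# Hypothetical flag and users.
users = [f"user-{i}" for i in range(1000)]
enabled = sum(is_enabled("new-checkout", u, 10) for u in users)
print(f"{enabled} of {len(users)} users see the new code path")
```

Because the bucket depends only on the flag and the user ID, a user who is in the 1% cohort stays in it as the percentage expands, so nobody flips back and forth between code paths.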
Q9. How do you handle secrets in CI/CD?
Never commit secrets. Use a secret manager (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, Doppler). In pipelines, authenticate to the cloud via OIDC federation (no long-lived credentials) and fetch secrets at runtime. Rotate regularly; scan repos with TruffleHog / Gitleaks.
Q10. What is the difference between artifact and image?
An artifact is any build output stored for deployment (jar, zip, tarball, helm chart). An image is specifically a container image. All container images are artifacts; not all artifacts are images.
Docker & Containers (Q11-Q17)
Q11. Difference between Docker and a VM?
VMs virtualize hardware: each VM runs its own kernel. Containers virtualize the OS: they share the host kernel and isolate via namespaces and cgroups. Containers start in seconds versus minutes for VMs, but offer weaker isolation.
Q12. What is in a Dockerfile?
Instructions to build an image: FROM (base image), RUN (build commands), COPY (files into image), ENV (env vars), WORKDIR, EXPOSE (documentation), ENTRYPOINT / CMD (default process). Order matters: each instruction is a build-cache step, and changing one invalidates the cache for every instruction after it, so put rarely changing steps first.
Q13. What is a multi-stage build?
A Dockerfile with multiple FROM statements. Earlier stages contain the build toolchain (compilers, package managers); the final stage copies only the build output into a minimal base image (distroless, alpine). Drastically reduces final image size and attack surface.
Q14. How do you reduce Docker image size?
Multi-stage builds, distroless or alpine base images, combine RUN commands to reduce layers, .dockerignore to skip unneeded files, remove apt / yum cache (rm -rf /var/lib/apt/lists/*), avoid installing build tools in the runtime stage. Separately, pin specific base-image tags rather than :latest so the size you measured stays reproducible.
Q15. What is the difference between CMD and ENTRYPOINT?
ENTRYPOINT sets the executable; CMD provides default arguments. If both are set, CMD args are appended to ENTRYPOINT. Use ENTRYPOINT for a fixed command (e.g. nginx) and CMD for overridable args.
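The interaction is easiest to remember as a small function. This sketch models the exec form only (JSON-array syntax) and ignores the `--entrypoint` override; the nginx arguments are just an example.

```python
from typing import List, Optional


def effective_command(
    entrypoint: List[str],
    cmd: List[str],
    run_args: Optional[List[str]] = None,
) -> List[str]:
    """Model which process a container starts (exec form only).

    Arguments given after the image name on `docker run` replace CMD;
    ENTRYPOINT stays in place either way.
    """
    args = run_args if run_args else cmd
    return entrypoint + args


entrypoint = ["nginx"]
cmd = ["-g", "daemon off;"]

print(effective_command(entrypoint, cmd))          # defaults from CMD
print(effective_command(entrypoint, cmd, ["-t"]))  # CMD replaced by run args
```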
Q16. What is a Docker volume vs bind mount?
Volumes are managed by Docker (named, persistent across container restarts, lifecycle independent). Bind mounts directly map a host path into the container — brittle, depends on host filesystem. Volumes are preferred for production data.
Q17. How do you scan images for vulnerabilities?
Trivy, Grype, Snyk, Anchore in CI pipelines. Fail the build on critical CVEs. Sign images and attach provenance attestations (Sigstore / cosign) for supply chain security. Use minimal base images so there is less to scan.
Kubernetes (Q18-Q26)
Q18. What is a Pod?
Smallest deployable unit in Kubernetes: one or more containers that share a network namespace (same IP and localhost) and can share volumes. Most pods have a single container; sidecars (e.g. log shipper, service mesh proxy) are added as additional containers in the same pod.
Q19. Deployment vs StatefulSet vs DaemonSet?
Deployment: stateless replica pods. StatefulSet: stateful pods with stable identity (pod-0, pod-1) and persistent storage. DaemonSet: one pod per node (used for log shippers, CNI plugins, node exporters).
Q20. What is a Service in Kubernetes?
Stable virtual IP and DNS name that load-balances across a set of pods (selected by labels). Types: ClusterIP (internal only), NodePort (exposed on each node's port), LoadBalancer (cloud LB), ExternalName (DNS CNAME).
Q21. What is an Ingress?
API object that defines Layer 7 routing rules to services (host/path-based). Requires an Ingress Controller (NGINX, Traefik, AWS ALB controller, Azure Application Gateway) to actually implement the rules. The Gateway API is the newer replacement — expect interviewers to ask about both in 2026.
Q22. What is a ConfigMap vs a Secret?
Both store key-value config and can be mounted as files or env vars. Secrets are base64-encoded by default (not encrypted unless you enable encryption at rest). For real secrets use a secret manager + External Secrets Operator or CSI driver.
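To make the "base64 is not encryption" point concrete, here is a short sketch using only the standard library. The secret value is obviously a placeholder.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears under `data:` in a Secret manifest."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Anyone who can read the manifest can reverse the encoding; no key is needed."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = encode_secret_value("s3cr3t")  # placeholder value
print(encoded)
print(decode_secret_value(encoded))
```

The round trip needs no key at all, which is why read access to Secret objects (or to etcd without encryption at rest) is equivalent to read access to the plaintext.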
Q23. What is Horizontal Pod Autoscaler (HPA)?
Scales pod replicas based on CPU, memory, or custom metrics. Different from Cluster Autoscaler (adds/removes nodes) and VPA (Vertical Pod Autoscaler, resizes pod resource requests). KEDA extends HPA with event-driven scaling (queue length, Kafka lag, cron).
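Interviewers sometimes ask how the HPA picks a replica count. The Kubernetes documentation gives the core formula as ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The sketch below implements only that core; it leaves out stabilization windows, scaling policies, and unready-pod handling, and the min/max bounds are example values.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,    # example bound
    max_replicas: int = 10,   # example bound
) -> int:
    """Core HPA calculation, without stabilization or scaling policies."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```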
Q24. What is a Helm chart?
Templated Kubernetes manifests packaged for reuse. values.yaml provides defaults; users override per environment. Chart releases are versioned and rollback-able. The standard packaging format for off-the-shelf applications (Prometheus, ArgoCD, ingress controllers).
Q25. How does a Pod reach a service in another namespace?
Use the fully qualified DNS name: service.namespace.svc.cluster.local. Within the same namespace, the short name works (just service). Verify NetworkPolicies do not block cross-namespace traffic.
Q26. What is a PodDisruptionBudget?
Specifies the minimum number (or percentage) of pods that must remain available during voluntary disruptions (node drains, evictions). Prevents node maintenance or cluster upgrades from evicting more pods than the application can tolerate. Note that a Deployment's own rolling update is governed by its maxUnavailable / maxSurge settings, not by the PDB.
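The arithmetic behind a PDB is simple enough to sketch. This assumes the `minAvailable` form and that percentages round up to a whole pod, which matches my reading of the Kubernetes docs; the function name is hypothetical.

```python
import math
from typing import Union


def allowed_disruptions(
    healthy_pods: int,
    expected_pods: int,
    min_available: Union[int, str],
) -> int:
    """How many pods may be evicted right now under a minAvailable budget."""
    if isinstance(min_available, str):
        percent = int(min_available.rstrip("%"))
        # Assumption: percentages round up to the next whole pod.
        required = math.ceil(expected_pods * percent / 100)
    else:
        required = min_available
    return max(0, healthy_pods - required)


print(allowed_disruptions(healthy_pods=5, expected_pods=5, min_available=3))
print(allowed_disruptions(healthy_pods=5, expected_pods=5, min_available="50%"))
print(allowed_disruptions(healthy_pods=3, expected_pods=5, min_available=3))
```

When the result is zero, a node drain blocks until pods become healthy again, which is the behavior you want during maintenance.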
Terraform & IaC (Q27-Q33)
Q27. What is Terraform state and where do you store it?
JSON file mapping declared resources to real infrastructure IDs. Never store locally for team use. Use a remote backend (S3 + DynamoDB locking, Terraform Cloud, Azure Storage with blob lease, GCS) with encryption at rest and access control.
Q28. What is state locking and why does it matter?
Prevents two engineers / pipelines from simultaneously modifying state and corrupting it. The S3 backend has traditionally paired with a DynamoDB table for locking (recent Terraform releases can also lock natively in S3); Terraform Cloud handles it automatically. Without locks, concurrent applies can overwrite each other's state.
Q29. How do you handle secrets in Terraform?
Never commit them to .tf files. Inject via TF_VAR_* env vars from a secret manager at pipeline runtime, or use the vault provider to fetch from HashiCorp Vault. State files contain secrets — encrypt the backend and restrict access.
Q30. Module vs workspace?
Module = reusable code unit (network, eks, rds-postgres). Workspace = separate state file (often used for environments: dev, staging, prod). Modern best practice is one git repo per stack with separate directories and separate backends per environment, rather than workspaces.
Q31. terraform plan vs apply vs refresh?
refresh: read real-world state and update the state file (the standalone command is deprecated in favor of `terraform apply -refresh-only`). plan: compute the diff between declared config and current state. apply: execute the diff. Always plan and review before apply in CI; never apply unreviewed.
Q32. What is drift?
When real infrastructure differs from what is in Terraform state — usually because someone made a manual change in the console. Detect with `terraform plan` showing unexpected diffs. Fix by reverting the manual change, importing it, or accepting and committing.
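Drift detection is often automated as a scheduled job that runs `terraform plan -detailed-exitcode`, where exit code 0 means no changes, 1 means an error, and 2 means a non-empty diff. A minimal sketch of that check follows; the function names and the working-directory argument are assumptions, and it presumes `terraform init` has already been run there.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"
    if code == 2:
        return "changes-pending"  # config change or drift: needs a human look
    return "error"


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and classify the result (requires terraform on PATH)."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(proc.returncode)


print(classify_plan_exit_code(2))
```

On a branch where the configuration has not changed, a "changes-pending" result is a strong hint that someone edited infrastructure by hand.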
Q33. Terraform vs Pulumi vs CloudFormation?
Terraform: HCL DSL, multi-cloud, biggest community. Pulumi: real programming languages (TS, Python, Go, C#), same providers. CloudFormation: AWS-only, native, slower releases but tighter integration. Choose Terraform for multi-cloud teams, Pulumi if your team prefers language abstractions.
Cloud (AWS / Azure) (Q34-Q39)
Q34. AWS IAM role vs IAM user?
Users are long-lived identities with permanent credentials — avoid for application use. Roles are assumed temporarily and provide short-lived credentials via STS. Modern best practice: zero IAM users for humans (use SSO / Identity Center) and zero IAM users for workloads (use IRSA / Pod Identity).
Q35. What is IRSA (IAM Roles for Service Accounts)?
EKS feature that maps a Kubernetes ServiceAccount to an IAM Role via OIDC. Pods using that SA receive short-lived AWS credentials automatically. Replaces the ugly "kube2iam" / node-role approach. Equivalent in Azure: AKS Workload Identity.
Q36. AWS VPC peering vs Transit Gateway?
VPC peering is point-to-point and not transitive (A↔B and B↔C does not give A↔C). Transit Gateway is a hub-and-spoke router connecting many VPCs and on-prem networks (via VPN or Direct Connect). Use TGW once you have more than a few VPCs.
Q37. How do you make EKS / AKS production-ready?
Private control plane endpoint, OIDC for workload identity, audit logs to a SIEM, network policies (Calico, Cilium), Pod Security Admission (restricted profile), image scanning + admission control (Kyverno, OPA Gatekeeper), node groups across multiple AZs, node autoscaling (Cluster Autoscaler or Karpenter), GitOps deployment (ArgoCD), Prometheus + alerts, regular DR drills.
Q38. ALB vs NLB?
Application Load Balancer is Layer 7 (HTTP/HTTPS) with content-based routing, host/path rules, OIDC auth, and WAF integration. Network Load Balancer is Layer 4 (TCP/UDP/TLS) with ultra-low latency and preserves source IP. Use ALB for HTTP workloads, NLB for non-HTTP or extreme throughput.
Q39. How do you reduce cloud costs?
Right-size (CPU and memory utilization < 30% → downsize); Reserved Instances / Savings Plans for steady workloads; Spot / Preemptible for batch; autoscale aggressively; lifecycle policies on object storage (Hot → Cool → Archive); kill unused resources (untagged, old snapshots); FinOps culture — chargeback / showback by team.
Observability & SRE (Q40-Q44)
Q40. What are the three pillars of observability?
Metrics (time-series, aggregated), Logs (discrete events, high cardinality), Traces (request flow across services). OpenTelemetry is the vendor-neutral standard for instrumenting all three. Some practitioners add a fourth: profiles (continuous CPU / memory profiling).
Q41. What is SLI, SLO, SLA?
SLI: indicator (latency p99, error rate, availability). SLO: target on the SLI (e.g. 99.9% successful requests over 30 days). SLA: contractual commitment to customers with consequences if missed. SLOs drive error budgets; SLAs drive credits and refunds.
Q42. What is an error budget?
The complement of an SLO. If SLO is 99.9% availability, error budget is 0.1% — about 43 minutes per month of downtime allowed. When the budget is exhausted, the team shifts focus from features to reliability work. Concrete operational mechanism for balancing speed vs stability.
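The 43-minute figure is worth being able to derive on a whiteboard. The sketch below works it out for a 30-day window and also shows a request-based view of how much budget is left; the request counts are invented for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


print(round(error_budget_minutes(0.999), 1))   # 99.9% over 30 days
print(round(error_budget_minutes(0.9999), 1))  # 99.99% over 30 days

# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Each extra nine cuts the budget by a factor of ten, which is why the choice between 99.9% and 99.99% is an engineering cost decision rather than a formality.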
Q43. Prometheus vs CloudWatch / Azure Monitor?
Prometheus is pull-based, dimensional, open-source, with PromQL queries. CloudWatch / Azure Monitor are managed cloud-native services. Most teams use cloud-native for infra metrics and Prometheus + Grafana for application metrics — or send Prometheus metrics into managed services (Amazon Managed Prometheus, Azure Managed Prometheus).
Q44. What is chaos engineering?
Deliberately injecting failure (terminate pods, throttle the network, take out an availability zone) to validate that your system handles it. Tools: Chaos Mesh, Litmus, AWS Fault Injection Service, Gremlin. Done in production once your team is mature; in staging until then.
Production Scenarios (Q45-Q50)
Q45. A deploy went bad. Latency p99 jumped 5x. What do you do?
Roll back first, debug second (this is the standard SRE answer). Activate the rollback procedure (helm rollback, ArgoCD revert, blue/green switch back). Once latency normalizes, pull the diff between the two versions, check error logs and traces from the bad window, file an incident ticket, and run a blameless post-mortem within a week.
Q46. The Kubernetes cluster keeps OOM-killing pods. Walk me through.
Run `kubectl describe pod` and look for OOMKilled as the container's last termination reason → check memory limit vs actual usage in metrics → profile the application (Java heap dump, Go pprof, Node.js --inspect) → either raise the limit if the workload genuinely needs more memory or fix the leak. If many pods are affected, check whether VPA recommendations make sense; if just one, treat it as a workload bug.
Q47. Terraform state got corrupted. How do you recover?
If the backend has versioning (S3 versioning, blob versioning) — restore the previous version. Otherwise, take the broken state offline, manually rebuild it by importing each resource (`terraform import`), or use `terraform state mv` / `terraform state rm` to surgically repair. Always have versioned backend storage. Always.
Q48. AWS bill jumped 40% last month. Where do you look?
Cost Explorer grouped by service to identify the top mover. Common culprits: orphaned EBS volumes, Elastic IPs left on stopped EC2 instances, idle NAT Gateways or heavy NAT data-processing traffic, S3 lifecycle not configured, log groups without retention, RDS / Aurora over-provisioned, NLB / ALB without traffic. Set up Cost Anomaly Detection so this is caught earlier next time.
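The "identify the top mover" step is a month-over-month diff per service. The sketch below runs it on made-up numbers (they add up to a 40% increase) rather than on real Cost Explorer output; in practice you would feed it an exported cost report.

```python
from typing import Dict, List, Tuple


def top_movers(
    previous: Dict[str, float],
    current: Dict[str, float],
    n: int = 3,
) -> List[Tuple[str, float]]:
    """Return the n services with the largest cost increase, biggest first."""
    services = set(previous) | set(current)
    deltas = [(s, current.get(s, 0.0) - previous.get(s, 0.0)) for s in services]
    deltas.sort(key=lambda item: (-item[1], item[0]))
    return deltas[:n]


# Hypothetical monthly spend in USD: 5,300 last month, 7,420 this month.
previous = {"EC2": 4000.0, "NAT Gateway": 300.0, "S3": 800.0, "CloudWatch": 200.0}
current = {"EC2": 4100.0, "NAT Gateway": 1900.0, "S3": 850.0, "CloudWatch": 570.0}

for service, delta in top_movers(previous, current, n=2):
    print(f"{service}: +${delta:,.0f}")
```

Sorting by absolute increase rather than percentage keeps the focus on the services that actually explain the bill.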
Q49. CI is slow. How do you speed it up?
Profile the pipeline and find the longest step. Common wins: cache dependencies (npm, pip, Maven, Docker layers); parallelize independent jobs; use larger runners or self-hosted runners; parallelize or shard tests with `pytest-xdist` / `jest --shard`; skip steps when only docs change; pre-built builder images. Aim for < 10 min PR feedback.
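If an interviewer pushes on test sharding, one common approach is to balance shards by recorded test duration: sort tests longest first and always give the next test to the lightest shard. The sketch below shows that greedy heuristic; it is not how `pytest-xdist` or `jest --shard` split work internally, and the test names and timings are invented.

```python
import heapq
from typing import Dict, List


def balance_shards(durations: Dict[str, float], total_shards: int) -> List[List[str]]:
    """Greedy longest-first assignment of tests to the least-loaded shard."""
    heap = [(0.0, i) for i in range(total_shards)]  # (current load, shard index)
    heapq.heapify(heap)
    shards: List[List[str]] = [[] for _ in range(total_shards)]
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for test, seconds in ordered:
        load, idx = heapq.heappop(heap)
        shards[idx].append(test)
        heapq.heappush(heap, (load + seconds, idx))
    return shards


# Hypothetical timings in seconds from a previous CI run.
durations = {
    "test_api": 120.0,
    "test_db": 90.0,
    "test_ui": 60.0,
    "test_auth": 45.0,
    "test_utils": 15.0,
}
for i, shard in enumerate(balance_shards(durations, total_shards=2)):
    print(i, shard, sum(durations[t] for t in shard))
```

With two shards the wall-clock time drops from 330 seconds to roughly the load of the heaviest shard, which is the number that matters for PR feedback.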
Q50. Tell me about a production incident you handled.
Always include: (a) the symptom, (b) how you identified the root cause (which signals you pulled, what you ruled out), (c) the fix, (d) the long-term remediation (better alerting, runbook authored, IaC adoption, test added). Bonus points for naming what you would do differently next time.
Frequently Asked Questions
What is the difference between continuous delivery and continuous deployment?
Continuous delivery means every commit produces a deploy-ready artifact; deploy is a manual decision. Continuous deployment means every passing commit deploys automatically.
What is the difference between Docker and Kubernetes?
Docker builds and runs containers on a single host. Kubernetes orchestrates containers across a cluster of hosts. They solve different problems.
What is Infrastructure as Code (IaC)?
Provisioning infrastructure through declarative configuration files in version control. Tools: Terraform, Pulumi, CloudFormation, Bicep, CDK. Benefits: repeatability, reviewability, drift detection.
What is the difference between a Deployment and a StatefulSet in Kubernetes?
Deployments manage stateless interchangeable pods. StatefulSets manage stateful pods with stable identity and persistent storage.
What is observability vs monitoring?
Monitoring checks known indicators against thresholds. Observability lets you ask new questions about your system from telemetry without shipping code.
