50 Site Reliability Engineer Interview Questions for 2026 (with Real Answers)
The exact questions SRE hiring managers ask in 2026 — SLO/SLI/error budgets, incident response, observability, on-call, automation, and the post-mortem stories that close the offer.

SRE interviews in 2026 are heavy on judgement, not just facts. Expect: concept questions on the Google SRE book vocabulary, debugging scenarios on a whiteboard or shared screen, a "tell me about a bad incident" round, and at least one design question on observability or capacity planning. This list covers all four.
Tip: Read at least the free chapters of Site Reliability Engineering and The Site Reliability Workbook before the interview. Interviewers will use the book's vocabulary (SLI, SLO, error budget, toil, post-mortem) and will notice if you do not.
How to Prepare
- Build a real production-style system. Even a small project with Prometheus + Grafana + Loki and working alerts counts, whether it lives in a GitHub repo or runs on a free Grafana Cloud account.
- Read the Google SRE book. Free at sre.google. The first 10 chapters are the interview vocabulary.
- Practice with our free AWS DevOps Professional questions — scenario style overlaps heavily with SRE interviews.
- Know one observability stack cold. Prometheus + Grafana + Tempo + Loki is the most asked; Datadog and Honeycomb appear in vendor-shop interviews.
- Bring two incident stories. One where you found the root cause, one where the post-mortem changed how the team operates.
Fundamentals (Q1-Q8)
Q1. What is Site Reliability Engineering?
SRE is what happens when you treat operations as a software problem. Originating at Google, it combines software engineering practices with operations responsibilities. Practitioners define SLOs, run on-call, automate toil, write code, run capacity planning, and lead incident response. "class SRE implements interface DevOps" — SRE is a prescriptive implementation of DevOps culture.
Q2. SRE vs DevOps?
DevOps is a broad culture for shortening the dev-to-prod feedback loop. SRE is a specific implementation with measurable reliability targets (SLOs), error budgets, and an explicit policy for what happens when the error budget is exhausted. DevOps does not require SLOs; SRE does.
Q3. SRE vs Platform Engineering?
SRE owns reliability of services. Platform Engineering builds internal developer platforms (IDP) — self-service tooling for product engineers (Backstage portals, golden paths, deployment templates). The roles overlap on tooling but differ in primary customer: SRE serves end users; Platform serves internal developers.
Q4. What is toil and why is it tracked?
Toil is manual, repetitive, automatable work with no enduring value, scaling linearly with service growth. Google SRE caps toil at 50% of an SRE's time so the rest is spent on engineering work that reduces future toil. Track toil because the alternative is silent burnout and stagnant tooling.
Q5. Toil vs overhead?
Toil is service-related operational work that could be automated (manual deploys, restarting a service, copy-paste runbook steps). Overhead is general work (meetings, performance reviews, recruiting). Both are necessary but tracked separately.
Q6. What are the four golden signals?
Latency, Traffic, Errors, Saturation. From the Google SRE book. A minimum set of signals for monitoring a user-facing service. Bigger teams extend with USE (Utilization, Saturation, Errors) for resources and RED (Rate, Errors, Duration) for requests.
Q7. What is a service degradation vs outage?
Outage: the service is unavailable to all users for a period. Degradation: the service is technically up but failing some requests, returning errors, or slower than the SLO threshold. SREs usually care more about degradations — they consume the error budget without triggering the dramatic incident response of an outage.
Q8. What is the concept of "embracing risk"?
Perfect reliability is the wrong target — it is expensive, slows feature velocity, and end users cannot tell the difference between 99.99% and 99.999% in most products. Embracing risk means deliberately setting an SLO that allows some failure budget, then using that budget for feature releases and experiments.
SLI / SLO / Error Budgets (Q9-Q15)
Q9. Define SLI, SLO, SLA.
SLI: indicator (latency p99, success rate, availability). SLO: target (99.9% over 30 days). SLA: external commitment with refunds attached. SLOs are internal engineering numbers; SLAs are sales/legal numbers. The SLA is normally set looser than the SLO (for example, a 99.5% SLA behind a 99.9% SLO) so the team has headroom before refunds apply.
Q10. What is an error budget?
If SLO is 99.9% over 30 days, the error budget is 0.1% — about 43 minutes of allowed bad events. Track burn rate. Below 100% budget consumed: feature work continues. Above 100%: feature freeze, reliability work prioritized.
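The arithmetic above can be sketched in a few lines. This is a minimal illustration, not a standard library: the function names are hypothetical, and treating exactly 100% consumed as "continue" is an assumption made to match the wording above.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed 'bad' time for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of the error budget used so far (1.0 means 100%)."""
    allowed_bad = (1.0 - slo) * total_events
    return bad_events / allowed_bad if allowed_bad else float("inf")


def release_policy(consumed: float) -> str:
    """Assumed policy: freeze only once consumption is above 100%."""
    return "feature freeze" if consumed > 1.0 else "feature work continues"


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))  # 30-day budget in minutes
    used = budget_consumed(bad_events=500, total_events=1_000_000, slo=0.999)
    print(f"{used:.0%} consumed -> {release_policy(used)}")
```

Counting events rather than minutes is usually the more honest measure for request-driven services, since an outage at peak traffic costs more budget than one at 4am.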
Q11. How do you pick a good SLO?
Start from user-visible behaviour, not infrastructure metrics. Ask: what does the user expect? Negotiate with product on the floor of acceptable performance. Set the initial SLO close to current performance so the team is not immediately in violation. Tighten over time as reliability improves. Measure it over a 30-day rolling window.
Q12. What is an SLI based on request success?
(successful requests / total requests) over a window. Senior answers also discuss: what counts as a "successful" request (HTTP 200, 2xx, 3xx, anything < 500?), what about long latency (slow = bad?), and how to handle missing data (treat as failure or exclude?).
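A small sketch makes those three decisions explicit as parameters. The `Request` type and the defaults are assumptions for illustration: here anything below 500 counts as success, slow requests optionally count as bad, and an empty window returns `None` so the caller has to decide how to treat missing data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency


def success_sli(
    requests: Sequence[Request],
    max_ok_status: int = 499,
    latency_threshold_ms: Optional[float] = None,
) -> Optional[float]:
    """Good requests / total requests; None when the window has no data."""
    if not requests:
        return None  # missing data: left to the caller to treat as failure or exclude
    good = 0
    for r in requests:
        ok = r.status <= max_ok_status
        if latency_threshold_ms is not None and r.latency_ms > latency_threshold_ms:
            ok = False  # too slow counts as a bad event
        good += ok
    return good / len(requests)


if __name__ == "__main__":
    window = [Request(200, 50), Request(404, 20), Request(500, 10), Request(200, 900)]
    print(success_sli(window))
    print(success_sli(window, latency_threshold_ms=500))
```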
Q13. Burn rate alerts — what are they?
Alerts based on the rate at which you are consuming error budget, not absolute thresholds. A fast burn (for example, 2% of the 30-day budget consumed in 1 hour, a burn rate of 14.4) pages immediately; a slow burn (for example, 10% of the budget consumed over 3 days, a burn rate of 1) creates a ticket. Use multi-window multi-burn-rate alerts; The Site Reliability Workbook documents recommended thresholds.
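As a rough sketch of the multi-window idea: burn rate is the observed error ratio divided by the error ratio the SLO allows, and an alert fires only when both a long and a short window exceed the threshold. The window names and the thresholds (14.4, 6, 1) are illustrative values for a 30-day SLO and should be treated as assumptions to tune; the `classify` function is hypothetical, not part of any alerting tool.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def classify(error_ratios: dict, slo: float) -> str:
    """error_ratios maps a window name (e.g. '1h') to the observed error ratio."""

    def burning(long_window: str, short_window: str, threshold: float) -> bool:
        # Both windows must agree: the long one shows sustained burn,
        # the short one shows the problem is still happening right now.
        return (
            burn_rate(error_ratios[long_window], slo) >= threshold
            and burn_rate(error_ratios[short_window], slo) >= threshold
        )

    if burning("1h", "5m", 14.4) or burning("6h", "30m", 6.0):
        return "page"
    if burning("3d", "6h", 1.0):
        return "ticket"
    return "ok"


if __name__ == "__main__":
    observed = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.002, "3d": 0.0005}
    print(classify(observed, slo=0.999))
```

Requiring the short window as well is what lets the alert reset quickly once the issue is mitigated, instead of paging for an hour after the fix.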
Q14. SLOs for batch jobs?
Pick from: success ratio (jobs that complete vs total), data freshness (max staleness), correctness (data quality checks). Latency SLOs do not work for batch — replace with deadline SLOs (job completes by hour H).
Q15. What is a 99.9% availability SLO in real numbers?
99.9% over 30 days = 43 minutes 12 seconds of allowed downtime (about 43 minutes 50 seconds over an average calendar month), or roughly 8 hours 46 minutes per year. 99.99% = 4 minutes 19 seconds per 30 days. 99.999% (five nines) = about 26 seconds per 30 days, which is almost impossible to hit without active-active multi-region.
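These figures are easy to derive yourself, and the result depends on the window you assume. With an exact 30-day window, 99.9% allows 43 minutes 12 seconds; figures close to 43 minutes 50 seconds come from using an average calendar month of about 30.44 days. The helper names below are hypothetical.

```python
def allowed_downtime_seconds(slo_percent: float, window_days: float = 30) -> float:
    """Seconds of downtime an availability target permits over the window."""
    return (100.0 - slo_percent) / 100.0 * window_days * 86400


def fmt(seconds: float) -> str:
    """Format a duration as hours, minutes, seconds (rounded to whole seconds)."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(target, fmt(allowed_downtime_seconds(target)))
    print("99.9 per 365-day year:", fmt(allowed_downtime_seconds(99.9, 365)))
```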
Observability (Q16-Q22)
Q16. Metrics vs logs vs traces?
Metrics: time-series, aggregated, cheap, low cardinality. Logs: discrete events, structured or unstructured, expensive at scale, full detail. Traces: request flow across services, parent/child span hierarchy, exposes latency contributors. Modern observability includes a fourth: profiles (continuous CPU/memory profiles).
Q17. Observability vs monitoring?
Monitoring is checking known indicators against thresholds. Observability is the ability to ask new questions about your system from telemetry without shipping new code. Observability needs high cardinality, structured data, correlation between metrics/logs/traces. Monitoring catches known unknowns; observability lets you debug unknown unknowns.
Q18. What is OpenTelemetry?
Vendor-neutral standard for instrumenting metrics, logs, and traces. Includes SDKs for major languages and the Collector (an agent that receives, processes, and exports telemetry to any backend). Replaces OpenTracing and OpenCensus. The 2026 default for new instrumentation.
Q19. PromQL — show me a query for request rate.
sum(rate(http_requests_total{job="api"}[5m])) by (status). Senior answers also know histogram_quantile for percentiles, label_replace for relabelling, and recording rules for pre-aggregating expensive queries.
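It helps to be able to explain what `rate()` is doing to a counter. The sketch below is a simplified approximation in Python, not Prometheus's actual implementation: it sums the increases between samples, treats any drop as a counter reset, and divides by the elapsed time. It deliberately ignores the extrapolation to window boundaries that Prometheus performs. The function name `simple_rate` is hypothetical.

```python
from typing import List, Optional, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second increase of a counter.

    samples: (timestamp_seconds, counter_value) pairs, oldest first.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # Counter went backwards: assume the process restarted from 0.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    print(simple_rate([(0, 100), (60, 160), (120, 220)]))
```

This is also why `rate()` should only be applied to counters: on a gauge, every decrease would be misread as a reset.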
Q20. What is high cardinality and why does it matter?
Cardinality = number of unique combinations of label values. High cardinality (e.g. user ID as a label) explodes the index size in Prometheus and similar systems and degrades performance. Logs and traces handle high cardinality; metrics generally should not. Push high-cardinality data into traces, not metric labels.
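A back-of-the-envelope estimate shows why one unbounded label is enough to cause trouble. The worst-case number of time series for a single metric is the product of the distinct values of each label; the label names and counts below are made up for illustration.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    labels = {"status": 5, "method": 4, "endpoint": 20}
    print(max_series(labels))                          # bounded labels only
    print(max_series({**labels, "user_id": 100_000}))  # one unbounded label added
```

In practice not every combination occurs, so this is an upper bound, but it is the number to quote when arguing against putting an ID in a label.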
Q21. Pull vs push monitoring?
Pull (Prometheus): scraper pulls metrics on a schedule. Good for service discovery, target health visibility. Push (StatsD, InfluxDB line protocol, OTLP): client pushes metrics. Required for short-lived workloads (batch jobs, Lambda). Prometheus supports push via Pushgateway for these cases.
Q22. What are RED, USE, and the four golden signals?
RED (request-oriented): Rate, Errors, Duration. USE (resource-oriented): Utilization, Saturation, Errors. Four golden signals (Google): Latency, Traffic, Errors, Saturation. They overlap. Pick one as the default for new services and stay consistent.
Incident Response (Q23-Q29)
Q23. Walk me through your incident response process.
(1) Detect — alert fires. (2) Acknowledge in the on-call tool, declare incident severity. (3) Assemble responders: Incident Commander, Subject Matter Experts, Comms lead. (4) Mitigate first, root-cause second (rollback, traffic shift, feature flag off). (5) Communicate status updates every 30 min on the customer status page. (6) Resolve and document. (7) Blameless post-mortem within 5 business days.
Q24. What is a blameless post-mortem?
A post-mortem that focuses on system / process failures rather than individual blame. Assumes humans made reasonable decisions given the information they had. Outputs: timeline, contributing factors (no single root cause — usually multiple), what worked, what did not, and action items with owners.
Q25. Incident Commander role — what do they do?
Owns the response, not the fix. Assigns roles, makes coordination decisions, decides when to declare resolution, prevents tunnel vision. The IC should not be debugging — that is the SME's job. In small companies the IC is whoever happens to be on-call; in mature orgs there is a rotating IC role separate from the on-call engineer.
Q26. What are the severity levels and how do you pick?
Common scheme: Sev1 (full outage, customer-visible, all-hands), Sev2 (major degradation, paging on-call), Sev3 (partial degradation, business-hours response), Sev4 (no customer impact, ticket). Define triggers explicitly — "10% of error budget burned in 1 hour" → Sev2.
Q27. What does "MTTR" mean and is it useful?
Mean Time To Recover (or Repair, or Resolve). Useful as a high-level trend indicator but easy to game (close incidents fast, ignore root cause). Senior answers point out that MTTR alone is misleading — combine with incident count, post-mortem action item completion rate, and toil tracking.
Q28. What is a paging vs ticketing alert?
Paging: wake someone up — reserved for customer-impacting issues requiring immediate action. Ticketing: queue for business-hours work. Tune alerts so paging is rare and 100% actionable. Pages that turn out to be noise destroy on-call morale and lead to alert fatigue.
Q29. How do you reduce alert fatigue?
Audit existing alerts, remove or retune anything not 100% actionable. Switch from threshold alerts ("CPU > 80%") to SLO-burn-rate alerts (focused on user impact). Group related alerts. Add silence windows for deploys. Quarterly on-call review meeting where the on-call team votes on noisy alerts.
Automation & IaC (Q30-Q35)
Q30. What is the role of automation in SRE?
Eliminate toil. Specifically: codify runbooks into scripts, then scripts into operators / controllers; eliminate manual deploys; auto-remediate common alerts; provision infrastructure as code; standardize service-level concerns (logging, tracing, deploys) in platform tooling.
Q31. Terraform state — where do you store it?
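Remote backend with locking. Common: S3 for AWS (locked with a DynamoDB table or, in recent Terraform versions, S3-native lock files), Azure Storage with blob lease, GCS with versioning, or HCP Terraform (formerly Terraform Cloud). Never store locally for team use. Encrypt at rest. Treat state as sensitive because it can contain secrets in plain text.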
Q32. Kubernetes operator vs controller?
All operators are controllers, but not all controllers are operators. A controller watches resources and reconciles desired vs actual state (e.g. ReplicaSet controller). An operator is a controller that codifies operational knowledge for a specific application (e.g. Postgres operator handles failover, backups, upgrades).
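The reconcile step at the heart of any controller can be sketched without Kubernetes at all. This is a pure-Python illustration and not the Kubernetes client API: the `reconcile` function and the action tuples are hypothetical. An operator would add application-specific logic (failover, backups, upgrades) on top of the same loop.

```python
def reconcile(desired_replicas: int, running: list) -> list:
    """Compare desired vs actual state and return the actions that close the gap.

    running: names of the pods that currently exist.
    Returns a list of ("create", None) or ("delete", pod_name) tuples.
    """
    actual = len(running)
    if actual < desired_replicas:
        return [("create", None)] * (desired_replicas - actual)
    if actual > desired_replicas:
        # Keep the first N pods, remove the surplus.
        return [("delete", name) for name in running[desired_replicas:]]
    return []  # already converged: nothing to do


if __name__ == "__main__":
    print(reconcile(3, ["web-a"]))
    print(reconcile(1, ["web-a", "web-b", "web-c"]))
```

The important property is that the function is level-triggered: it looks only at current state, so running it twice in a row is harmless.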
Q33. What is GitOps?
Git is the single source of truth for both code and infrastructure declarations. An agent (ArgoCD, Flux) continuously reconciles the cluster state with the git repo. Rollbacks are reverts in git. Improves security (pull-based, no CI cluster credentials needed) and auditability.
Q34. Blue/green vs canary vs rolling?
Blue/green: full duplicate environment, traffic flips. Canary: gradual percentage shift with metric gates. Rolling: replace pods one-by-one within the same environment. Use canary as the default; blue/green for stateless services where double infra cost is acceptable.
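A metric gate for a canary rollout might look like the sketch below. The step percentages, the 1.5x tolerance, and the function name are all assumptions for illustration; real rollout tools typically compare several metrics over a soak period rather than a single error rate.

```python
def next_canary_step(
    current_percent: int,
    canary_error_rate: float,
    baseline_error_rate: float,
    steps=(5, 25, 50, 100),
    max_ratio: float = 1.5,
) -> int:
    """Return the next traffic percentage for the canary, or 0 to roll back."""
    # Gate: roll back if the canary is clearly worse than the stable version.
    # With a baseline of 0, any canary error at all triggers a rollback.
    if canary_error_rate > baseline_error_rate * max_ratio:
        return 0
    for step in steps:
        if step > current_percent:
            return step
    return 100  # already at full traffic


if __name__ == "__main__":
    print(next_canary_step(5, canary_error_rate=0.01, baseline_error_rate=0.01))
    print(next_canary_step(25, canary_error_rate=0.05, baseline_error_rate=0.01))
```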
Q35. What is chaos engineering?
Deliberately injecting failure (terminate pods, kill an AZ, throttle network) to validate the system handles it. Tools: Chaos Mesh, Litmus, AWS Fault Injection Service, Gremlin. Practice in staging until your team is mature; in production once you trust the experiments not to amplify failure.
On-Call & Culture (Q36-Q40)
Q36. What does a healthy on-call rotation look like?
6+ people in the rotation; 1 week shifts; primary + secondary tiers; less than 2 pages per shift on average; pages 100% actionable; compensation for being on-call (extra pay or time off); explicit handoff at start/end of shift. Anything worse than this and the team will burn out.
Q37. How do you handle being woken at 3am?
Run the playbook: acknowledge, check the runbook for the alert, follow the steps. If the alert was not actionable or had no runbook, file a follow-up to fix the alert or write the runbook. Do not attempt heroic debugging at 3am. If the situation is unclear, mitigate (rollback, traffic shift) and continue the investigation in the morning with a clear head.
Q38. What is psychological safety in SRE?
The team feels safe raising concerns, admitting mistakes, and asking "stupid" questions. Critical for blameless post-mortems — without psychological safety, post-mortems become defensive theatre and root causes hide. Build it through explicit norms and consistent leadership behaviour.
Q39. How do you stay current as an SRE in 2026?
SREcon talks (free on YouTube), back issues of Increment magazine (no longer publishing), public incident write-ups (GitHub, Cloudflare, Slack engineering blogs), the SRE Weekly newsletter, the CNCF landscape page. Public post-mortems are the highest-density learning material in the field.
Q40. SRE certifications — do they matter?
Less than experience and stories, but they help with first-round screens. Strong options: Google Cloud SRE Foundations, AWS DevOps Engineer Professional (DOP-C02), Certified Kubernetes Administrator (CKA), HashiCorp Terraform Associate. See our AIOps & SRE certifications guide.
Production Scenarios (Q41-Q50)
Q41. The latency p99 just jumped 5x. Walk me through.
Acknowledge the alert. Check the dashboard — is it traffic-correlated, deploy-correlated, or independent? If a deploy was in the last 30 min, roll back first. Check downstream dependency latency. Check resource saturation (CPU, memory, IO, network). Look at trace samples for the slow segment. If no obvious cause, scale out and continue investigation; never let p99 burn the error budget while you debug.
Q42. Tell me about an incident you handled.
The most important question of the interview. Use STAR. Cover: detection (how you knew), early hypothesis vs actual root cause, decision to mitigate before root-causing, communication (status page, internal Slack), resolution, post-mortem action items, what changed in the system as a result. Bonus: what you would do differently.
Q43. The on-call team is burning out. What do you do?
Quantify: pages per shift, false-positive rate, hours of toil, work-life impact. Audit the top 10 noisiest alerts, retune or remove. Add SLO-burn-rate alerts to replace threshold alerts. Investigate root cause of repeated pages (often a single flapping service). Discuss compensation. Escalate to manager / leadership if the load is structural.
Q44. The error budget is fully consumed mid-month. What do you do?
Stop feature releases. Prioritize reliability work for the remainder of the cycle. This is the entire point of error budgets — an automatic, depoliticized trigger to shift focus. Communicate clearly to product about why. Resume feature releases when the next budget window opens.
Q45. Capacity planning — walk me through.
Project growth from current usage and product roadmap. Convert to resource demands (CPU, memory, network, DB connections, storage). Add headroom (typically 30-50% above forecasted peak). Verify against current capacity. Identify lead time for adding capacity (instance types, DB upgrades, vendor commits). Re-run quarterly.
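The forecast-plus-headroom step reduces to a short calculation. The sketch assumes compound monthly growth and a known per-instance throughput, both of which are simplifications; the parameter names and example numbers are hypothetical.

```python
import math


def instances_needed(
    current_peak_rps: float,
    monthly_growth: float,
    months: int,
    headroom: float,
    rps_per_instance: float,
) -> int:
    """Instances required to serve the forecast peak plus headroom."""
    forecast_peak = current_peak_rps * (1 + monthly_growth) ** months
    required_rps = forecast_peak * (1 + headroom)
    return math.ceil(required_rps / rps_per_instance)


if __name__ == "__main__":
    # 1000 rps today, 5% monthly growth, 12 months out, 30% headroom,
    # each instance load-tested at 200 rps.
    print(instances_needed(1000, 0.05, 12, 0.30, 200))
```

The per-instance figure should come from a load test, not a spec sheet, and the result should be compared against the lead time for actually obtaining that capacity.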
Q46. The database is the bottleneck. How do you scale?
Diagnose first: slow query log, connection pool saturation, replication lag. Optimize queries and indexes. Add read replicas for read-heavy workloads. Cache hot reads (Redis / Memcached). Move write-heavy tables to a more horizontal store (Cassandra, DynamoDB) if the access pattern fits. Last resort: shard.
Q47. How do you debug a "service is slow but no errors" situation?
Pull trace samples for slow requests, compare to fast requests. Look at which spans are different. Check downstream dependency latency. Check resource saturation. Check thread / connection pool usage. Check garbage collection (JVM, Go). Check cache hit rate. If everything looks normal, look for noisy neighbour on shared infrastructure.
Q48. Design observability for a new microservice.
OpenTelemetry SDK in the service for metrics, logs, and traces. Send to OTel Collector deployed as a DaemonSet / sidecar. Collector exports metrics to Prometheus, traces to Tempo / Jaeger, logs to Loki / Elastic. Grafana dashboards for the four golden signals. SLO definitions in code (Prometheus recording rules or Sloth). Alerts on burn rate. Runbook linked from every alert.
Q49. The post-mortem identified a missing runbook. What is your follow-up?
Action item with owner and due date. Write the runbook within 2 weeks, link it from the alert annotation. Run a tabletop exercise / game day to test the runbook. Schedule a recurring review (quarterly) to keep runbooks current. Track runbook coverage as a team metric.
Q50. How do you make the system more reliable without raising the cost?
Reduce blast radius (smaller deploys, feature flags, gradual rollouts). Test rollback paths in staging. Add circuit breakers between services. Cache more aggressively. Use multi-AZ before multi-region. Audit which features actually justify their reliability cost. Sometimes the cheapest reliability improvement is killing a flaky, low-value feature.
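Of the techniques above, the circuit breaker is the one interviewers most often ask candidates to sketch. The version below is a minimal single-threaded illustration, assuming a consecutive-failure threshold and a fixed reset timeout; the class and parameter names are hypothetical, and a production breaker would also need thread safety and metrics.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures; retry one trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0           # consecutive failures seen
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let this one call through as a trial.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each downstream client call in `breaker.call(...)` keeps a failing dependency from tying up threads and connection pools in the caller, which is what turns a partial failure into a full outage.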
Frequently Asked Questions
What is the difference between SRE and DevOps?
DevOps is a culture for shortening dev-to-prod feedback. SRE is a specific implementation with measurable reliability targets (SLOs) and error budgets. "SRE implements DevOps."
What is an SLI, SLO, and SLA?
SLI: indicator (latency, success rate). SLO: target (99.9%). SLA: external contractual commitment with refunds.
What is an error budget?
Complement of the SLO. 99.9% over 30 days = 43 minutes of allowed bad events. When consumed, feature work pauses for reliability work.
What are the four golden signals?
Latency, Traffic, Errors, Saturation. From the Google SRE book.
What is toil and why is it tracked?
Manual, repetitive, automatable work that scales linearly. Google SRE caps toil at 50% of an SRE's time.
