What is the difference between continuous integration, continuous delivery, and continuous deployment, and what does each one require from a team?

CI: every commit is integrated into the trunk and automatically validated with tests. Continuous delivery: the artifact is always ready to deploy to production, but the actual deploy is a manual decision. Continuous deployment: every commit that passes the tests ships to production automatically without human intervention. The required organizational maturity increases at each stage; many teams reach continuous delivery and deliberately stop short of continuous deployment for business rather than technical reasons.

When would you choose a blue-green deployment over a canary deployment, and what risks does each one mitigate?

Blue-green: two identical environments, with traffic switched entirely from one to the other. Rollback is immediate by simply redirecting traffic back. The cost is running double the infrastructure during the deploy window. Canary: traffic is shifted gradually to the new deployment, with monitoring between each step. Problems are detected against a small percentage of users before affecting everyone. More complex to implement but safer for high-traffic systems.
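
The canary flow described above can be sketched roughly as follows. This is a minimal illustration, not a production rollout controller: the function name run_canary, the step percentages, the error-rate threshold, and the two metric functions are all assumptions made up for the example.

```python
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    observe_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> tuple[str, int]:
    """Shift traffic step by step; roll back if the error rate exceeds the limit.

    Returns (outcome, final traffic percentage on the new version).
    """
    for percent in steps:
        # A real system would update a load balancer or mesh route here,
        # then wait for metrics to settle before reading them.
        error_rate = observe_error_rate(percent)
        if error_rate > max_error_rate:
            return ("rolled_back", 0)
    return ("promoted", steps[-1])


# Hypothetical metric sources, used only for illustration.
def healthy(percent: int) -> float:
    return 0.001


def degrades_under_load(percent: int) -> float:
    return 0.001 if percent < 25 else 0.08


print(run_canary([5, 25, 50, 100], healthy, max_error_rate=0.01))
print(run_canary([5, 25, 50, 100], degrades_under_load, max_error_rate=0.01))
```

In the second call the problem shows up at the 25% step, so most users never see the faulty version. A blue-green switch would correspond to a single step of 100.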

What security considerations would you apply when designing a CI/CD pipeline?

Never hardcode secrets in the codebase or pipeline configuration. Use an external secrets manager. Scope permissions at each pipeline stage to the minimum required. Scan dependencies and code for vulnerabilities on every build. Sign artifacts to guarantee their integrity. Audit who can modify the pipeline configuration — it is a critical attack surface.

What is infrastructure as code and what problems does it solve compared to manual infrastructure management?

IaC describes infrastructure in versioned, reproducible configuration files. It solves: environments drifting from each other due to undocumented manual changes, inability to audit what changed and who changed it, difficulty recreating an environment from scratch, and slow provisioning processes prone to human error. The state of your infrastructure becomes reviewable code subject to the same process as application code.

How would you define CPU and memory resources for a container in Kubernetes, and why does it matter?

Kubernetes distinguishes between requests (the amount reserved for the container, which the scheduler uses for placement decisions) and limits (the maximum a container can consume: CPU usage above the limit is throttled, while exceeding the memory limit gets the container OOM-killed). Without requests, the scheduler cannot make informed placement decisions. Without limits, a single container can exhaust a node's resources and affect other pods. Values should be based on actual profiling of the application, not rough estimates.
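
As a rough illustration, the resources section of a container spec can be written as a Python dictionary (kubectl also accepts JSON manifests). The container name, image, and values below are placeholders, and the two parsing helpers are simplified assumptions that only handle the common "m", "Ki", "Mi", and "Gi" suffixes.

```python
import json


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("250m" or "1") to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a binary-suffixed memory quantity ("256Mi", "1Gi") to bytes."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


# Container spec fragment; name, image, and values are placeholders.
container = {
    "name": "api",
    "image": "example/api:1.0.0",
    "resources": {
        "requests": {"cpu": "250m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"},
    },
}

requested = container["resources"]["requests"]
limited = container["resources"]["limits"]
# A request larger than its limit is rejected by the API server, so check early.
assert parse_cpu(requested["cpu"]) <= parse_cpu(limited["cpu"])
assert parse_memory(requested["memory"]) <= parse_memory(limited["memory"])
print(json.dumps(container, indent=2))
```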

What strategy would you follow to manage different configurations across environments (dev, staging, prod) without duplicating infrastructure code?

Parameterize IaC modules with per-environment variables. In Terraform, use workspaces or environment-specific directories with tfvars files. In Kubernetes, use Helm charts with per-environment values files or Kustomize with overlays. The infrastructure code is the same; only the values differ. Never hardcode environment-specific values inside reusable modules.
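
The "same code, different values" idea can be sketched as a base configuration plus per-environment overlays. This is only an analogy for what tfvars files, Helm values files, or Kustomize overlays do, not how any of those tools is implemented; apply_overlay and the keys shown are hypothetical.

```python
from copy import deepcopy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a copy of base with overlay values merged in recursively."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Shared defaults live in one place; each environment states only its differences.
base = {
    "replicas": 1,
    "resources": {"cpu": "250m", "memory": "256Mi"},
    "log_level": "info",
}
overlays = {
    "dev": {"log_level": "debug"},
    "prod": {"replicas": 3, "resources": {"cpu": "1"}},
}

prod = apply_overlay(base, overlays["prod"])
print(prod)
```

Here prod keeps the base memory value and log level while overriding only the replica count and CPU, and base itself is left unmodified.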

What is an SLO and how would you use it to drive operational decisions?

An SLO (Service Level Objective) is a measurable reliability target — for example, 99.9% of requests must respond in under 200ms. The error budget is the allowed margin of non-compliance. When the error budget is exhausted, the team prioritizes reliability over new features. When there is budget remaining, the team can accept more deployment risk. It makes the trade-off between development velocity and system stability explicit and data-driven.
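
To make the error budget concrete, here is a small worked calculation. It assumes a 30-day window and a request-based SLO; the function names and the traffic numbers are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total minutes of non-compliance the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


print(f"{error_budget_minutes(0.999):.1f} minutes per 30 days")  # 43.2 minutes per 30 days
print(f"{budget_remaining(0.999, 1_000_000, 400):.0%} of budget left")  # 60% of budget left
```

With 60% of the budget left, the team has room to take deployment risk. A negative value would be the signal to shift effort toward reliability work.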

How would you approach reducing alert noise in a system where the team has stopped paying attention to notifications?

Audit all existing alerts and classify them as: actionable (requires immediate intervention), informational (no action needed), or noisy (fires frequently without real consequences). Eliminate or downgrade the latter two categories. Every alert should have an associated runbook that explains exactly what to do. The goal is that every alert that reaches the on-call engineer is meaningful and has a clear, documented response.

How would you diagnose a pod stuck in CrashLoopBackOff in Kubernetes, and what are the most common causes?

Start with kubectl describe pod to review events and the exit code. Then kubectl logs --previous to see the logs from the last failed attempt. Common causes: application misconfiguration (missing or incorrect environment variables), an overly aggressive liveness probe killing the pod before it finishes initializing, resource exhaustion (OOMKilled with exit code 137), or unavailable external dependencies (database, secrets) at startup time.
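
Exit codes above 128 conventionally mean the process was terminated by a signal (code minus 128). The helper below is a small sketch of that decoding; the function name and the short signal table are assumptions for illustration.

```python
# Small lookup table; only a few common signals are listed.
SIGNALS = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Give a first-pass interpretation of a container exit code."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signal_number = code - 128
        name = SIGNALS.get(signal_number, f"signal {signal_number}")
        return f"terminated by {name}"
    return f"application error (exit code {code})"


for code in (0, 1, 137, 143):
    print(code, "->", explain_exit_code(code))
```

Note that 137 only tells you the process received SIGKILL. Whether the cause was the memory limit is confirmed by the OOMKilled reason in the kubectl describe pod output.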

How would you manage Terraform state in a team with multiple engineers working on the same infrastructure?

Store state in a shared remote backend (S3 + DynamoDB for locking on AWS, GCS on GCP, or Terraform Cloud). State locking prevents two engineers from applying changes simultaneously and corrupting the state file. Separate state by environment and by infrastructure layer (networking, compute, data) to limit the blast radius of a mistake. Review terraform plan output in CI before manually applying to production.

What is the difference between a Deployment and a StatefulSet in Kubernetes, and when would you use each?

Deployment: for stateless applications where pods are interchangeable. Kubernetes can freely create, reschedule, and delete them. StatefulSet: for applications that require a stable identity (predictable hostname), per-instance persistent storage, and deterministic startup and shutdown ordering. Used for databases, message brokers, or any service that maintains local state. Each pod gets its own PersistentVolumeClaim (via volumeClaimTemplates) that outlives the pod itself.

How would you implement distributed tracing in a microservices system to correlate requests across service boundaries?

Use OpenTelemetry as the instrumentation standard across every service. Each request generates a trace ID that is propagated in HTTP headers between services. Spans represent individual operations with duration and metadata. Traces are sent to a backend such as Jaeger, Tempo, or Datadog. The correlation enables you to see the complete path of a slow or failing request across multiple services, pinpointing exactly where the problem occurred.
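
The propagation step can be illustrated with the W3C Trace Context traceparent header, which has the form version-traceid-parentid-flags. In practice an OpenTelemetry SDK handles this for you; the hand-rolled helpers below (new_traceparent and child_traceparent are made-up names) only show what travels between services.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID, sampled."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for the outgoing call."""
    match = TRACEPARENT.match(incoming)
    if match is None:
        return new_traceparent()  # malformed header: start a fresh trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A starts the trace; service B continues it on its own outgoing request.
header_from_a = new_traceparent()
header_from_b = child_traceparent(header_from_a)
print(header_from_a)
print(header_from_b)
```

Both headers share the same trace ID, which is what lets the backend stitch the spans from different services into one trace.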

How would you design the networking strategy in a Kubernetes cluster to isolate workloads with different security requirements?

Use namespaces to group workloads by team or sensitivity level. Apply NetworkPolicies to control pod-to-pod traffic: default deny-all, then explicitly allow only the required communication paths. For stronger isolation, evaluate a service mesh such as Istio, or a CNI with mesh capabilities such as Cilium, for mutual TLS between services and more granular policies. Isolate critical workloads on dedicated node pools using taints and tolerations.

How would you ensure a Terraform infrastructure change has no unintended effects before applying it to production?

Run terraform plan in CI and review the output before approving. Use terraform plan -detailed-exitcode so the pipeline can tell outcomes apart: exit code 0 means no changes, 1 means an error, and 2 means the plan contains changes. This lets you fail a job, such as a scheduled drift check, when changes appear where none are expected. Implement governance policies with Sentinel or OPA to automatically reject plans that violate rules (e.g., instances without required tags, resources without encryption). For high-risk changes, apply to staging first and compare behavior before applying to production.
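
As an illustration of what such a policy checks, the sketch below inspects the JSON form of a plan (as produced by terraform show -json on a saved plan file) and flags resources missing required tags. It is plain Python standing in for a Sentinel or OPA rule, not an example of either tool; REQUIRED_TAGS, the function name, and the trimmed sample plan are assumptions.

```python
REQUIRED_TAGS = {"owner", "environment"}  # hypothetical organization policy


def missing_tag_violations(plan: dict) -> list[str]:
    """List resources being created or updated that lack a required tag."""
    violations = []
    for resource in plan.get("resource_changes", []):
        change = resource.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue
        after = change.get("after") or {}
        if "tags" not in after:
            continue  # this resource type has no tags attribute
        tags = after["tags"] or {}
        missing = REQUIRED_TAGS - tags.keys()
        if missing:
            violations.append(f"{resource['address']}: missing tags {sorted(missing)}")
    return violations


# Trimmed, hand-written sample in the shape of the plan JSON.
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web",
         "change": {"actions": ["create"], "after": {"tags": {"owner": "platform"}}}},
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["no-op"], "after": {"tags": None}}},
    ]
}
print(missing_tag_violations(sample_plan))
```

A CI job could fail the pipeline whenever the returned list is non-empty.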

What strategies would you use to reduce cloud infrastructure costs without degrading performance or availability?

Use spot or preemptible instances for interruption-tolerant workloads (queue workers, CI jobs). Implement autoscaling driven by real metrics to avoid paying for idle capacity. Find underutilized resources with cost management tooling such as AWS Cost Explorer, and estimate the cost impact of infrastructure changes before merge with a tool such as Infracost. Right-size instance types for each workload based on actual profiling. Apply lifecycle policies to S3 data to automatically tier infrequently accessed objects to cheaper storage classes.

How would you implement automatic database credential rotation with no application downtime?

Use a secrets manager with native rotation support (AWS Secrets Manager, HashiCorp Vault). The rotation process creates new credentials, validates them, updates the secret in the manager, and then revokes the old ones. The application must either fetch credentials from the manager on each new connection, or implement a connection pool renewal mechanism that triggers on authentication errors. Test the full rotation process in staging before enabling it in production.
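
The "renew on authentication error" option can be sketched as follows. Everything here is a stand-in: FakeDatabase simulates a database, a dictionary simulates the secrets manager, and RotatingClient is a hypothetical wrapper, not the API of any real driver or secrets SDK.

```python
class AuthError(Exception):
    """Raised when the database rejects the supplied credentials."""


class FakeDatabase:
    """Stand-in for a real database that accepts only the current password."""

    def __init__(self, password: str) -> None:
        self.password = password

    def connect(self, password: str) -> str:
        if password != self.password:
            raise AuthError("invalid credentials")
        return "connection"


class RotatingClient:
    """Caches credentials and refetches them once if authentication fails."""

    def __init__(self, db: FakeDatabase, fetch_secret) -> None:
        self._db = db
        self._fetch_secret = fetch_secret
        self._cached = fetch_secret()

    def connect(self) -> str:
        try:
            return self._db.connect(self._cached)
        except AuthError:
            # Cached credentials may have been rotated: refetch once and retry.
            self._cached = self._fetch_secret()
            return self._db.connect(self._cached)


secret_store = {"db_password": "v1"}  # simulated secrets manager
db = FakeDatabase("v1")
client = RotatingClient(db, lambda: secret_store["db_password"])
client.connect()

# Simulate a rotation: the manager and the database move to the new password.
secret_store["db_password"] = "v2"
db.password = "v2"
print(client.connect())  # succeeds after one transparent refetch
```

Retrying only once keeps a genuinely wrong credential from causing a retry loop; the second failure propagates to the caller.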

How would you design a deployment system architecture that supports multiple teams with different release cadences and levels of autonomy?

Build an internal developer platform (IDP) with abstractions over Kubernetes (Backstage, Crossplane, or Humanitec). Each team has autonomy to deploy within guardrails defined by the platform team: resource quotas, security policies, mandatory observability patterns. The platform team is an enabler, not a gatekeeper. Golden path templates allow new teams to start with all the right practices in place by default.

How would you implement disaster recovery for a critical system with an RTO of one hour and an RPO of 15 minutes?

An RPO of 15 minutes requires near-real-time data replication: streaming replication to a secondary region or continuous backups with point-in-time recovery. An RTO of one hour requires pre-provisioned infrastructure in the DR region (warm standby) — not building it from scratch at the time of the disaster. Automate failover with runbooks that are regularly tested in DR exercises. Document the measured RTO/RPO, not just the theoretical targets.

How would you structure a GitOps strategy to manage the desired state of multiple production Kubernetes clusters?

A configuration repository serves as the single source of truth for the desired state of each cluster. Tools like ArgoCD or Flux continuously reconcile the cluster against the repository, automatically detecting and correcting drift. All cluster changes are made exclusively via pull requests to the repository — never via direct kubectl in production. This provides a complete audit trail, change review, and the ability to roll back instantly by reverting a commit.
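
Conceptually, reconciliation is a comparison between the desired state in the repository and the actual state in the cluster. The sketch below shows that comparison on plain dictionaries; it is a simplified model, not how ArgoCD or Flux is implemented, and plan_reconcile and the resource names are made up.

```python
def plan_reconcile(desired: dict, actual: dict) -> list[tuple[str, str]]:
    """Return the actions needed to make the actual state match the desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))  # drift: the cluster differs from the repo
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))  # exists in the cluster but not in the repo
    return actions


desired = {
    "api": {"image": "api:2", "replicas": 3},
    "worker": {"image": "worker:1", "replicas": 2},
}
actual = {
    "api": {"image": "api:2", "replicas": 5},   # someone scaled it by hand
    "debug-pod": {"image": "busybox", "replicas": 1},
}
print(plan_reconcile(desired, actual))
# [('update', 'api'), ('create', 'worker'), ('delete', 'debug-pod')]
```

Reverting a commit changes desired, and the same comparison then produces the actions that roll the cluster back.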

How would you design an observability strategy for a payment processing system with a 99.99% availability SLA?

Define precise SLIs: transaction success rate, p99 latency, end-to-end processing time. Set SLOs with tight error budgets given the 99.99% target. Implement multi-level alerting: burn rate alerts that detect degradation before the SLO is breached. Distributed traces for every transaction. Maintain separate business dashboards (transactions per second, volume) and technical dashboards (latency, error rate). On-call with a clear escalation policy and tested runbooks.
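
Burn rate is the observed error rate divided by the error rate the SLO allows; a burn rate of 1 spends the budget exactly over the SLO window. The sketch below assumes a two-window check and a paging threshold of 14.4, a commonly cited value that corresponds to spending 2% of a 30-day budget in one hour. The function names and sample error rates are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_rate / (1 - slo)


def should_page(
    long_window_error_rate: float,
    short_window_error_rate: float,
    slo: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn fast: sustained and still happening."""
    return (
        burn_rate(long_window_error_rate, slo) >= threshold
        and burn_rate(short_window_error_rate, slo) >= threshold
    )


SLO = 0.9999  # allows 0.01% of transactions to fail
print(f"{burn_rate(0.002, SLO):.1f}")    # 20.0
print(should_page(0.002, 0.002, SLO))    # True: degradation is ongoing
print(should_page(0.002, 0.0001, SLO))   # False: the short window has recovered
```

Requiring the short window as well avoids paging for an incident that has already ended.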

How would you assess whether a development team is ready to adopt continuous deployment, and what process and technical changes would be required?

Technical prerequisites: a high-reliability test suite, feature flags to decouple deploy from release, rollback capability in under five minutes, and observability that detects regressions quickly. Process prerequisites: a culture where developers own their code in production, small and frequent deployments (not large batches), and blameless postmortems that drive improvements. The biggest blocker is typically cultural, not technical.

How would you approach reducing the blast radius of an incident in a microservices architecture with complex inter-service dependencies?

Implement circuit breakers to prevent a single service failure from cascading. Design for graceful degradation: if a recommendations service fails, show default results rather than failing the entire page. Use bulkheads to isolate the thread pools of critical dependencies from non-critical ones. Set aggressive timeouts on all inter-service calls. Run controlled chaos engineering regularly to validate that isolation mechanisms work as expected.
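
A circuit breaker combined with a fallback can be sketched in a few lines. This is a simplified, single-threaded illustration under assumed defaults (three failures to open, 30 seconds before a trial call); the class and parameter names are made up, and production systems would typically rely on an established library or a service mesh feature.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], object], fallback: Callable[[], object]):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()  # open: fail fast without touching the dependency
            # Reset window elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


def recommendations() -> list[str]:
    raise TimeoutError("recommendations service unavailable")  # simulated outage


breaker = CircuitBreaker(failure_threshold=2)
for _ in range(3):
    print(breaker.call(recommendations, lambda: ["default results"]))
```

All three calls return the default results, but the third never reaches the failing service, which is what keeps the failure from cascading.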

Confusing familiarity with tools with understanding the principles they implement

Knowing how to run Terraform commands is not the same as understanding state management, drift detection, or module strategies at scale. Interviewers at companies with complex infrastructure ask the 'why' behind every tool choice. A candidate who can only describe commands without explaining the design decisions behind them comes across as junior rather than senior.

Not mentioning observability when designing or proposing infrastructure changes

Proposing an architecture without addressing how you will know it is working correctly in production, how you will detect degradation, or how you will diagnose an incident signals incomplete thinking. Observability is not an add-on — it is part of the design from the start.

Talking about automation without mentioning the validation gates and rollback mechanisms

Automating a deployment without addressing what stops it when something goes wrong, how the problem is detected, and how it is reverted introduces more risk than the manual process it replaces. Interviewers at companies with high deployment frequency specifically probe for these safety mechanisms.

Not treating cost as a design variable in infrastructure decisions

Proposing highly available solutions without discussing their cost, or without evaluating whether the level of availability actually justifies that cost for the specific use case, signals immaturity in the role. A DevOps engineer must be able to articulate the reliability-versus-cost trade-off with concrete data.

Describing on-call processes without explaining how operational burden is reduced over time

A team that normalizes pager noise, alert fatigue, or recurring incidents without generating action items to eliminate them is accumulating operational debt. Experienced interviewers ask what you did after the incident to prevent it from happening again — not just how you resolved it.

Proposing Kubernetes as the solution to every infrastructure problem without accounting for the complexity it introduces

Kubernetes solves real problems at scale, but adds significant operational overhead. Recommending it for a three-person startup or an application with 100 daily users demonstrates poor judgment about the fit between solution and problem. A mature DevOps engineer evaluates whether the problem actually justifies that complexity.

DevOps Engineer

Accelerates software delivery by building the systems, pipelines, and culture that connect development to production.

A DevOps Engineer designs and maintains the infrastructure, continuous integration and delivery pipelines, and operational practices that enable development teams to ship software quickly, reliably, and securely. Their work spans build and deployment automation, production system observability, infrastructure as code management, and incident response. They act as a bridge between development and operations teams, championing a culture of shared ownership over system reliability.

Docker, Kubernetes, Terraform, CI/CD, AWS, Observability

Main Responsibilities

• Design, implement, and maintain CI/CD pipelines that automate the build, testing, and deployment of applications across multiple environments.
• Manage infrastructure as code using tools like Terraform or Pulumi, ensuring reproducibility and version control.
• Administer Kubernetes clusters and cloud platforms, optimizing for cost, availability, and scalability.
• Implement observability systems with metrics, logs, and distributed traces to detect and diagnose production issues.
• Define and execute incident response processes, including runbooks, postmortems, and preventive improvements.
• Ensure infrastructure and pipelines comply with security policies, regulatory requirements, and secrets management practices.

Key Skills

Technical Skills

Containerization with Docker and orchestration with Kubernetes: deployments, services, ingress, RBAC, and resource management
Infrastructure as code with Terraform or Pulumi: modules, state management, workspaces, and drift policies
Cloud platforms (AWS, GCP, or Azure) with deep knowledge of their compute, networking, storage, and security services
CI/CD pipeline design and implementation with GitHub Actions, GitLab CI, Jenkins, or equivalent
Observability: metrics with Prometheus/Grafana, logs with ELK or Loki, and distributed tracing with OpenTelemetry
Scripting and automation with Bash, Python, or Go for operational tasks and internal tooling

Soft Skills

Reliability mindset: thinking first about how a system can fail before thinking about how to build it
Clear communication during incidents to coordinate technical response while keeping non-technical stakeholders informed
Collaboration with development teams to design systems that are operable from the start, not as an afterthought
Rigorous documentation of runbooks, architecture decisions, and postmortems that the team can consume and keep current
Ability to prioritize reliability improvements over new features under business pressure
Systems thinking to identify single points of failure and hidden dependencies in complex architectures

Real use cases

Context

A well-designed pipeline reduces the time from commit to production, with all the necessary validations to maintain quality and security throughout.

Real examples

Pipelines with lint, test, build, vulnerability scanning, and deploy stages
Blue-green or canary deployment strategies for zero-downtime releases
Automatic rollback triggered by error rate or latency metrics post-deploy
Ephemeral environments per pull request for isolated testing before merge

Context

Infrastructure as code allows cloud resources to be managed with the same engineering practices as software: review, versioning, testing, and reproducibility.

Real examples

Reusable Terraform modules for networking, compute, and databases
Multi-environment management (dev, staging, prod) with separate workspaces or stacks
Automatic drift detection between declared state and actual infrastructure
Compliance policies as code with Sentinel or OPA

Context

In systems with multiple services, observability is what enables you to diagnose where and why something is failing when a user reports a problem.

Real examples

Metrics stack with Prometheus and operational dashboards in Grafana
Correlation of logs, metrics, and traces using a unique correlation ID per request
Alerts with associated runbooks to reduce mean time to response during incidents
SLIs, SLOs, and error budgets as a framework for reliability decisions

Context

Secrets (credentials, API keys, certificates) are among the most common attack vectors in cloud infrastructure, and poor secrets management is a frequent cause of security breaches.

Real examples

HashiCorp Vault or AWS Secrets Manager integration into pipelines and applications
Automatic rotation of database credentials and API keys
Secret scanning in commits with tools like Gitleaks or Trufflehog
Least-privilege principle applied to IAM roles and Kubernetes service accounts

Context

Systems fail. The difference between mature and immature operations organizations is the speed of detection and response, and the ability to learn from every incident.

Real examples

Automated runbooks for the most frequently occurring incident types
Blameless postmortems that produce measurable action items
Controlled chaos engineering to surface weaknesses before production does
On-call rotations with equitable scheduling and progressive reduction of noisy alerts