What is the difference between availability zones and regions in AWS, and how would you use them to design a high-availability system?

A region is a geographic area with multiple data centers. An Availability Zone (AZ) is a data center or group of data centers within a region with independent power, networking, and connectivity. For high availability within a region: deploy resources in at least two or three AZs to tolerate the failure of an entire AZ. For resilience against a full regional failure: multi-region architecture with data replication and traffic failover. Multi-AZ high availability covers most cases; multi-region is only justified when availability requirements or data residency regulations demand it, given its greater complexity and cost.

What is the least privilege principle in IAM and how would you implement it in an AWS account with multiple teams?

Least privilege means each user, role, or service has only the permissions required for its specific function — nothing more. Practical implementation: IAM roles by function (developer read-only in production, deploy role for the CI/CD pipeline, database admin role scoped to RDS), no long-lived individual IAM users where a more secure alternative exists, Service Control Policies in AWS Organizations to establish limits no role in the account can exceed, and periodic audits with IAM Access Analyzer to detect unused permissions. The Access Advisor review for each role shows services not used in the last 90 days that are candidates for removal from the policy.
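
The Access Advisor review described above can be sketched as a simple filter. This is a minimal illustration, assuming the last-accessed data has already been exported into a dictionary; the function name, the input shape, and the 90-day threshold are assumptions for the example, not an AWS API.

```python
from typing import Optional


def removal_candidates(
    last_accessed_days: dict[str, Optional[int]],
    threshold_days: int = 90,
) -> list[str]:
    """Return services whose permissions look unused.

    last_accessed_days maps a service name to the number of days since the
    role last used it, or None if the role never used it (hypothetical input).
    """
    return sorted(
        service
        for service, days in last_accessed_days.items()
        if days is None or days > threshold_days
    )


# Example: "rds" was last used 120 days ago and "sqs" was never used.
print(removal_candidates({"s3": 3, "rds": 120, "sqs": None, "ec2": 90}))
```

The output is a review list, not an automatic removal: a service used only once a quarter or once a year would show up here and still be legitimately needed.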

How would you design the backup and recovery strategy for an RDS database with critical business data?

Enable automated backups with at least 7-day retention for point-in-time recovery. Enable Multi-AZ for synchronous replication to a standby with automatic failover on primary failure, which typically completes in one to two minutes for a Multi-AZ instance deployment. For cross-region DR, configure Cross-Region Automated Backups that replicate snapshots and transaction logs to a secondary region; the replication lag determines the achievable RPO, so verify it against the target. Test restoration periodically by measuring the actual recovery time to verify the RTO is achievable. Backups that have never been restored are hypotheses, not guarantees.

What are Reserved Instances and Savings Plans in AWS, and when would you use each to optimize costs?

Reserved Instances (RIs) commit to a specific instance type in a region for 1 or 3 years in exchange for a substantial discount over on-demand pricing (commonly 30-60%, and up to 72% according to AWS, depending on term and payment option). Savings Plans offer a similar discount in exchange for committing to a minimum hourly compute spend, with greater flexibility to change instance type. RIs are optimal when the workload has a stable and predictable instance type. Savings Plans are preferable when there is instance type variability or when multiple compute services are used (EC2, Lambda, Fargate). Spot Instances suit interruption-tolerant workloads such as batch processing, CI/CD runners, or stateless workers, and can save up to 90%.
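
The "stable and predictable" condition can be made concrete with a break-even calculation. The sketch below assumes a simplified model in which a commitment is billed every hour at a discounted rate whether or not it is used; the function names and the numbers are illustrative, not AWS prices.

```python
def commitment_saving(on_demand_hourly: float, discount: float, utilization: float) -> float:
    """Hourly saving of a commitment versus on-demand (negative means a loss).

    discount: fractional discount of the commitment, e.g. 0.40 for 40%.
    utilization: fraction of hours in which the committed capacity is used.
    """
    committed_cost = on_demand_hourly * (1 - discount)  # paid every hour, used or not
    on_demand_cost = on_demand_hourly * utilization     # paid only for hours used
    return on_demand_cost - committed_cost


def break_even_utilization(discount: float) -> float:
    """Utilization above which the commitment is cheaper than on-demand."""
    return 1 - discount


# With a 40% discount the commitment pays off above 60% utilization.
print(break_even_utilization(0.40))
print(commitment_saving(on_demand_hourly=1.00, discount=0.40, utilization=0.90))
```

Under this model, a workload that runs less than `1 - discount` of the time is cheaper on-demand, which is why commitments belong on the steady baseline rather than on the peaks.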

How would you detect and respond to an unexpected 300% increase in the AWS bill in a single month?

Immediately activate AWS Cost Explorer with daily granularity to identify which service and which account the increase started in. Review the Cost and Usage Report to identify the specific resource with the greatest change. Common causes: instances not terminated after an experiment, an S3 bucket with massive data transfer (data egress), a code loop making excessive AWS API calls, or compromised credentials used for cryptocurrency mining. Contain the cost immediately if it is an unnecessary resource. Configure budget alerts with notifications at 80% and 100% of expected monthly spend so this is not discovered at month-end again.
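
The first triage step, finding where the increase came from, amounts to ranking services by change in cost. A minimal sketch, assuming per-service totals for two periods have already been exported (for example from Cost Explorer) into plain dictionaries; the function name and the figures are hypothetical.

```python
def rank_cost_increases(
    previous: dict[str, float],
    current: dict[str, float],
) -> list[tuple[str, float]]:
    """Return (service, delta_usd) pairs sorted by largest increase first."""
    services = set(previous) | set(current)
    deltas = {s: current.get(s, 0.0) - previous.get(s, 0.0) for s in services}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


last_month = {"EC2": 4000.0, "S3": 800.0, "RDS": 1500.0}
this_month = {"EC2": 16500.0, "S3": 900.0, "RDS": 1500.0, "DataTransfer": 600.0}

for service, delta in rank_cost_increases(last_month, this_month):
    print(f"{service}: {delta:+.2f} USD")
```

The same ranking can then be repeated one level down (by account, region, or resource tag) until a specific resource is identified.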

What is a VPC Endpoint and why is it important for cloud infrastructure security?

A VPC Endpoint allows resources inside a VPC to communicate with AWS services (S3, DynamoDB, Secrets Manager) over the AWS private network without traffic leaving to the internet. Without VPC Endpoints, instances in private subnets need a NAT Gateway to reach these services through their public endpoints, which requires a route to public address space and adds data processing cost. With VPC Endpoints: traffic stays on private addresses inside the AWS network, bucket policies can restrict access to only the specific VPC or endpoint, and the NAT Gateway cost for that traffic is eliminated (gateway endpoints for S3 and DynamoDB are free of charge, while interface endpoints carry their own hourly and per-GB charges). They are especially critical for S3 and DynamoDB access from instances handling sensitive data.
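
The bucket policy restriction mentioned above can be sketched as a policy document built in Python. The bucket name and endpoint ID are placeholders; `aws:SourceVpce` is the IAM condition key that matches the VPC endpoint a request came through.

```python
import json


def vpce_only_bucket_policy(bucket: str, vpce_id: str) -> dict:
    """Build a bucket policy that denies any request not arriving via vpce_id."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Deny applies whenever the request's endpoint differs from vpce_id.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# Placeholder names, for illustration only.
print(json.dumps(vpce_only_bucket_policy("example-sensitive-bucket", "vpce-0123example"), indent=2))
```

A blanket deny like this also blocks console and administrative access from outside the endpoint, so in practice it usually needs a carefully scoped exception for an administration role.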

How would you manage secrets for an application deployed on Kubernetes in AWS?

The most secure option is using AWS Secrets Manager or Parameter Store integrated with Kubernetes via the Secrets Store CSI Driver: secrets are stored in AWS, accessed through IAM roles assigned to pods using IRSA (IAM Roles for Service Accounts), and mounted as files in the pod filesystem without going through Kubernetes etcd, where native Kubernetes Secrets are stored Base64-encoded (an encoding, not encryption) unless encryption at rest is explicitly configured. Avoid hardcoding secrets in deployment manifest environment variables that are stored in git. Enable automatic secret rotation in Secrets Manager so the application always uses fresh credentials.
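
To see why Base64 offers no protection, note that anyone who can read the value of a native Kubernetes Secret can reverse it with no key. A minimal sketch using only the standard library; the secret value is made up.

```python
import base64


def decode_k8s_secret_value(encoded_value: str) -> str:
    """Reverse the Base64 encoding used in a Secret manifest's data field."""
    return base64.b64decode(encoded_value).decode("utf-8")


# What would appear under `data:` in a Secret manifest (made-up value).
stored = base64.b64encode(b"s3cr3t-password").decode("ascii")
print(stored)
print(decode_k8s_secret_value(stored))
```

No key material is involved at any point, which is the difference between an encoding and encryption.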

How would you implement cost monitoring in an organization with multiple AWS accounts so each team is accountable for their spend?

Implement AWS Organizations with an account strategy per team or environment for billing isolation. Mandatory tagging of all resources with team, project, and environment tags using SCPs that reject resource creation without the required tags. Configure AWS Budgets per account with alerts to each account's owners. Publish a shared cost dashboard in Cost Explorer accessible to each team showing only their accounts. Monthly FinOps reviews where each team reports their optimizations and anomalies. Cost accountability only works when teams have visibility into their spend and the ability to act on it.
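
The tag enforcement described above can be sketched as an SCP document generated in Python. This is an illustration limited to `ec2:RunInstances`; the tag names are the ones from the text, and the helper name is hypothetical. It relies on the `Null` condition operator with the `aws:RequestTag/<key>` condition key, which is true when the tag is absent from the request.

```python
import json


def require_tags_scp(required_tags: list[str]) -> dict:
    """Build an SCP that denies launching EC2 instances missing any required tag."""
    statements = [
        {
            "Sid": f"DenyRunInstancesWithout{tag.capitalize()}Tag",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # True when the request does not carry this tag key at all.
            "Condition": {"Null": {f"aws:RequestTag/{tag}": "true"}},
        }
        for tag in required_tags
    ]
    return {"Version": "2012-10-17", "Statement": statements}


print(json.dumps(require_tags_scp(["team", "project", "environment"]), indent=2))
```

One statement per tag is deliberate: several keys inside a single condition block are combined with AND, which would deny only requests missing all of the tags rather than any one of them.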

How would you design the network architecture of a multi-tier web application in AWS following security best practices?

VPC with at least three subnet tiers: public (ALB and NAT Gateway only), private application tier (EC2 instances or ECS tasks), and private data tier (RDS, ElastiCache). Cascading Security Groups: the ALB accepts internet traffic (80/443), application instances accept only traffic from the ALB's security group on the application port, and databases accept only traffic from the application security group on the engine port. NACLs as a second line of defense at the subnet level. VPC Endpoints for S3 and Secrets Manager. VPC Flow Logs enabled for traffic auditing. This architecture ensures no application or data tier component is directly reachable from the internet.

How would you implement autoscaling in ECS or EKS that responds to business metrics in addition to standard CPU and memory metrics?

For ECS: Application Auto Scaling with CloudWatch metric-based policies. Business metrics (number of messages in an SQS queue, application requests per second, percentile latency) are published as custom CloudWatch metrics. The scaling policy adjusts the number of tasks when the metric crosses the threshold. For Kubernetes: HPA (Horizontal Pod Autoscaler) consumes metrics from the cluster's metrics API or from external metric adapters (Prometheus Adapter, KEDA). KEDA is especially powerful: it allows scaling directly based on Kafka consumer lag, the number of SQS messages, or any external metric without writing custom scaling code.
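
The scaling decision itself is a proportional rule. The sketch below implements the formula documented for the Kubernetes HPA, `desired = ceil(current * currentMetric / targetMetric)`, applied to a business metric such as queue backlog per replica; the bounds and example values are assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Proportional scaling rule, clamped to the configured bounds."""
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# 4 workers, each facing 250 queued messages on average, target of 100 per worker.
print(desired_replicas(current_replicas=4, current_metric=250, target_metric=100))
```

Real autoscalers add tolerance bands, stabilization windows, and cooldowns on top of this rule to avoid oscillating on noisy metrics.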

How would you manage database credential rotation without downtime for the applications that use them?

AWS Secrets Manager with automatic rotation manages this process. When rotation is triggered, Secrets Manager creates new credentials, saves them as a new secret version, sets them in the database, verifies they are valid, and then promotes the new version to AWSCURRENT while the previous version is labeled AWSPREVIOUS. With the alternating-users rotation strategy, both sets of credentials remain valid during the transition period; with single-user rotation the old password stops working as soon as it is changed. Applications that retrieve the secret on each connection automatically get the new credentials. For applications with persistent connection pools, implement a pool refresh mechanism when authentication fails with the current credentials: the client fetches the new credentials and reconnects. Combined with the alternating-users strategy, this pattern allows rotation without downtime.
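
The refresh-on-authentication-failure pattern can be sketched as follows. Everything here is a stand-in: `AuthError`, `fetch_secret`, and `connect` are hypothetical callables representing the database driver and the secrets client, so the logic can be shown without any AWS dependency.

```python
class AuthError(Exception):
    """Raised by the (hypothetical) database driver when credentials are rejected."""


class RotatingCredentialClient:
    """Caches credentials and refreshes them once when authentication fails."""

    def __init__(self, fetch_secret, connect):
        self._fetch_secret = fetch_secret  # callable returning the current credentials
        self._connect = connect            # callable(credentials) -> connection
        self._credentials = fetch_secret()

    def get_connection(self):
        try:
            return self._connect(self._credentials)
        except AuthError:
            # Cached credentials were probably rotated: refresh once and retry.
            self._credentials = self._fetch_secret()
            return self._connect(self._credentials)


# Simulated secret store and database for demonstration.
secret_store = {"password": "old"}
accepted_passwords = {"old"}


def fetch_secret():
    return dict(secret_store)


def connect(credentials):
    if credentials["password"] not in accepted_passwords:
        raise AuthError("invalid credentials")
    return f"connection-with-{credentials['password']}"


client = RotatingCredentialClient(fetch_secret, connect)
print(client.get_connection())

# Rotation happens: the store holds a new password and the old one is revoked.
secret_store["password"] = "new"
accepted_passwords.clear()
accepted_passwords.add("new")
print(client.get_connection())
```

The retry is attempted only once on purpose: if the freshly fetched credentials also fail, the error is a real problem and should surface rather than loop.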

How would you implement a complete observability strategy for a microservices architecture on AWS?

The three pillars in the AWS ecosystem. Metrics: CloudWatch Metrics for AWS infrastructure metrics and custom application metrics, with per-service dashboards and alarms with automatic actions. Logs: CloudWatch Logs with log groups organized per service, CloudWatch Logs Insights for ad-hoc queries, and S3 export for long-term retention. Traces: AWS X-Ray for distributed tracing between services, with trace-to-log correlation using the trace ID. For organizations preferring provider-agnostic tools, OpenTelemetry as the instrumentation layer that sends to X-Ray, Datadog, or any OTLP backend. The key is that the trace correlation ID must be propagated in every inter-service request to follow the complete path of a request.
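
Trace ID propagation can be illustrated with the W3C Trace Context `traceparent` header, which OpenTelemetry uses by default. The sketch below is hand-rolled for illustration only; in practice an OpenTelemetry SDK handles this, and the helper names are hypothetical.

```python
import json
import secrets


def new_traceparent() -> str:
    """Start a new trace: version 00, 16-byte trace ID, 8-byte span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the incoming trace ID and mint a new span ID for the outgoing call."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def log_line(message: str, traceparent: str) -> str:
    """Emit a structured log entry carrying the trace ID for trace-to-log correlation."""
    trace_id = traceparent.split("-")[1]
    return json.dumps({"message": message, "trace_id": trace_id})


incoming = new_traceparent()            # header received by service A
outgoing = child_traceparent(incoming)  # header service A sends to service B
print(log_line("calling service B", outgoing))
```

Because every hop keeps the same trace ID, filtering logs by that one field reconstructs the full path of a request across services.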

How would you design a production access strategy for engineers who need temporary access for debugging without compromising security?

Implement just-in-time access with tools like AWS IAM Identity Center (SSO) with temporary access via privilege elevation. The engineer requests temporary elevated access (maximum 1-4 hours), the system records the request with the reason, and access is granted automatically or with approval. AWS Systems Manager Session Manager provides shell access to EC2 instances without opening inbound SSH ports or distributing SSH keys, with complete session logging to S3 or CloudWatch Logs. For databases: access via a managed bastion or RDS Proxy with temporary IAM authentication. All production access must be logged with the user's identity, time, duration, and actions taken for audit purposes.

How would you structure Terraform modules for an organization with multiple teams and dozens of environments?

Separate into three levels. Base infrastructure modules (shared repository): reusable abstractions for the most frequent resources such as a standard VPC, ECS cluster, and RDS instance with security best practices baked in. Teams consume these modules without being able to override the security controls. Environment modules (per-team repository): composition of the base modules with team-specific values (sizes, regions, configuration). Root modules per environment: dev, staging, prod each with their own state files in S3 and locking in DynamoDB. Sensitive variables are never hardcoded in the code: they are injected as environment variables in the pipeline or read from Secrets Manager. Version modules with semantic versioning so teams control when they adopt base module changes.

How would you implement a serverless architecture for an application with highly variable and unpredictable traffic spikes?

Lambda for the compute layer: scales from zero to thousands of concurrent instances in seconds without server management. API Gateway for the HTTP layer with authentication and throttling (rate limiting). DynamoDB for storage with on-demand billing that scales automatically without provisioning. SQS to decouple event ingestion from processing when spikes are very abrupt. EventBridge for event routing between Lambda functions, with Step Functions where explicit workflow orchestration is needed. Critical considerations: Lambda cold starts (mitigable with provisioned concurrency for critical functions), concurrency limits per function and per account (configure reserved concurrency for critical functions), and per-invocation cost that can exceed persistent instance cost if traffic is very high and constant.

How would you evaluate whether a workload should run on EC2, ECS, EKS, or Lambda?

Lambda: for short-lived event-driven workloads (under 15 minutes), with highly variable or sporadic traffic, and where server management is an unjustified overhead. ECS Fargate: for containerized applications without Kubernetes experience, with predictable load, and where server abstraction is desired. EKS: when Kubernetes is needed for cross-cloud portability, a rich tooling ecosystem (service mesh, GitOps, KEDA), or when the team already has expertise. EC2: for workloads with specific hardware requirements, that need direct OS access, or when cost is the absolute priority with Reserved Instances. The decision also factors in team expertise: the best service is the one the team can operate with confidence.

How would you design the cloud architecture for a fintech platform handling payments with PCI-DSS requirements and 99.99% high availability?

PCI-DSS requires isolation of the cardholder data environment (CDE) from the rest of the infrastructure. Layered architecture with strict segmentation: the CDE in a separate VPC accessible exclusively from the payment application layer, never directly from the internet. Mandatory controls: data encryption at rest with CMKs managed by KMS, TLS 1.2+ in transit, complete logging of all CDE access in CloudTrail with immutable delivery to S3. For 99.99% availability (under 53 minutes of annual downtime): active-active multi-AZ for all critical path components, active-passive multi-region with sub-minute automated failover, circuit breakers for external dependencies, and monthly chaos engineering tests verifying failover mechanisms work.
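
The downtime budget quoted above follows from simple arithmetic, and the same arithmetic shows why redundancy is needed to reach it. A small sketch; the redundancy formula assumes independent failures, which is an idealization, and the 99% single-component figure is an example value.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability: float) -> float:
    """Downtime budget per year for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def redundant_availability(single: float, copies: int) -> float:
    """Availability of N redundant copies, assuming failures are independent."""
    return 1 - (1 - single) ** copies


print(round(annual_downtime_minutes(0.9999), 2))   # budget for 99.99%
print(round(redundant_availability(0.99, 2), 6))   # two copies of a 99% component
```

Correlated failures (a shared dependency, a bad deployment rolled out to every copy) break the independence assumption, which is one reason the failover mechanisms still need to be tested regularly.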

How would you design the migration strategy from on-premise to AWS for a company with 50 applications of varying ages and technologies?

Start with an application inventory and classification using the 7 Rs of migration: Retire (systems to be decommissioned), Retain (systems staying on-premise for now), Rehost (lift-and-shift to EC2), Relocate (moving workloads at the hypervisor level without changing them, for example VMware environments), Repurchase (replacing the application with a SaaS product), Replatform (small optimizations: RDS instead of the on-premise engine), and Refactor/Re-architect (redesign for cloud-native). The greatest value with the least risk is in Rehost and Replatform: migrate quickly with AWS Application Migration Service and optimize afterward. The most critical applications or those that benefit most from cloud-native merit Refactor but require more time and risk. Establish a Landing Zone with the correct accounts, networks, and security controls before migrating the first application. Define clear success criteria for each migration wave.

How would you implement a proactive cloud security strategy that detects misconfigurations before they reach production?

Security in multiple layers. Development time: tfsec or Checkov in the Terraform pipeline blocking plans with critical misconfigurations (public S3 bucket, security group with 0.0.0.0/0 on sensitive ports). Deploy time: AWS Config Rules or Security Hub continuously evaluating security posture with automatic alerts on deviations. In production: AWS GuardDuty for threat detection in CloudTrail and VPC Flow Logs, Macie for sensitive data identification in S3, and AWS Security Hub as the centralized panel aggregating findings from all security services. For multi-account: AWS Organizations with SCPs that prevent the most critical prohibited configurations at the organizational level. The goal is that security problems are detected and blocked in the pipeline — not discovered in production.
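
The kind of pipeline check described for development time can be sketched as a small rule evaluator. This is a simplified illustration of the idea, not how tfsec or Checkov are implemented; the rule format and the list of sensitive ports are assumptions.

```python
SENSITIVE_PORTS = {22, 3389, 3306, 5432}  # SSH, RDP, MySQL, PostgreSQL (illustrative list)


def open_sensitive_rules(rules: list[dict]) -> list[tuple[str, list[int]]]:
    """Return (rule name, exposed ports) for ingress rules open to the whole internet."""
    findings = []
    for rule in rules:
        if rule["cidr"] != "0.0.0.0/0":
            continue
        exposed = [
            port
            for port in sorted(SENSITIVE_PORTS)
            if rule["from_port"] <= port <= rule["to_port"]
        ]
        if exposed:
            findings.append((rule["name"], exposed))
    return findings


planned_rules = [
    {"name": "ssh-open", "cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22},
    {"name": "https", "cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    {"name": "db-internal", "cidr": "10.0.0.0/16", "from_port": 5432, "to_port": 5432},
]

findings = open_sensitive_rules(planned_rules)
print(findings)
# A pipeline step would fail the build when findings is non-empty.
```

Running the check against the plan, before apply, is what moves detection from production into the pipeline.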

How would you manage the cost of a cloud platform spending $500K monthly with a team that wants to reduce it by 30% without degrading performance?

A 30% reduction on a platform at that scale requires a structured FinOps program — not just one-off adjustments. Per-service analysis: identify the five services representing 80% of spend. For each, evaluate: is it the right size for the actual load? Are there workloads that can move to Spot or Savings Plans? Is there data that can move to cheaper storage tiers? Typically the greatest opportunities are: rightsizing of oversized instances (20-30% of savings), Reserved Instances for predictable workloads (15-25%), elimination of orphaned resources, and data transfer optimization. Instrument cost as a product metric: each team sees their weekly spend with trend. Distributed cost accountability produces continuous optimizations without needing a dedicated central team.
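
The per-service analysis starts with a Pareto cut: which services account for most of the bill. A minimal sketch, assuming monthly spend per service is available as a dictionary; the breakdown of the $500K figure is invented for illustration.

```python
def services_covering(spend: dict[str, float], share: float = 0.8) -> list[str]:
    """Return the largest services whose combined cost reaches `share` of total spend."""
    total = sum(spend.values())
    selected: list[str] = []
    running = 0.0
    for service, cost in sorted(spend.items(), key=lambda item: item[1], reverse=True):
        if running >= share * total:
            break
        selected.append(service)
        running += cost
    return selected


# Hypothetical breakdown of a 500K USD monthly bill.
monthly_spend = {
    "EC2": 250_000,
    "RDS": 100_000,
    "S3": 60_000,
    "DataTransfer": 50_000,
    "Other": 40_000,
}
print(services_covering(monthly_spend))
```

The services returned are where rightsizing and commitment analysis should begin; everything below the cut rarely moves the total enough to justify early attention.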

How would you design a cloud data platform that complies with GDPR, including the right to be forgotten, with petabytes of historical data?

The GDPR challenge at scale is the proliferation of copies of user data. Implement a central personal data registry that maps which user's data lives in which bucket, table, and partition. For the right to be forgotten: design data with crypto-shredding where each user's data is encrypted with a unique per-user key in KMS. When a deletion request is received, delete the user's key: the data remains physically but is irrecoverable. Note that KMS enforces a waiting period of 7 to 30 days before a key is actually deleted, so disable the key immediately and account for that window in the deletion process. For data in historical backups where crypto-shredding was not implemented from the start: document the maximum retention period of the backups and ensure they are purged according to the policy. Amazon Macie for automatic discovery of personal data in S3 that was not catalogued.

How would you evaluate and mitigate the vendor lock-in risk in a cloud architecture with dependencies on proprietary provider services?

Lock-in has real costs but also benefits: native services frequently offer better integration, simpler operation, and lower cost than provider-agnostic alternatives. The correct evaluation is not to eliminate lock-in but to manage the risk strategically. For each proprietary service, evaluate: does an equivalent exist on other cloud providers? How much would migration cost if needed? Does the functionality justify the dependency? Mitigation strategies without eliminating the benefits: abstract proprietary services behind interface layers that can be re-implemented for another provider (a storage interface that today uses S3 can tomorrow use GCS), use OpenTelemetry for provider-agnostic instrumentation, and avoid lock-in in the critical data components where migration would be most costly.
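
The storage interface example can be sketched with a structural type. `BlobStore`, `InMemoryBlobStore`, and `archive_report` are hypothetical names; the in-memory class stands in for an S3-backed or GCS-backed implementation so the example runs without any cloud SDK.

```python
from typing import Protocol


class BlobStore(Protocol):
    """Provider-neutral storage interface that the application depends on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Stand-in implementation; an S3 or GCS adapter would expose the same methods."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_report(store: BlobStore, report_id: str, body: bytes) -> str:
    """Application code knows only the interface, never the provider SDK."""
    key = f"reports/{report_id}.bin"
    store.put(key, body)
    return key


store = InMemoryBlobStore()
key = archive_report(store, "2024-q1", b"quarterly numbers")
print(key, store.get(key))
```

The abstraction is worth its cost mainly at boundaries like this one, where the provider-specific surface is small; wrapping every managed service the same way tends to discard the integration benefits that justified using them.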

Not mentioning costs when designing cloud architectures as if availability were the only criterion

Every cloud architecture decision has a cost. A Cloud Engineer who proposes multi-AZ, multi-region, and high redundancy for all components without evaluating whether the required availability level justifies that cost demonstrates a lack of FinOps judgment. Interviewers with budget responsibility always ask how much the proposed architecture would cost.

Not treating security as part of the architecture design but as a subsequent step

Overly permissive security groups, publicly accessible S3 buckets by default, and broadly permissioned IAM roles are the most frequent and most costly errors to remediate after the fact. A Cloud Engineer who does not mention least privilege, encryption, and network segmentation in their design is proposing architectures that will require remediation work later.

Describing Terraform without mentioning state management, locking, and module strategies

Terraform without a team state management strategy produces conflicts, state corruption, and resource loss. A candidate who describes their Terraform usage without mentioning the remote backend, DynamoDB locking, and module organization demonstrates experience with Terraform on individual projects — not in teams.

Not having tested the disaster recovery mechanisms they designed

A DR plan that has never been executed under controlled conditions is a hypothesis. Interviewers with operations experience ask when the database failover was last tested, how long it actually took, and what unexpected problems appeared during the test. Having no answer to that question is an operational red flag.

Proposing Kubernetes for any workload without evaluating whether the added complexity is justified

Kubernetes solves real problems at scale but adds a significant layer of operational complexity. A Cloud Engineer who proposes EKS for an application that could run perfectly on ECS Fargate or even Lambda demonstrates technology preference over problem-fit judgment. Interviewers at teams with real operational costs ask why Kubernetes and not the simpler alternative.

Not mentioning observability and monitoring when describing production architectures

A cloud architecture without configured metrics, logs, and alerts is not production-ready. A Cloud Engineer who describes their architecture without mentioning how they know it is working correctly, how they detect a problem before the user reports it, and how they diagnose the cause demonstrates an incomplete view of operating production systems.

Cloud Engineer

Designs and operates the cloud infrastructure that allows products to scale reliably, securely, and cost-efficiently.

A Cloud Engineer designs, implements, and maintains the cloud infrastructure and services that underpin the organization's applications and operations. Their work spans cloud network and compute architecture through infrastructure automation, cost management, and security control implementation. Unlike a DevOps Engineer focused on pipelines and continuous delivery, the Cloud Engineer has deeper specialization in the cloud platform: they have in-depth knowledge of the chosen provider's services, their limits, their costs, and cloud-native architecture best practices. They work closely with development, security, and operations teams.

AWS, Terraform, Kubernetes, Networking, FinOps, Cloud security

Main Responsibilities

• Design scalable, resilient, and secure cloud-native architectures using cloud provider services according to product needs.
• Implement and maintain infrastructure as code with Terraform or Pulumi, guaranteeing reproducibility, versioning, and state management.
• Manage cloud networks: VPCs, subnets, security groups, route tables, NAT gateways, and hybrid connectivity with on-premise environments.
• Implement cloud security controls: least-privilege IAM, encryption at rest and in transit, audit logging, and secrets management.
• Optimize cloud infrastructure cost by identifying underutilized resources and designing rightsizing and reserved instance strategies.
• Ensure system high availability and resilience by designing for failure: multi-AZ, automatic failover, and disaster recovery.

Key Skills

Technical Skills

Depth in at least one cloud provider (AWS, GCP, or Azure): compute, networking, storage, database, and security services
Infrastructure as code with Terraform: modules, state management, workspaces, and infrastructure testing
Kubernetes for container orchestration: deployments, services, ingress, RBAC, persistent volumes, and resource management
Advanced cloud networking: VPC design, network segmentation, peering, Transit Gateway, and hybrid connectivity
Cloud security: IAM, SCPs, config rules, Security Hub, and CSPM tools for security posture management
FinOps: cost management tools, Reserved Instance and Savings Plan strategies, and cost anomaly analysis

Soft Skills

Reliability engineering mindset: design for failure before designing for the happy path
Rigorous documentation of architecture decisions so others can operate and evolve the infrastructure
Collaboration with development teams to design applications that leverage cloud capabilities without creating excessive dependencies
Cost judgment: evaluating the trade-off between performance, availability, and cost in every architecture decision
Communicating infrastructure technical risk in terms of business impact for non-technical stakeholders
Proactiveness in identifying capacity, security, or cost problems before they become incidents

Real use cases

Context

Applications with global users or high availability requirements need architectures that tolerate the failure of an entire region without interrupting service.

Real examples

Active-passive architecture with automatic failover on complete region failure
Global traffic distribution with Route 53 or Cloud Load Balancing based on latency and availability
Cross-region data replication with defined and periodically tested RPO and RTO
CDN for static assets with edge caching that reduces load on origin servers

Context

Cloud costs can scale disproportionately if not managed proactively. Organizations without FinOps frequently discover unexpectedly high bills when usage grows.

Real examples

Underutilized resource audit: idle instances, unattached volumes, and old snapshots
Reserved Instance and Savings Plan implementation for predictable workloads reducing costs by 30-60%
Comprehensive tagging strategy that attributes costs by team, environment, and product
Cost anomaly alerts that detect unexpected increases before they impact the monthly bill

Context

A poorly designed cloud network is the broadest attack surface in the infrastructure. Correct network isolation limits the blast radius of any compromise.

Real examples

VPC design with public, private, and data subnets segmented by function
Security Group and NACL implementation following the least privilege principle
Private connectivity to cloud services without exposing traffic to the internet (VPC Endpoints)
Hub-and-spoke architecture with Transit Gateway to manage connectivity between multiple VPCs

Context

Infrastructure as code is the prerequisite of any reproducible, auditable, and scalable cloud operation. Without IaC, every environment is a snowflake that is difficult to replicate.

Real examples

Reusable Terraform modules for the team's most common infrastructure patterns
CI pipelines that run plan, validate, and security checks before every apply
Multiple environment management (dev, staging, prod) with the same code and different variable values
Automatic drift detection between the state declared in Terraform and the actual cloud resources

Context

DR plans that are not regularly tested are hypotheses, not guarantees. A Cloud Engineer designs and tests recovery mechanisms before they are needed.

Real examples

Real RTO and RPO definition and validation through scheduled DR drills
Backup strategies with point-in-time recovery for critical databases
Automated runbooks for the most probable failover scenarios
Controlled chaos engineering to validate that resilience mechanisms work in production