Talently
Cloud Engineer

Designs and operates the cloud infrastructure that allows products to scale reliably, securely, and cost-efficiently.

A Cloud Engineer designs, implements, and maintains the cloud infrastructure and services that underpin the organization's applications and operations. Their work spans cloud network and compute architecture through infrastructure automation, cost management, and security control implementation. Unlike a DevOps Engineer focused on pipelines and continuous delivery, the Cloud Engineer has deeper specialization in the cloud platform: they have in-depth knowledge of the chosen provider's services, their limits, their costs, and cloud-native architecture best practices. They work closely with development, security, and operations teams.

AWS · Terraform · Kubernetes · Networking · FinOps · Cloud security

Main Responsibilities

  • Design scalable, resilient, and secure cloud-native architectures using cloud provider services according to product needs.
  • Implement and maintain infrastructure as code with Terraform or Pulumi, guaranteeing reproducibility, versioning, and state management.
  • Manage cloud networks: VPCs, subnets, security groups, route tables, NAT gateways, and hybrid connectivity with on-premises environments.
  • Implement cloud security controls: least-privilege IAM, encryption at rest and in transit, audit logging, and secrets management.
  • Optimize cloud infrastructure cost by identifying underutilized resources and designing rightsizing and reserved instance strategies.
  • Ensure system high availability and resilience by designing for failure: multi-AZ, automatic failover, and disaster recovery.

Key Skills

Technical Skills

  • Depth in at least one cloud provider (AWS, GCP, or Azure): compute, networking, storage, database, and security services
  • Infrastructure as code with Terraform: modules, state management, workspaces, and infrastructure testing
  • Kubernetes for container orchestration: deployments, services, ingress, RBAC, persistent volumes, and resource management
  • Advanced cloud networking: VPC design, network segmentation, peering, Transit Gateway, and hybrid connectivity
  • Cloud security: IAM, SCPs, AWS Config rules, Security Hub, and CSPM tools for security posture management
  • FinOps: cost management tools, Reserved Instance and Savings Plan strategies, and cost anomaly analysis

Soft Skills

  • Reliability engineering mindset: design for failure before designing for the happy path
  • Rigorous documentation of architecture decisions so others can operate and evolve the infrastructure
  • Collaboration with development teams to design applications that leverage cloud capabilities without creating excessive dependencies
  • Cost judgment: evaluating the trade-off between performance, availability, and cost in every architecture decision
  • Communicating infrastructure technical risk in terms of business impact for non-technical stakeholders
  • Proactiveness in identifying capacity, security, or cost problems before they become incidents

Real use cases

Context

Applications with global users or high availability requirements need architectures that tolerate the failure of an entire region without interrupting service.

Real examples

  • Active-passive architecture with automatic failover on complete region failure
  • Global traffic distribution with Route 53 or Cloud Load Balancing based on latency and availability
  • Cross-region data replication with defined and periodically tested RPO and RTO
  • CDN for static assets with edge caching that reduces load on origin servers

Context

Cloud costs can scale disproportionately if not managed proactively. Organizations without FinOps frequently discover unexpectedly high bills when usage grows.

Real examples

  • Underutilized resource audit: idle instances, unattached volumes, and old snapshots
  • Reserved Instance and Savings Plan implementation for predictable workloads reducing costs by 30-60%
  • Comprehensive tagging strategy that attributes costs by team, environment, and product
  • Cost anomaly alerts that detect unexpected increases before they impact the monthly bill
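
The anomaly alerts in the last item can be approximated with a simple baseline comparison. This is a minimal sketch, assuming daily cost figures have already been exported (for instance from Cost Explorer). The function name `find_cost_anomalies`, the 7-day window, and the 1.5x threshold are illustrative assumptions, not AWS defaults.

```python
from statistics import mean


def find_cost_anomalies(daily_costs, window=7, threshold=1.5):
    """Return the indexes of days whose cost exceeds threshold x the trailing average.

    daily_costs: daily spend values in chronological order.
    window and threshold are hypothetical tuning values.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        baseline = mean(daily_costs[i - window:i])
        if baseline > 0 and daily_costs[i] > threshold * baseline:
            anomalies.append(i)
    return anomalies


# Hypothetical daily spend in dollars; day 8 contains a spike.
costs = [100, 102, 98, 101, 99, 103, 100, 104, 260, 101]
print(find_cost_anomalies(costs))
```

A managed service would normally replace this logic in production; the sketch only shows the idea of comparing each day against a recent baseline.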

Context

A poorly designed cloud network is the broadest attack surface in the infrastructure. Correct network isolation limits the blast radius of any compromise.

Real examples

  • VPC design with public, private, and data subnets segmented by function
  • Security Group and NACL implementation following the least privilege principle
  • Private connectivity to cloud services without exposing traffic to the internet (VPC Endpoints)
  • Hub-and-spoke architecture with Transit Gateway to manage connectivity between multiple VPCs

Context

Infrastructure as code is the prerequisite of any reproducible, auditable, and scalable cloud operation. Without IaC, every environment is a snowflake that is difficult to replicate.

Real examples

  • Reusable Terraform modules for the team's most common infrastructure patterns
  • CI pipelines that run plan, validate, and security checks before every apply
  • Multiple environment management (dev, staging, prod) with the same code and different variable values
  • Automatic drift detection between the state declared in Terraform and the actual cloud resources
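
Drift detection in the last item comes down to comparing what the code declares with what exists in the cloud. In practice `terraform plan -detailed-exitcode` can signal pending changes to a pipeline; the sketch below only illustrates the comparison itself on plain dictionaries. The resource names and attributes are hypothetical.

```python
def detect_drift(declared, actual):
    """Compare declared and actual resources; return a description per drifted resource."""
    drift = {}
    for resource_id in sorted(set(declared) | set(actual)):
        want = declared.get(resource_id)
        have = actual.get(resource_id)
        if want is None:
            drift[resource_id] = "unmanaged (exists in cloud, not in code)"
        elif have is None:
            drift[resource_id] = "missing (declared, not found in cloud)"
        elif want != have:
            changed = sorted(
                key for key in set(want) | set(have) if want.get(key) != have.get(key)
            )
            drift[resource_id] = "changed: " + ", ".join(changed)
    return drift


# Hypothetical declared state (from code) and actual state (from the provider API).
declared = {
    "sg-app": {"ingress_port": 8080, "cidr": "10.0.1.0/24"},
    "bucket-logs": {"versioning": True},
}
actual = {
    "sg-app": {"ingress_port": 8080, "cidr": "0.0.0.0/0"},
    "vol-orphan": {"size_gb": 100},
}
for resource_id, finding in detect_drift(declared, actual).items():
    print(resource_id, "->", finding)
```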

Context

DR plans that are not regularly tested are hypotheses, not guarantees. A Cloud Engineer designs and tests recovery mechanisms before they are needed.

Real examples

  • Real RTO and RPO definition and validation through scheduled DR drills
  • Backup strategies with point-in-time recovery for critical databases
  • Automated runbooks for the most probable failover scenarios
  • Controlled chaos engineering to validate that resilience mechanisms work in production

Basic questions

A region is a geographic area with multiple data centers. An Availability Zone (AZ) is a data center or group of data centers within a region with independent power, networking, and connectivity. For high availability within a region: deploy resources in at least two or three AZs to tolerate the failure of an entire AZ. For resilience against a full regional failure: multi-region architecture with data replication and traffic failover. Multi-AZ high availability covers most cases; multi-region is only justified when availability requirements or data residency regulations demand it, given its greater complexity and cost.
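
As a rough illustration of why adding AZs helps, the sketch below computes the combined availability of redundant deployments. It assumes AZ failures are independent and that a single healthy AZ can carry the full load, which real systems only approximate. The 99% per-AZ figure is a hypothetical input, not a published SLA.

```python
def combined_availability(per_az_availability, az_count):
    """Availability of a service that stays up while at least one AZ is healthy.

    Assumes independent AZ failures, which is a simplification.
    """
    all_fail = (1 - per_az_availability) ** az_count
    return 1 - all_fail


for azs in (1, 2, 3):
    print(azs, "AZ(s):", f"{combined_availability(0.99, azs):.6f}")
```

The diminishing return from the second to the third AZ is one reason multi-AZ usually suffices before multi-region is considered.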
Least privilege means each user, role, or service has only the permissions required for its specific function and nothing more. Practical implementation: IAM roles by function (developer read-only in production, deploy role for the CI/CD pipeline, database admin role scoped to RDS), no long-lived individual IAM users where a more secure alternative exists, Service Control Policies in AWS Organizations to establish limits no role in the account can exceed, and periodic audits with IAM Access Analyzer to detect unused permissions. The last-accessed information in IAM (the Access Advisor tab) shows when each role last used each service; services not used within a chosen review window, for example the last 90 days, are candidates for removal from the policy.
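
A periodic audit can start with a simple check for wildcards in policy documents. The sketch below is a minimal illustration that works on a policy expressed as a Python dictionary in the standard IAM JSON shape. The function name, the example policy, and the bucket name are hypothetical, and the check is far less thorough than IAM Access Analyzer.

```python
def find_wildcards(policy):
    """Return (Sid, field) pairs for Allow statements with wildcard actions or resources."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        sid = statement.get("Sid", "<no Sid>")
        actions = statement.get("Action", [])
        resources = statement.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        # Flag "*" and service-wide actions such as "ec2:*".
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append((sid, "Action"))
        # Flag only the fully open resource; scoped ARNs with a trailing /* are accepted.
        if any(r == "*" for r in resources):
            findings.append((sid, "Resource"))
    return findings


# Hypothetical deploy-role policy with one scoped and one overly broad statement.
deploy_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadArtifacts",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-artifacts/*",
        },
        {"Sid": "TooBroad", "Effect": "Allow", "Action": "ec2:*", "Resource": "*"},
    ],
}
print(find_wildcards(deploy_policy))
```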
Enable automated backups with at least 7-day retention for point-in-time recovery. Enable Multi-AZ for a synchronously replicated standby with automatic failover on primary failure, which typically completes in about one to two minutes. For cross-region DR, configure cross-region automated backups, which replicate snapshots and transaction logs to a secondary region; the replication lag sets the RPO that can actually be achieved. Test restoration periodically and measure the actual recovery time to verify the RTO is achievable. Backups that have never been restored are hypotheses, not guarantees.
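
The result of a restore test can be checked against the targets with a few lines of arithmetic. This is a minimal sketch; the timestamps and the 5-minute RPO and 1-hour RTO targets are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_recoverable, failure, restored, rpo, rto):
    """Compare what a restore drill achieved against the RPO and RTO targets."""
    data_loss = failure - last_recoverable  # window of data that could not be restored
    recovery_time = restored - failure      # how long the service was unavailable
    return {
        "data_loss": data_loss,
        "recovery_time": recovery_time,
        "rpo_met": data_loss <= rpo,
        "rto_met": recovery_time <= rto,
    }


drill = evaluate_drill(
    last_recoverable=datetime(2024, 1, 10, 11, 56),
    failure=datetime(2024, 1, 10, 12, 0),
    restored=datetime(2024, 1, 10, 13, 15),
    rpo=timedelta(minutes=5),
    rto=timedelta(hours=1),
)
print("RPO met:", drill["rpo_met"], "| RTO met:", drill["rto_met"])
```

In this example the data loss stays within the RPO, but the restore takes longer than the RTO, which is exactly the kind of gap a drill is meant to reveal.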
Reserved Instances (RIs) commit to a specific instance configuration in a region for 1 or 3 years in exchange for discounts that are typically 30-60% over on-demand pricing. Savings Plans offer a similar discount in exchange for committing to a minimum hourly compute spend, with greater flexibility to change instance type. RIs are optimal when the workload has a stable and predictable instance type. Savings Plans are preferable when there is instance type variability or when multiple compute services are used (EC2, Lambda, Fargate). Spot Instances suit interruption-tolerant workloads such as batch processing, CI/CD runners, or stateless workers, and can save up to 90% over on-demand pricing.
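
The trade-off can be made concrete with simple arithmetic. The sketch below uses hypothetical figures (a $0.10 hourly on-demand rate, 10 instances, a 40% commitment discount) and a 730-hour month. It also shows that a commitment paid for every hour only beats on-demand when the workload runs for more than (1 - discount) of the time.

```python
HOURS_PER_MONTH = 730  # common approximation of the average month


def monthly_cost(on_demand_hourly, instance_count, discount=0.0):
    """Monthly cost of instances running all month at a given discount."""
    return on_demand_hourly * (1 - discount) * HOURS_PER_MONTH * instance_count


def break_even_utilization(discount):
    """Fraction of hours a workload must run for a full-time commitment to pay off."""
    return 1 - discount


on_demand = monthly_cost(0.10, 10)
committed = monthly_cost(0.10, 10, discount=0.40)
print(f"on-demand: ${on_demand:.2f}, committed: ${committed:.2f}")
print(f"break-even utilization: {break_even_utilization(0.40):.0%}")
```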
Immediately activate AWS Cost Explorer with daily granularity to identify which service and which account the increase started in. Review the Cost and Usage Report to identify the specific resource with the greatest change. Common causes: instances not terminated after an experiment, an S3 bucket with massive data transfer (data egress), a code loop making excessive AWS API calls, or compromised credentials used for cryptocurrency mining. Contain the cost immediately if it is an unnecessary resource. Configure budget alerts with notifications at 80% and 100% of expected monthly spend so this is not discovered at month-end again.
A VPC Endpoint allows resources inside a VPC to communicate with AWS services (S3, DynamoDB, Secrets Manager) over the AWS private network without traffic leaving to the internet. Without VPC Endpoints, instances in private subnets need a NAT Gateway to reach these services through their public endpoints, which adds NAT processing cost and widens the paths that must be secured. With VPC Endpoints: traffic stays on the AWS network, bucket policies can restrict access to only the specific VPC or endpoint, and the NAT Gateway cost for that traffic is eliminated. Gateway endpoints (S3 and DynamoDB) carry no additional charge, while interface endpoints are billed per hour and per GB processed. They are especially critical for S3 and DynamoDB access from instances handling sensitive data.
The most secure option is using AWS Secrets Manager or Parameter Store integrated with Kubernetes via the Secrets Store CSI Driver: secrets are stored in AWS, accessed through IAM roles assigned to pods using IRSA (IAM Roles for Service Accounts), and mounted as files in the pod filesystem without going through Kubernetes etcd, where native Kubernetes Secrets are only Base64-encoded (an encoding, not encryption) unless encryption at rest is configured. Avoid hardcoding secrets in environment variables of deployment manifests that are stored in git. Enable automatic secret rotation in Secrets Manager so the application always uses fresh credentials.
Implement AWS Organizations with an account strategy per team or environment for billing isolation. Mandatory tagging of all resources with team, project, and environment tags using SCPs that reject resource creation without the required tags. Configure AWS Budgets per account with alerts to each account's owners. Publish a shared cost dashboard in Cost Explorer accessible to each team showing only their accounts. Monthly FinOps reviews where each team reports their optimizations and anomalies. Cost accountability only works when teams have visibility into their spend and the ability to act on it.
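
Tag compliance and cost attribution can both be derived from the same resource inventory. The sketch below is a minimal illustration on hypothetical data; the resource IDs, costs, and the `untagged` bucket are assumptions, and a real inventory would come from billing exports or provider APIs.

```python
from collections import defaultdict

REQUIRED_TAGS = ("team", "project", "environment")


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required tag names that are absent or empty."""
    return [name for name in required if not tags.get(name)]


def cost_by_team(resources):
    """Sum monthly cost per team; resources without a team tag go to 'untagged'."""
    totals = defaultdict(float)
    for resource in resources:
        team = resource["tags"].get("team") or "untagged"
        totals[team] += resource["monthly_cost"]
    return dict(totals)


# Hypothetical inventory.
resources = [
    {"id": "i-001", "monthly_cost": 120.0,
     "tags": {"team": "payments", "project": "api", "environment": "prod"}},
    {"id": "i-002", "monthly_cost": 80.0,
     "tags": {"team": "search", "environment": "dev"}},
    {"id": "vol-003", "monthly_cost": 15.0, "tags": {}},
]

for resource in resources:
    print(resource["id"], "missing:", missing_tags(resource["tags"]))
print(cost_by_team(resources))
```

The size of the `untagged` bucket is a useful metric in itself: it measures how much spend nobody is accountable for.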

Technical questions

VPC with at least three subnet tiers: public (ALB and NAT Gateway only), private application tier (EC2 instances or ECS tasks), and private data tier (RDS, ElastiCache). Cascading Security Groups: the ALB accepts internet traffic (80/443), application instances accept only traffic from the ALB's security group on the application port, and databases accept only traffic from the application security group on the engine port. NACLs as a second line of defense at the subnet level. VPC Endpoints for S3 and Secrets Manager. VPC Flow Logs enabled for traffic auditing. This architecture ensures no application or data tier component is directly reachable from the internet.
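
The cascading security group rule set can be modeled as data and checked automatically. This is a simplified sketch, not the AWS API: the group names and the application and database ports (8080, 5432) are hypothetical, and each rule's source is either a CIDR or the name of another group.

```python
INTERNET = "0.0.0.0/0"

# Each group lists its inbound rules; "source" is a CIDR or another group's name.
security_groups = {
    "alb-sg": [{"port": 443, "source": INTERNET}, {"port": 80, "source": INTERNET}],
    "app-sg": [{"port": 8080, "source": "alb-sg"}],
    "db-sg": [{"port": 5432, "source": "app-sg"}],
}


def internet_facing(groups):
    """Return the groups that accept inbound traffic directly from the internet."""
    return sorted(
        name
        for name, rules in groups.items()
        if any(rule["source"] == INTERNET for rule in rules)
    )


def allowed_sources(groups, name):
    """Return the set of sources a group accepts traffic from."""
    return {rule["source"] for rule in groups[name]}


print("internet-facing:", internet_facing(security_groups))
print("db-sg accepts from:", allowed_sources(security_groups, "db-sg"))
```

A check like this fits naturally in a pipeline: the build fails if anything other than the load balancer group appears in the internet-facing list.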
For ECS: Application Auto Scaling with CloudWatch metric-based policies. Business metrics (number of messages in an SQS queue, application requests per second, latency percentiles) are published as custom CloudWatch metrics. The reactive scaling policy adjusts the number of tasks when the metric crosses the threshold. For Kubernetes: HPA (Horizontal Pod Autoscaler) consumes metrics from the cluster's metrics API or from external metric adapters (Prometheus Adapter, KEDA). KEDA is especially powerful: it allows scaling directly based on a Kafka consumer group's lag, the number of SQS messages, or any external metric without writing custom scaling code.
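
The Kubernetes documentation describes the core HPA calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance (0.1 by default). The sketch below reproduces that formula in isolation. The replica bounds and metric values are hypothetical, and the real controller applies further behavior, such as stabilization windows, that is not modeled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Apply the documented HPA ratio formula, clamped to the replica bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical metric: queue messages per replica, with a target of 100.
print(desired_replicas(4, current_metric=200, target_metric=100))  # scale out
print(desired_replicas(4, current_metric=105, target_metric=100))  # within tolerance
print(desired_replicas(4, current_metric=50, target_metric=100))   # scale in
```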
AWS Secrets Manager with automatic rotation manages this process. When rotation is triggered, Secrets Manager creates a new secret version with new credentials (staged as AWSPENDING), sets them in the database, tests that they work, and then promotes the new version to AWSCURRENT while the previous version is labeled AWSPREVIOUS. With the alternating-users rotation strategy, the previous credentials remain valid during the transition period; with single-user rotation, the old password stops working as soon as it is changed. Applications that retrieve the secret on each connection automatically get the new credentials. For applications with persistent connection pools, implement a pool refresh mechanism when authentication fails with the current credentials: the client fetches the new credentials and reconnects. This pattern allows rotation without downtime.
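
The pool refresh mechanism described above can be sketched with stand-in classes. Everything here is hypothetical: `FakeSecretStore` and `FakeDatabase` replace the real secrets service and database driver so that the retry logic can be shown in isolation.

```python
class AuthenticationError(Exception):
    """Raised by the stand-in database when credentials are rejected."""


class FakeSecretStore:
    """Stand-in for a secrets service; returns the current password."""

    def __init__(self, password):
        self.password = password

    def get_current(self):
        return self.password


class FakeDatabase:
    """Stand-in for a database that accepts exactly one password."""

    def __init__(self, valid_password):
        self.valid_password = valid_password

    def connect(self, password):
        if password != self.valid_password:
            raise AuthenticationError("invalid credentials")
        return "connection"


class RefreshingClient:
    """Caches credentials and refreshes them once when authentication fails."""

    def __init__(self, store, database):
        self.store = store
        self.database = database
        self.cached_password = store.get_current()
        self.refreshes = 0

    def connect(self):
        try:
            return self.database.connect(self.cached_password)
        except AuthenticationError:
            # Credentials were probably rotated: fetch the current version and retry once.
            self.cached_password = self.store.get_current()
            self.refreshes += 1
            return self.database.connect(self.cached_password)


store = FakeSecretStore("old-password")
database = FakeDatabase("old-password")
client = RefreshingClient(store, database)
client.connect()

# Simulate a rotation: the store and the database both move to the new password.
store.password = "new-password"
database.valid_password = "new-password"
print(client.connect(), "after", client.refreshes, "refresh")
```

Retrying only once keeps a genuine credential problem from turning into an endless loop against the database.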
The three pillars in the AWS ecosystem. Metrics: CloudWatch Metrics for AWS infrastructure metrics and custom application metrics, with per-service dashboards and alarms with automatic actions. Logs: CloudWatch Logs with log groups structured per service, CloudWatch Logs Insights for ad-hoc queries, and S3 export for long-term retention. Traces: AWS X-Ray for distributed tracing between services, with trace-to-log correlation using the trace ID. For organizations preferring provider-agnostic tools, OpenTelemetry serves as the instrumentation layer that sends to X-Ray, Datadog, or any OTLP backend. The key is that the trace correlation ID must be propagated in every inter-service request to follow the complete path of a request.
Implement just-in-time access with tools like AWS IAM Identity Center (SSO) with temporary access via privilege elevation. The engineer requests temporary elevated access (maximum 1-4 hours), the system records the request with the reason, and access is granted automatically or with approval. AWS Systems Manager Session Manager provides shell access to EC2 instances without opening SSH ports, with complete session logging to S3 or CloudWatch Logs. For databases: access via a managed bastion or RDS Proxy with temporary IAM authentication. All production access must be logged with the user's identity, time, duration, and actions taken for audit purposes.
Separate into three levels. Base infrastructure modules (shared repository): reusable abstractions for the most frequent resources such as a standard VPC, ECS cluster, and RDS instance with security best practices baked in. Teams consume these modules without being able to override the security controls. Environment modules (per-team repository): composition of the base modules with team-specific values (sizes, regions, configuration). Root modules per environment: dev, staging, prod each with their own state files in S3 and locking in DynamoDB. Sensitive variables are never hardcoded in the code: they are injected as environment variables in the pipeline or read from Secrets Manager. Version modules with semantic versioning so teams control when they adopt base module changes.
Lambda for the compute layer: scales from zero to thousands of concurrent instances in seconds without server management. API Gateway for the HTTP layer with authentication and throttling (rate limiting). DynamoDB for storage with on-demand billing that scales automatically without provisioning. SQS to decouple event ingestion from processing when spikes are very abrupt. EventBridge for event routing between Lambda functions, with Step Functions where explicit workflow orchestration is needed. Critical considerations: Lambda cold starts (mitigable with provisioned concurrency for critical functions), concurrency limits per function and per account (configure reserved concurrency for critical functions), and per-invocation cost that can exceed persistent instance cost if traffic is very high and constant.
Lambda: for short-lived event-driven workloads (under 15 minutes), with highly variable or sporadic traffic, and where server management is an unjustified overhead. ECS Fargate: for containerized applications when the team has no Kubernetes experience, with predictable load, and where server abstraction is desired. EKS: when Kubernetes is needed for cross-cloud portability, a rich tooling ecosystem (service mesh, GitOps, KEDA), or when the team already has expertise. EC2: for workloads with specific hardware requirements, that need direct OS access, or when cost is the absolute priority with Reserved Instances. The decision also factors in team expertise: the best service is the one the team can operate with confidence.

Advanced questions

PCI-DSS requires isolation of the cardholder data environment (CDE) from the rest of the infrastructure. Layered architecture with strict segmentation: the CDE in a separate VPC accessible exclusively from the payment application layer, never directly from the internet. Mandatory controls: data encryption at rest with CMKs managed by KMS, TLS 1.2+ in transit, complete logging of all CDE access in CloudTrail with immutable delivery to S3. For 99.99% availability (under 53 minutes of annual downtime): active-active multi-AZ for all critical path components, active-passive multi-region with sub-minute automated failover, circuit breakers for external dependencies, and monthly chaos engineering tests verifying failover mechanisms work.
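
The downtime budget behind an availability target is simple arithmetic, shown below for a 365-day year. The list of targets is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(availability_percent):
    """Maximum minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {annual_downtime_minutes(target):.2f} minutes per year")
```

At 99.99% the budget is roughly 52.6 minutes per year, which leaves little room for manual failover and explains the emphasis on automation.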
Start with an application inventory and classification using the 7 Rs of migration: Retire (systems to be decommissioned), Retain (systems staying on-premises for now), Rehost (lift-and-shift to EC2), Relocate (moving an existing platform such as VMware to its cloud equivalent without changing the workloads), Repurchase (replacing the system with a SaaS product), Replatform (small optimizations: RDS instead of the on-premises engine), and Refactor/Re-architect (redesign for cloud-native). The greatest value with the least risk is in Rehost and Replatform: migrate quickly with AWS Application Migration Service and optimize afterward. The most critical applications or those that benefit most from cloud-native merit Refactor but carry more time and risk. Establish a Landing Zone with the correct accounts, networks, and security controls before migrating the first application. Define clear success criteria for each migration wave.
Security in multiple layers. Development time: tfsec or Checkov in the Terraform pipeline blocking plans with critical misconfigurations (public S3 bucket, security group with 0.0.0.0/0 on sensitive ports). Deploy time: AWS Config Rules or Security Hub continuously evaluating security posture with automatic alerts on deviations. In production: Amazon GuardDuty for threat detection in CloudTrail and VPC Flow Logs, Amazon Macie for sensitive data identification in S3, and AWS Security Hub as the centralized panel aggregating findings from all security services. For multi-account: AWS Organizations with SCPs that prevent the most critical prohibited configurations at the organizational level. The goal is that security problems are detected and blocked in the pipeline rather than discovered in production.
A 30% reduction on a platform at that scale requires a structured FinOps program — not just one-off adjustments. Per-service analysis: identify the five services representing 80% of spend. For each, evaluate: is it the right size for the actual load? Are there workloads that can move to Spot or Savings Plans? Is there data that can move to cheaper storage tiers? Typically the greatest opportunities are: rightsizing of oversized instances (20-30% of savings), Reserved Instances for predictable workloads (15-25%), elimination of orphaned resources, and data transfer optimization. Instrument cost as a product metric: each team sees their weekly spend with trend. Distributed cost accountability produces continuous optimizations without needing a dedicated central team.
The GDPR challenge at scale is the proliferation of copies of user data. Implement a central personal data registry that maps which user's data lives in which bucket, table, and partition. For the right to be forgotten: design data with crypto-shredding where each user's data is encrypted with a unique per-user key in KMS. When a deletion request is received, delete the user's key in KMS: the data remains physically but is irrecoverable. KMS enforces a waiting period of 7 to 30 days before a key is actually deleted, so account for that delay in the deletion deadline. For data in historical backups where crypto-shredding was not implemented from the start: document the maximum retention period of the backups and ensure they are purged according to the policy. Use Amazon Macie for automatic discovery of personal data in S3 that was not catalogued.
Lock-in has real costs but also benefits: native services frequently offer better integration, simpler operation, and lower cost than provider-agnostic alternatives. The correct evaluation is not to eliminate lock-in but to manage the risk strategically. For each proprietary service, evaluate: does an equivalent exist on other cloud providers? How much would migration cost if needed? Does the functionality justify the dependency? Mitigation strategies without eliminating the benefits: abstract dependencies on proprietary services behind layers that can be re-implemented for another provider (a storage interface that uses S3 today can use GCS tomorrow), use OpenTelemetry for provider-agnostic instrumentation, and avoid lock-in in the critical data components where migration would be most costly.
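
The storage interface mentioned above can be sketched as a small protocol that application code depends on instead of a provider SDK. The names `ObjectStore`, `InMemoryStore`, and `archive_report` are hypothetical. An S3-backed or GCS-backed class would implement the same two methods, and the in-memory version shown here is the kind of implementation typically used in tests.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Minimal storage contract the application depends on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Test implementation; a provider-backed class would expose the same methods."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_report(store: ObjectStore, report_id: str, body: bytes) -> str:
    """Application code: knows only the ObjectStore contract, not the provider."""
    key = f"reports/{report_id}.txt"
    store.put(key, body)
    return key


store = InMemoryStore()
key = archive_report(store, "2024-01", b"monthly summary")
print(key, store.get(key))
```

The abstraction pays off only if it stays narrow: the more provider-specific features leak into the interface, the less portable it becomes.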

Common interview mistakes

Every cloud architecture decision has a cost. A Cloud Engineer who proposes multi-AZ, multi-region, and high redundancy for all components without evaluating whether the required availability level justifies that cost demonstrates a lack of FinOps judgment. Interviewers with budget responsibility always ask how much the proposed architecture would cost.
Overly permissive security groups, publicly accessible S3 buckets by default, and broadly permissioned IAM roles are the most frequent and most costly errors to remediate after the fact. A Cloud Engineer who does not mention least privilege, encryption, and network segmentation in their design is proposing architectures that will require remediation work later.
Terraform without a team state management strategy produces conflicts, state corruption, and resource loss. A candidate who describes their Terraform usage without mentioning the remote backend, DynamoDB locking, and module organization demonstrates experience with Terraform on individual projects — not in teams.
A DR plan that has never been executed under controlled conditions is a hypothesis. Interviewers with operations experience ask when the database failover was last tested, how long it actually took, and what unexpected problems appeared during the test. Having no answer to that question is an operational red flag.
Kubernetes solves real problems at scale but adds a significant layer of operational complexity. A Cloud Engineer who proposes EKS for an application that could run perfectly on ECS Fargate or even Lambda demonstrates technology preference over problem-fit judgment. Interviewers at teams with real operational costs ask why Kubernetes and not the simpler alternative.
A cloud architecture without configured metrics, logs, and alerts is not production-ready. A Cloud Engineer who describes their architecture without mentioning how they know it is working correctly, how they detect a problem before the user reports it, and how they diagnose the cause demonstrates an incomplete view of operating production systems.