Cloud Operations Engineer

Posted about 2 months ago

Europe, Latin America | Full-Time | Hospitality Software

Company: Cloudbeds

Location: Europe, Latin America, EST, PST

Languages: English

Seniority level: Middle, 3+ years

Experience: 3+ years

Skills:

AWS, Docker, Python, Bash, Kubernetes, Jira, Grafana, Prometheus, Linux, Terraform

Requirements:

3+ years of experience in IT operations, technical support, or a related field, with hands-on exposure to monitoring tools such as DataDog, Prometheus/Grafana, or AWS CloudWatch.

Strong understanding of incident response procedures, escalation protocols, and emergency response workflows, with experience using ticketing systems (Jira) and project management tools.

Foundational networking skills and a basic understanding of AWS services, including EC2, S3, CloudWatch, and IAM.

Exposure to containerization concepts (Docker and Kubernetes) and a basic ability to read and understand Terraform.

Previous experience in a 24/7 operations environment, with hands-on use of PagerDuty or similar alerting systems.

Excellent written and verbal communication skills in English, with the ability to provide clear status updates during high-pressure incidents.

Detail-oriented, with strong documentation skills and the ability to work effectively across multiple teams in a fully remote, global environment.

Bachelor's degree in Computer Science or a related field, or equivalent practical experience demonstrating technical aptitude and problem-solving abilities.

Responsibilities:

Continuously monitor alerting channels (PagerDuty, DataDog, CloudWatch, Prometheus/Grafana), validate alerts, filter false positives, and provide first-line support for site operations and infrastructure issues.

Serve as the communication hub during incidents: provide regular status updates to stakeholders, escalate verified incidents to the appropriate on-call teams, and maintain incident bridges with proper handoffs.

Execute documented runbooks and standard operating procedures for common issues, including infrastructure access requests, basic troubleshooting, and deployment support activities.

Investigate initial security alerts, monitor application performance, and process routine change requests, configuration updates, and maintenance tasks across operational teams.

Create and maintain operational runbooks, update documentation based on incident learnings, and contribute to post-incident reviews to drive continuous improvement.

Assist with monitoring configuration, including adding new monitors, adjusting alert thresholds, and optimizing alerting systems to reduce noise and improve signal quality.

Work independently during off-hours shifts in a remote, global team environment while maintaining strong collaboration and knowing when to escalate complex issues.