Site Reliability Engineer

Posted 7 days ago
APAC, Contract, Software Development
Company: Rocket.Chat
Location: APAC
Languages: English
Seniority level: Senior
Skills:
AWS, Python, Bash, GCP, Kubernetes, MongoDB, Azure, Go, Grafana, Prometheus, Redis, CI/CD, Linux, DevOps, Terraform, Networking, Ansible, Software Engineering
Requirements:
Strong software engineering background with expertise in large-scale distributed systems.
Expertise in Kubernetes, including operator development.
Expertise in cloud platforms (e.g., AWS, GCP, Azure, OVH).
Proficiency in Go, Python, or Bash for tooling and operator development.
Deep, hands-on experience with monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Loki).
Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, Pulumi, Ansible).
Experience with CI/CD pipelines using tools like ArgoCD.
Solid understanding of networking fundamentals (TCP/IP, DNS, routing) and security principles.
Familiarity with database technologies such as MongoDB or Redis.
Responsibilities:
Design, develop, and maintain Kubernetes Operators for managed hosting.
Manage the reliability and performance of core infrastructure, including Kubernetes clusters and services.
Define, monitor, and enforce SLOs; manage error budgets; and implement monitoring solutions.
Develop automation frameworks for deployment, configuration, and operational tasks.
Lead incident management and on-call response, and conduct blameless post-mortems.
Collaborate with Engineering, Security, and QA to integrate reliability best practices.
Conduct proactive load testing, performance analysis, and chaos engineering experiments.