Staff Site Reliability Engineer

Posted 10 months ago

Costa Rica, Full-Time, Information services

Company:

Location: Costa Rica

Languages: English

Seniority level: Staff, 5+ years

Experience: 5+ years

Skills:

AWS, Docker, Python, Git, Jenkins, Kubernetes, Prometheus, Linux

Requirements:

5+ years of direct experience supporting complex, large-scale systems in production.
Linux knowledge, with experience troubleshooting and anticipating issues.
Networking, troubleshooting, and monitoring skills.
Experience designing cloud-native applications for performance and resilience.
Incident management and coordination skills, including running blameless post-incident reviews.
Familiarity with technologies such as Kubernetes, Splunk, Dynatrace, ServiceNow, Jira, Jenkins, Python, Prometheus, Java, Cassandra, Redis, MongoDB, AWS, and Infrastructure as Code.

Responsibilities:

Maintain the uptime of Experian One, Experian's cloud SaaS offering for Decision Analytics.
Monitor platform performance and provide alerting.
Respond to incidents and restore service promptly.
Understand the systems well enough to assess issues and route them to the right owner for resolution.
Identify and eliminate manual processes to prevent issues from recurring.
Manage incidents and coordinate responders during service disruptions.
Write complex queries using various tools.
Review system designs to identify resiliency, scalability, and monitoring issues.