Senior Reliability Engineer

Posted 5 days ago

💎 Seniority level: Senior, 7+ years

📍 Location: Canada

🔍 Industry: SaaS

🏢 Company: hive.co

🗣️ Languages: English

⏳ Experience: 7+ years

🪄 Skills: AWS, Docker, Python, SQL, Django, ElasticSearch, Kubernetes, MongoDB, MySQL, Algorithms, Clickhouse, Data Structures, Redis, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, SaaS

Requirements:

7+ years of software engineering experience, with at least 5 years focused on reliability, infrastructure, or platform engineering
3+ years of experience with AWS and proven ability to build effective monitoring, alerting, and observability solutions
Track record of implementing, maintaining, and improving SLOs and uptime KPIs for critical services
Expert knowledge of Linux, Docker, and distributed systems principles and their real-world applications
Solid programming skills in both application and infrastructure languages (Python, Go, etc.)
Strong grasp of security best practices and a data-driven approach to enhancing stability and availability
Excellent communication skills with the ability to collaborate effectively across teams and explain complex technical concepts clearly

Responsibilities:

Champion system observability improvements through implementation, maintenance, process refinement, and automation for business-critical services
Drive SLO adoption and improvement to ensure excellent customer satisfaction across key value streams
Enhance application performance at every level, from infrastructure foundations to runtime environments
Tackle and resolve complex technical challenges across the entire stack
Partner with development teams to design and implement scalable, reliable solutions
Lead security and compliance initiatives as integral components of our engineering practice
Craft and refine developer tools that boost team productivity and efficiency
Develop and implement strategies to optimize cloud infrastructure costs
Collaborate with DevOps to maintain and enhance deployment pipelines in our cloud environments
Contribute to incident management by defining meaningful metrics, executing against targets, and improving response times and overall system stability
