Senior Reliability Engineer

Posted 5 days ago

💎 Seniority level: Senior, 7+ years

📍 Location: Canada

🔍 Industry: SaaS

🏢 Company: hive.co

🗣️ Languages: English

⏳ Experience: 7+ years

🪄 Skills: AWS, Docker, Python, SQL, Django, ElasticSearch, Kubernetes, MongoDB, MySQL, Algorithms, Clickhouse, Data Structures, Redis, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, SaaS

Requirements:
  • 7+ years of software engineering experience, with at least 5 years focused on reliability, infrastructure, or platform engineering
  • 3+ years of experience with AWS and proven ability to build effective monitoring, alerting, and observability solutions
  • Track record of implementing, maintaining, and improving SLOs and uptime KPIs for critical services
  • Expert knowledge of Linux, Docker, and distributed systems principles with their real-world applications
  • Solid programming skills in both application and infrastructure languages (Python, Go, etc.)
  • Strong grasp of security best practices and a data-driven approach to enhancing stability and availability
  • Excellent communication skills with the ability to collaborate effectively across teams and explain complex technical concepts clearly

Responsibilities:
  • Champion system observability improvements through implementation, maintenance, process refinement, and automation for business-critical services
  • Drive SLO adoption and improvement to ensure excellent customer satisfaction across key value streams
  • Enhance application performance at every level, from infrastructure foundations to runtime environments
  • Tackle and resolve complex technical challenges across the entire stack
  • Partner with development teams to design and implement scalable, reliable solutions
  • Lead security and compliance initiatives as integral components of our engineering practice
  • Craft and refine developer tools that boost team productivity and efficiency
  • Develop and implement strategies to optimize cloud infrastructure costs
  • Collaborate with DevOps to maintain and enhance deployment pipelines in our cloud environments
  • Contribute to incident management by defining meaningful metrics, executing against targets, and improving response times and overall system stability