Senior Site Reliability Engineer
Canada, Global collaboration across multiple time zones
Full-Time
Senior
Salary not disclosed
Job Details
- Languages: English
- Experience: 6+ years
- Required Skills: Python, Bash, Kubernetes, Ruby, Go, Linux, DevOps, Ansible
Requirements
- 6+ years of experience in Site Reliability Engineering, DevOps, or infrastructure operations roles within complex distributed systems.
- Strong proficiency in Linux systems administration, troubleshooting, and performance tuning.
- Experience with scripting languages such as Python, Bash, Go, or Ruby for automation and operational tooling.
- Hands-on experience with configuration management tools such as Puppet or Ansible.
- Solid understanding of distributed systems, caching technologies, and system optimization techniques.
- Experience with Linux package management (e.g., on Debian-based systems).
- Proven track record of automating operational processes and identifying opportunities for system improvement.
- Experience participating in incident response, postmortems, and reliability engineering practices.
- Strong communication skills in English, with the ability to work effectively in a fully remote, globally distributed team.
- Ability to work independently while collaborating across multiple time zones and teams.
Responsibilities
- Perform day-to-day operations and DevOps responsibilities across large-scale public-facing infrastructure, including deployment, configuration, maintenance, and troubleshooting.
- Manage and optimize configuration and deployment systems using tools such as Puppet and Kubernetes.
- Automate infrastructure provisioning, service deployment, and operational workflows to improve reliability and efficiency.
- Collaborate with product and engineering teams to design scalable architectures and ensure systems operate reliably under global traffic loads.
- Participate in a 24/7 on-call rotation, handling incident response, system alerts, troubleshooting, and post-incident reviews.
- Conduct root cause analysis of production incidents and implement preventive measures to improve system stability.
- Contribute to system monitoring, observability, and performance optimization initiatives.
- Mentor engineers and share operational expertise within a distributed, cross-functional team environment.
- Work asynchronously with global teams while ensuring clear and effective technical communication.