Senior Site Reliability Engineer

Posted 24 days ago

💎 Seniority level: Senior, Minimum of 5 years

📍 Location: Canada, Chile

🔍 Industry: Technology

🏢 Company: Launchpad Technologies

⏳ Experience: Minimum of 5 years

🪄 Skills: AWS, Docker, Leadership, Python, Bash, GCP, Kubernetes, Ruby, Azure, Communication Skills, Analytical Skills, DevOps, Terraform, Documentation, Compliance, Troubleshooting

Requirements:
  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
  • Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or similar roles.
  • Familiarity with monitoring tools and systems.
  • Proficient in scripting languages such as Python, Bash, or Ruby.
  • Experience with infrastructure automation tools such as Terraform, Ansible, or Chef.
  • Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
  • Strong knowledge of cloud platforms such as AWS, GCP, or Azure.
  • Excellent troubleshooting and analytical skills.
  • Strong communication skills and the ability to work effectively within a team.
Responsibilities:
  • Develop, maintain, and improve automated deployment, certification, and validation pipelines.
  • Define, implement, and monitor service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs).
  • Lead efforts to optimize, improve, and maintain the reliability and performance of the SaaS platform.
  • Manage third-party services and technologies used to support the SRE discipline.
  • Collaborate with senior management and the engineering team to lead SRE initiatives and provide updates.
  • Define and implement an observability framework to provide insights into system performance and behavior.
  • Implement proactive monitors and alerts to ensure system reliability and performance meet customer expectations.
  • Own operational incident management, providing support to related teams and individuals during incident resolution.
  • Identify and implement best practices for system reliability, security, scalability, and performance.
  • Participate in on-call rotations for system support, troubleshooting, and resolution.
  • Conduct post-mortem reviews of incidents, identify root cause, and implement remediation steps.
  • Develop and maintain documentation for systems, processes, and procedures.

Related Jobs

📍 Canada

🔍 Financial Technology and HR

🏢 Company: ZayZoon

👥 51-100

💰 $11,042,809 Series B 9 months ago

Financial Services, FinTech, Software

  • 5+ years infrastructure experience.
  • 2+ years AWS experience including certification and deployment of production applications.
  • Proficiency with Infrastructure as Code (IaC), specifically CloudFormation.
  • Experience with containerization technologies such as Docker, ECS, and ECR.
  • Experience analyzing and addressing performance issues using observability platforms like Datadog, New Relic, and OpenTelemetry (OTel).
  • Strong SQL and data analysis skills with problem-solving focus.

  • Develop and maintain infrastructure-as-code CloudFormation templates, emphasizing serverless resources.
  • Instrument and analyze daily metrics of both infrastructure performance and applications using AWS tools and third-party platforms.
  • Manage deployment pipelines including blue/green deployment and auto-scaling.
  • Maintain resource dependencies, especially databases, including updates and downtime planning.
  • Project costs and implement cost-saving programs in AWS.
  • Collaborate with risk and security teams to ensure compliance.
  • Work closely with app and data engineers on shared metrics and load testing.
  • Participate in the agile development process.

AWS, Docker, SQL, Data Analysis, ETL, CI/CD

Posted 8 days ago

📍 Canada

🧭 Full-Time

🔍 Data integration technology

🏢 Company: Supermetrics

👥 251-500

💰 $47,174,818 Series B over 4 years ago

SaaS, Analytics, B2B, Marketing, Enterprise Software, Software

  • 4+ years of experience in Site Reliability Engineering, Platform Engineering, or similar roles.
  • Strong understanding of containers and experience with Kubernetes at scale.
  • Experience operating production databases, both relational and NoSQL.
  • In-depth knowledge of Linux systems and Terraform.
  • Experience with AWS and/or GCP cloud services.
  • Familiarity with observability tools and practices.
  • Automation mindset with scripting skills in Python or Bash.
  • Good communication skills, particularly in writing documentation and PRs.
  • Strong problem-solving skills and passion for relevant tools and technologies.

  • Raise the team's expertise in Kubernetes by mentoring and guiding colleagues.
  • Operate the platform for SaaS products used by thousands of businesses.
  • Define SLAs and SLOs and drive automation to meet them.
  • Write and review Terraform configurations and modules for Kubernetes.
  • Develop and maintain internal deployment Helm charts.
  • Respond to production incidents and support internal users.
  • Assist the pre-sales team in addressing customer questions about architecture and data security.
  • Review and discuss architecture changes involving new databases.
  • Enhance deployment processes using GitOps.
  • Participate in on-call rotations for incident support.

AWS, PostgreSQL, GCP, Kubernetes, MySQL, Go, Redis, CI/CD, Linux, Terraform

Posted 9 days ago

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

💸 109,047 - 169,455 USD per year

🔍 Nonprofit Organization, Technology

🏢 Company: Wikimedia Foundation

👥 251-500

💰 $2,100,000 Grant almost 5 years ago

  • At least two years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience supporting high availability distributed production systems.
  • Experience with database administration and support.
  • Knowledge of configuration management and orchestration tools (e.g., Puppet, Ansible).
  • Familiarity with observability infrastructure (monitoring, metrics, logging).
  • Proficient in shell and scripting languages (e.g., Python, Go, Bash, Ruby).
  • Understanding of Linux/Unix fundamentals and debugging skills.
  • Excellent written and verbal communication skills.
  • BS or MS degree in Computer Science or equivalent work experience.

  • Deployment, configuration, and maintenance of distributed data systems for the data and analytics platform.
  • Implement data quality monitoring to alert the team of possible data issues.
  • Collaborate with Fundraising to integrate data from various self-hosted and third-party sources.
  • Provide engineering support during high-traffic campaigns.
  • Document internal systems and processes.
  • Ensure compliance with relevant regulations, such as Donor Privacy Policy, GDPR, and PCI DSS.
  • Manage users and permissions for data access control.
  • Advise on best practices for data input and streamline processes.

Python, Bash, Ruby, Data engineering, Go, Communication Skills, Collaboration, Linux, DevOps, Documentation, Compliance

Posted 26 days ago

📍 US and Canada

🧭 Full-Time

💸 150,000 - 200,000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health

👥 51-100

💰 Seed almost 2 years ago

Medical, Wellness, Health Care

  • Bachelor's degree or Diploma in computer science, engineering, mathematics, or related field.
  • At least one year of experience as a Python developer transitioning to an SRE role.
  • Five years of experience in software development as a DevOps and/or SRE.
  • Two years of experience in an SRE role with Kubernetes, preferably GKE.
  • Experience using ArgoCD for rollouts and deployments.
  • One year of experience with a service mesh like Istio in a GKE environment.
  • Proficiency in scripting languages like Python and automation tools like Terraform.
  • Solid understanding of security best practices for pipelines and cloud environments.
  • Familiarity with compliance standards like SOC 2, HIPAA.
  • Strong expertise in CI/CD pipeline management.

  • Design and implement automated application deployment processes.
  • Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
  • Manage development, testing, staging, pre-production and production environments.
  • Automate repetitive deployment tasks to improve productivity.
  • Select, develop, and monitor CI/CD systems.
  • Oversee software automation across GCP.
  • Containerize services to optimize resources and deployment speed.
  • Manage and optimize cloud infrastructure for cost and performance.
  • Ensure compliance with security standards and maintain disaster recovery plans.
  • Collaborate with cross-functional teams to improve software delivery.

Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

Posted about 1 month ago

📍 Brazil, Portugal

🔍 Wellness

  • Bachelor's degree in computer science or equivalent professional experience;
  • Technical experience with AWS cloud services and software engineering;
  • Good knowledge of Kubernetes and its ecosystem;
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane;
  • Ability to write software for production environments;
  • Excellent analytical and problem-solving skills, with proven experience in identifying solutions for complex problems;
  • Collaborative, learning-driven mindset;
  • CNCF Kubernetes Certifications (e.g. CKA, CKS or CKAD);
  • AWS Certifications;
  • Well-developed communication skills, with the ability to articulate ideas clearly when communicating to groups;
  • Ability to communicate in English.

  • Help to build a global, secure, scalable and cost-effective cloud platform using Kubernetes in AWS;
  • Develop and evolve Kubernetes operators and other cloud-native automations in Kubernetes;
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously;
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations;
  • Improve observability, reliability and cost awareness;
  • Support engineering teams in using the products and tools;
  • Build and maintain a modern set of CI/CD tools and services;
  • Keep all the Kubernetes clusters highly available and reliable;
  • Contribute to our product documentation (e.g. user guides, configuration, operations and troubleshooting procedures);
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices;
  • Live the mission: inspire and empower others by genuinely caring for your own wellbeing and your colleagues.

AWS, Kubernetes, Communication Skills, Collaboration, CI/CD, Problem Solving

Posted 3 months ago

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

  • Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
  • Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
  • Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
  • Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
  • Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
  • Enthusiasm for mentoring and sponsoring less-experienced engineers.

  • Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
  • Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
  • Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
  • Work with peers on Webflow's Customer Support, Partnerships, and Sales teams to enable customers using Webflow's services in production.
  • Participate in and continuously improve on-call and incident response processes.

AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React

Posted 3 months ago

📍 North America

🧭 Full-Time

🔍 Incident Management Platform

🏢 Company: Rootly

👥 11-50

💰 $12,000,000 Series A over 1 year ago

Developer Tools, Developer Platform, Productivity Tools, SaaS, Information Technology, Software

  • You have 5+ years of experience in an SRE or Infrastructure Engineering role.
  • You have 5+ years of experience writing software as a SWE or in a software-heavy SRE role.
  • You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
  • You've supported web or RPC services at significant scale.
  • You have experience solving infrastructure problems by writing software.
  • You have a big-picture perspective on systems and tools.
  • You can collaborate with other Engineering teams to understand their systems and help to improve them.

  • Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
  • Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
  • Build tools to support our processes.
  • Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
  • Work with other teams around Engineering to understand their systems and their challenges at the code level and identify improvements in Rootly Infrastructure to improve the services they own (contribute code where possible).

AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD

Posted 4 months ago

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109,047 - 169,455 USD per year

🔍 Nonprofit / Technology

  • At least two years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience supporting high availability distributed production systems.
  • Experience with database administration and support.
  • Comfortable with configuration management and orchestration tools (e.g., Puppet, Ansible, Chef, SaltStack).
  • Knowledge of modern observability infrastructure (monitoring, metrics, and logging).
  • Proficient in shell and scripting languages such as Python, Go, Bash, Ruby.
  • Good understanding of Linux/Unix fundamentals and debugging skills.
  • Excellent written and verbal communication skills.
  • BS or MS degree in Computer Science or equivalent work experience.

  • Deploy, configure, and maintain the distributed data systems that comprise our data and analytics platform.
  • Implement data quality monitoring that alerts the team of possible data issues.
  • Collaborate closely with the Fundraising team to integrate and use data from self-hosted and third-party sources.
  • Provide engineering support during high-traffic or critical campaigns.
  • Write and update internal documentation of systems and processes.
  • Ensure compliance with regulations like the Donor Privacy Policy, GDPR, and PCI DSS.
  • Create and manage users and permissions for data access control.
  • Advise on data input best practices and develop processes for data entry consistency.
  • Work closely with Fundraising Analytics to gather and prioritize data enhancement requests.

Python, Bash, Ruby, Data engineering, Go, Communication Skills, Collaboration, C (Programming language)

Posted 4 months ago