Apply

Senior Site Reliability Engineer

Posted 2024-10-16

View full description

📍 Location: Canada

🔍 Industry: Software development for small businesses

🏢 Company: Jobber

🪄 Skills: AWSPythonBashRubyRuby on RailsReactCollaborationProblem SolvingTerraform

Requirements:
  • Demonstrated expertise in providing systems support within a cloud environment, preferably AWS and its various services.
  • Experience with IaC using Terraform.
  • Experience optimizing and improving continuous deployment performance.
  • Ability to juggle multiple projects and incident management.
  • Relevant experience in programming languages such as Ruby, Python, or Bash.
  • Deep passion for learning, reliability, automation, orchestration, and continuous improvement.
  • Strong commitment to problem-solving and keen interest in technology.
  • Exceptional interpersonal skills for collaboration in high-pressure situations.
Responsibilities:
  • Collaborate on the design, implementation, operation, and maintenance of AWS infrastructure.
  • Leverage Infrastructure-as-Code principles.
  • Develop and maintain local tooling for development and observability tools, including deployment tools and interfaces into AWS, CircleCI, and other infrastructure components.
  • Participate in on-call rotation and contribute to enhancing the on-call experience for the team.
Apply

Related Jobs

Apply

📍 Canada

🔍 Software Supply Chain Management

🏢 Company: FOSSA

  • Strong, demonstrated experience as a technical lead designing, building, and maintaining scalable infrastructure and tooling.
  • Strong knowledge of at least one cloud platform and maintaining managed services (we use AWS).
  • Strong experience implementing Infrastructure as Code using Terraform, Helm, and Kubernetes.
  • Experience building and maintaining build pipelines, deploying new services, and familiarity with CI/CD tools such as Buildkite, CircleCI, and GitHub Actions.
  • Experience with logging and monitoring tools such as Datadog, Statsd, Prometheus, Grafana.
  • Experience with packaging and deploying services using Docker on Linux.
  • Ability to break down complex problems, troubleshoot, drive towards a solution, and communicate it with the team and stakeholders.
  • Willingness to accept feedback and incorporate it into work.
  • Experience with source control tooling and processes, including branching, merging, and rebasing (we use git).
  • Willingness to take part in an on-call rotation.

  • Scale cloud infrastructure to meet increasing demand.
  • Assist development teams in deploying new services.
  • Ensure platform security and adherence to best practices.
  • Improve development tools, CI/CD pipelines, monitoring, and release processes.
  • Help teams use Helm and Kubernetes, and shape best practices.
  • Build access control and secret management solutions.
  • Maintain deployments for on-premise customers.

AWSDockerGitKubernetesGrafanaPrometheusCI/CDLinuxTerraform

Posted 2024-11-17
Apply
Apply

📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

  • 5+ years of experience with production environment running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.

  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

Linux

Posted 2024-11-09
Apply
Apply

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

  • Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
  • Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
  • Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
  • Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
  • Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
  • Enthusiasm for mentoring and sponsoring less-experienced engineers.

  • Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
  • Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
  • Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
  • Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
  • Participate in and continuously improve on-call and incident response processes.

AWSDockerPythonGCPKubernetesMongoDBGoReact

Posted 2024-09-19
Apply
Apply

📍 North America

🧭 Full-Time

🔍 Incident Management Platform

🏢 Company: Rootly

  • You have 5+ years of experience in an SRE or Infrastructure Engineering role.
  • 5+ years of experience writing software as a SWE or Software heavy SRE role.
  • You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
  • You’ve supported services at web or RPC services at a significant scale.
  • You have experience solving infrastructure problems by writing software.
  • You have a big-picture perspective on systems and tools.
  • You can collaborate with other Engineering teams to understand their systems and help to improve them.

  • Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
  • Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
  • Build tools to support our processes.
  • Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
  • Work with other teams around Engineering to understand their systems and their challenges at the code level and identify improvements in Rootly Infrastructure to improve the services they own (contribute code where possible).

AWSBackend DevelopmentSoftware DevelopmentCloud ComputingGitKubernetesAmazon Web ServicesCI/CD

Posted 2024-09-13
Apply
Apply

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, Technology

  • Proficient at automation/programming/scripting skills.
  • Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.) as well as modern observability infrastructure (Prometheus, Grafana, Logstash/Kibana, Icinga/Nagios, etc.).
  • Advanced knowledge of Linux and IO/data storage concepts, internals and troubleshooting.
  • Experience with managing remotely both bare-metal servers and virtualized environments.
  • 5+ years experience in an SRE/Operations/DevOps role as part of a team.
  • Experience with high traffic and highly available website architectures and operations.
  • Strong English language skills.
  • Ability to work independently in a fast paced environment, as an effective part of a globally distributed team, including ticket tracking systems and asynchronous communication tools.
  • B.Sc. or M.Sc. in Computer Science or equivalent work experience.

  • Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments.
  • Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution.
  • Improving observability (alerting, metrics, monitoring) of database infrastructure.
  • Multi-datacenter systems design, capacity and infrastructure planning.
  • Taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia's production infrastructure and participating in an on call rotation.

SQLKibanaC (Programming language)CassandraGrafanaPrometheusRedis

Posted 2024-08-28
Apply
Apply

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, knowledge sharing

  • Proficient at automation/programming/scripting skills
  • Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.)
  • Advanced knowledge of Linux and IO/data storage concepts
  • Experience with managing remotely both bare-metal servers and virtualized environments
  • 5+ years experience in an SRE/Operations/DevOps role
  • Experience with high traffic and highly available website architectures
  • Strong English language skills
  • Ability to work independently in a fast paced environment

  • Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments
  • Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution
  • Improving observability of database infrastructure
  • Designing multi-datacenter systems, capacity planning, and infrastructure planning
  • Participating in incident response and on-call rotation for system outages or alerts

SQLKibanaC (Programming language)CassandraGrafanaPrometheusRedisLinux

Posted 2024-08-28
Apply
Apply

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109047 - 169455 USD per year

🔍 Nonprofit / Technology

  • At least two years experience in an SRE/Operations/DevOps role as part of a team.
  • Experience supporting high availability distributed production systems.
  • Experience with database administration and support.
  • Comfortable with configuration management and orchestration tools (e.g., Puppet, Ansible, Chef, SaltStack).
  • Knowledge of modern observability infrastructure (monitoring, metrics, and logging).
  • Proficient in shell and scripting languages such as Python, Go, Bash, Ruby.
  • Good understanding of Linux/Unix fundamentals and debugging skills.
  • Excellent written and verbal communication skills.
  • BS or MS degree in Computer Science or equivalent work experience.

  • The Deployment, configuration and maintenance of the distributed data systems that comprise our data and analytics platform.
  • Implement data quality monitoring that alerts the team of possible data issues.
  • Collaborate closely with the Fundraising team to integrate and use data from self-hosted and third-party sources.
  • Provide engineering support during high-traffic or critical campaigns.
  • Write and update internal documentation of systems and processes.
  • Ensure compliance with regulations like the Donor Privacy Policy, GDPR, and PCI DSS.
  • Create and manage users and permissions for data access control.
  • Advise on data input best practices and develop processes for data entry consistency.
  • Work closely with Fundraising Analytics to gather and prioritize data enhancement requests.

PythonBashRubyC (Programming language)Data engineeringGoCommunication SkillsCollaboration

Posted 2024-08-22
Apply