Site Reliability Engineer

Posted 2024-11-08

💎 Seniority level: Senior, At least 5 years

📍 Location: Canada

🔍 Industry: Supply chain solutions

🏢 Company: Tecsys Inc.

🗣️ Languages: English

⏳ Experience: At least 5 years

🪄 Skills: AWS, Java, Jenkins, Azure, .NET, Communication Skills, Documentation, Compliance

Requirements:
  • Bachelor's degree in computer science or related technical discipline.
  • At least 5 years of experience in systems engineering, with experience in platform development, orchestration, product ownership, and iterative design and deployment.
  • Experience designing and deploying large-scale systems and multi-vendor platforms.
  • Strong knowledge of system design, high-performance computing, storage technologies, and integrating compute, storage, and network technologies.
  • Experience with full stack automation and reducing manual intervention.
  • Self-organized and collaborative, managing efforts across various teams and geographies.
  • Knowledge of Datadog and Rapid7 Insight preferred.
  • Knowledge and experience with AWS or Azure required.
  • Basic knowledge of Java or .NET-based development is necessary.
  • Knowledge of GitLab preferred; experience with Jenkins required at minimum.
  • Experience with SaaS companies is an asset, along with FedRAMP compliance experience.
  • Strong English communication skills, both written and spoken.
Responsibilities:
  • Collaborate with other Engineering teams to support services before they go live through system design consulting, software platform development, capacity planning, and launch reviews.
  • Maintain services post-launch by measuring availability, latency, and overall system health.
  • Develop tools & automation on Azure & AWS to reduce manual intervention.
  • Scale systems through automation and enhance reliability and velocity.
  • Participate in on-call rotation and conduct blameless postmortems for incident response.
  • Implement CI/CD solutions, monitoring, logging, alerting, and SLA reporting.
  • Create technical documentation and apply SRE best practices.
  • Take command of high-severity incidents and facilitate their resolution.
  • Support planning and deployment teams to enable stability and scale.
  • Work cross-functionally with internal teams and vendors.

Related Jobs

📍 Canada

🔍 Software Supply Chain Management

🏢 Company: FOSSA

Requirements:
  • Strong, demonstrated experience as a technical lead designing, building, and maintaining scalable infrastructure and tooling.
  • Strong knowledge of at least one cloud platform and maintaining managed services (we use AWS).
  • Strong experience implementing Infrastructure as Code using Terraform, Helm, and Kubernetes.
  • Experience building and maintaining build pipelines, deploying new services, and familiarity with CI/CD tools such as Buildkite, CircleCI, and GitHub Actions.
  • Experience with logging and monitoring tools such as Datadog, StatsD, Prometheus, and Grafana.
  • Experience with packaging and deploying services using Docker on Linux.
  • Ability to break down complex problems, troubleshoot, drive towards a solution, and communicate it with the team and stakeholders.
  • Willingness to accept feedback and incorporate it into work.
  • Experience with source control tooling and processes, including branching, merging, and rebasing (we use git).
  • Willingness to take part in an on-call rotation.

Responsibilities:
  • Scale cloud infrastructure to meet increasing demand.
  • Assist development teams in deploying new services.
  • Ensure platform security and adherence to best practices.
  • Improve development tools, CI/CD pipelines, monitoring, and release processes.
  • Help teams use Helm and Kubernetes, and shape best practices.
  • Build access control and secret management solutions.
  • Maintain deployments for on-premise customers.

AWS, Docker, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Terraform

Posted 2024-11-17

📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

Requirements:
  • 5+ years of experience with production environments running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.

Responsibilities:
  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

Linux

Posted 2024-11-09

📍 Canada, United States

🔍 Cyber Security

🏢 Company: BeyondTrust

Requirements:
  • Experience in designing and building enterprise-ready cloud-native platforms, with a passion for researching and managing solutions.
  • High standards with continuous improvement towards high-quality products, services, and processes.
  • Ability to simplify complexity and empower development teams.
  • Decision-making based on data with a focus on balancing speed and risk.
  • Understanding of the importance of observability and metric dashboards.
  • Technical familiarity with AWS Cloud Resources (S3, EC2, EKS, RDS, etc.), Service Mesh (Istio), Infrastructure as Code (Terraform, AWS CDK), and Continuous Delivery tools (ArgoCD, GitHub Actions).

Responsibilities:
  • Define a platform for engineering teams to utilize automated, self-service, scalable, efficient, observable, and reliable infrastructure services as a product.
  • Design long-term technical solutions and cross-team mechanisms to achieve reliability goals.
  • Provide expert technical guidance and feedback during engineering design reviews using observability tools.
  • Deliver common, reusable tools, capabilities, and interfaces to the cloud platform solution.
  • Collaborate with SREs and senior engineers on best practices.
  • Align and help drive execution of the Platform Infrastructure team’s strategy.
  • Reduce toil through automation.

AWS, Leadership, Amazon RDS, AWS EKS, Cloud Computing, Elasticsearch, Amazon Web Services, CI/CD

Posted 2024-10-21

📍 EMEA, APAC, AMER

🔍 DevSecOps

🏢 Company: GitLab

Requirements:
  • Advanced datastore platform management experience, preferably using Postgres at scale.
  • Advanced Cloud Infrastructure management, preferably using GCP.
  • Advanced experience with Linux.
  • Solid experience with infrastructure and database automation using Terraform.
  • Experience with orchestration tools like Chef and/or Ansible.
  • Experience implementing monitoring at scale using Prometheus and Grafana.
  • Ability to promote GitLab's CREDIT values in work.
  • Superior verbal and written communication skills.
  • Comfortable working asynchronously across timezones.

Responsibilities:
  • Build: Automate operational tasks like package updates and configuration changes.
  • Maintain: Develop systems for reliable maintenance tasks like library upgrades.
  • Plan: Create monitoring systems to predict capacity needs.
  • Respond: Address user emergencies and support requests.
  • Enhance: Update security measures for GitLab's infrastructure.
  • Partner: Collaborate with internal teams on compliance assessments and improvements.
  • Collaborate: Work with software teams to resolve architectural issues.

PostgreSQL, Software Development, GCP, Grafana, Postgres, Prometheus, Communication Skills, Collaboration

Posted 2024-10-16

📍 APAC, EMEA, AMER

🔍 DevSecOps Software

🏢 Company: GitLab

Requirements:
  • Advanced datastore platform management experience, preferably using Postgres at scale.
  • Advanced Cloud Infrastructure management, preferably using GCP.
  • Advanced experience with Linux.
  • Solid experience with automation including developing infrastructure and database automations.
  • Experience with Terraform for automation.
  • Experience with orchestration tools like Chef and/or Ansible.
  • Solid experience implementing monitoring at scale, preferably using Prometheus and Grafana.
  • Willingness and ability to promote GitLab's CREDIT Values.
  • Superior verbal and written communication skills.
  • Ability to work asynchronously across timezones and cultures.

Responsibilities:
  • Build, run, and own the entire lifecycle of the PostgreSQL database engine for GitLab.com.
  • Automate operational tasks including package updates and configuration changes.
  • Develop warning systems for maintenance tasks like library upgrades.
  • Create monitoring and alerting systems to predict capacity needs.
  • Respond to user emergencies and support requests.
  • Implement and enhance security measures for GitLab infrastructure.
  • Partner with compliance assessors for regulatory certifications.
  • Collaborate with engineering teams to resolve architectural bottlenecks.

PostgreSQL, Software Development, GCP, Grafana, Postgres, Prometheus, Communication Skills, Collaboration, Terraform

Posted 2024-10-16

📍 Canada

🔍 Software development for small businesses

🏢 Company: Jobber

Requirements:
  • Demonstrated expertise in providing systems support within a cloud environment, preferably AWS and its various services.
  • Experience with IaC using Terraform.
  • Experience optimizing and improving continuous deployment performance.
  • Ability to juggle multiple projects and incident management.
  • Relevant experience in programming languages such as Ruby, Python, or Bash.
  • Deep passion for learning, reliability, automation, orchestration, and continuous improvement.
  • Strong commitment to problem-solving and keen interest in technology.
  • Exceptional interpersonal skills for collaboration in high-pressure situations.

Responsibilities:
  • Collaborate on the design, implementation, operation, and maintenance of AWS infrastructure.
  • Leverage Infrastructure-as-Code principles.
  • Develop and maintain local tooling for development and observability tools, including deployment tools and interfaces into AWS, CircleCI, and other infrastructure components.
  • Participate in on-call rotation and contribute to enhancing the on-call experience for the team.

AWS, Python, Bash, Ruby, Ruby on Rails, React, Collaboration, Problem Solving, Terraform

Posted 2024-10-16

📍 Australia, Mexico, Italy, Nigeria, Canada, USA

🧭 Full-Time

💸 167471 USD per year

🔍 Software

🏢 Company: Float.com

Requirements:
  • A senior-level understanding of how SRE operates as an enabling team.
  • A very good understanding of Service Level Objectives (SLOs).
  • Extensive knowledge of Kafka administration.
  • Working experience with Terraform, Bash, and a programming language (ideally PHP, NodeJS, or Python).
  • Experience with Kubernetes and Google Cloud Platform (GCP) is highly valued.
  • Previous remote experience and comfort with asynchronous communication tools like Slack, Loom, and Linear.

Responsibilities:
  • Continue supporting the regular maintenance of all the engineering systems.
  • Identify areas requiring support to scale.
  • Improve service resilience within product and engineering teams.
  • Optimize monitoring and observability stack with standardized tools and configurations.
  • Understand Float’s SLOs and build SLO patterns and procedures.
  • Build a disaster recovery program using chaos engineering techniques.
  • Migrate deployment configurations to a global single source of truth.
  • Expand infrastructure across multiple regions.

PHP, Python, Bash, GCP, Kafka, Kubernetes, Terraform

Posted 2024-09-20

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

Requirements:
  • Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
  • Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
  • Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
  • Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
  • Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
  • Enthusiasm for mentoring and sponsoring less-experienced engineers.

Responsibilities:
  • Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
  • Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
  • Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
  • Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
  • Participate in and continuously improve on-call and incident response processes.

AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React

Posted 2024-09-19

📍 North America

🧭 Full-Time

🔍 Incident Management Platform

🏢 Company: Rootly

Requirements:
  • You have 5+ years of experience in an SRE or Infrastructure Engineering role.
  • You have 5+ years of experience writing software as a SWE or in a software-heavy SRE role.
  • You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
  • You’ve supported web or RPC services at a significant scale.
  • You have experience solving infrastructure problems by writing software.
  • You have a big-picture perspective on systems and tools.
  • You can collaborate with other Engineering teams to understand their systems and help to improve them.

Responsibilities:
  • Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
  • Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
  • Build tools to support our processes.
  • Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
  • Work with other teams around Engineering to understand their systems and their challenges at the code level and identify improvements in Rootly Infrastructure to improve the services they own (contribute code where possible).

AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD

Posted 2024-09-13

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, Technology

Requirements:
  • Proficient in automation, programming, and scripting.
  • Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.) as well as modern observability infrastructure (Prometheus, Grafana, Logstash/Kibana, Icinga/Nagios, etc.).
  • Advanced knowledge of Linux and IO/data storage concepts, internals and troubleshooting.
  • Experience remotely managing both bare-metal servers and virtualized environments.
  • 5+ years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience with high traffic and highly available website architectures and operations.
  • Strong English language skills.
  • Ability to work independently in a fast-paced environment as an effective part of a globally distributed team, using ticket tracking systems and asynchronous communication tools.
  • B.Sc. or M.Sc. in Computer Science or equivalent work experience.

Responsibilities:
  • Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments.
  • Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution.
  • Improving observability (alerting, metrics, monitoring) of database infrastructure.
  • Multi-datacenter systems design, capacity and infrastructure planning.
  • Taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia's production infrastructure and participating in an on call rotation.

SQL, Kibana, C (Programming language), Cassandra, Grafana, Prometheus, Redis

Posted 2024-08-28