Site Reliability Engineer

Posted 2 months ago

💎 Seniority level: Senior, At least 5 years

📍 Location: Canada

🔍 Industry: Supply chain solutions

🏒 Company: Tecsys Inc.

🗣️ Languages: English

⏳ Experience: At least 5 years

🪄 Skills: AWS, Java, Jenkins, Azure, .NET, Communication Skills, Documentation, Compliance

Requirements:
  • Bachelor's degree in computer science or related technical discipline.
  • At least 5 years of experience in systems engineering, with experience in platform development, orchestration, product ownership, and iterative design and deployment.
  • Experience designing and deploying large-scale systems and multi-vendor platforms.
  • Strong knowledge of system design, high-performance computing, storage technologies, and integrating compute, storage, and network technologies.
  • Experience with full stack automation and reducing manual intervention.
  • Self-organized and collaborative, managing efforts across various teams and geographies.
  • Knowledge of Datadog and Rapid7 Insight preferred.
  • Knowledge and experience with AWS or Azure required.
  • Basic knowledge of Java or .NET-based development is necessary.
  • Knowledge of GitLab preferred; Jenkins experience required at minimum.
  • Experience with SaaS companies is an asset, along with FedRAMP compliance experience.
  • Strong English communication skills, both written and spoken.
Responsibilities:
  • Collaborate with other Engineering teams to support services before they go live through system design consulting, software platform development, capacity planning, and launch reviews.
  • Maintain services post-launch by measuring availability, latency, and overall system health.
  • Develop tools & automation on Azure & AWS to reduce manual intervention.
  • Scale systems through automation and enhance reliability and velocity.
  • Participate in on-call rotation and conduct blameless postmortems for incident response.
  • Implement CI/CD solutions, monitoring, logging, alerting, and SLA reporting.
  • Create technical documentation and apply SRE best practices.
  • Take command of high-severity incidents and facilitate their resolution.
  • Support planning and deployment teams to enable stability and scale.
  • Work cross-functionally with internal teams and vendors.
Related Jobs

📍 United States, Canada

🧭 Full-Time

💸 108,000 - 163,900 USD per year

🔍 Active Insurance, Digital Risk Management

🏒 Company: Coalition, Inc.

  • 3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles in a full stack engineering environment.
  • Strong understanding of AWS services (e.g., EC2, S3, RDS, Lambda, VPC).
  • Hands-on experience with IaC tools like Terraform, CloudFormation, or CDK.
  • Experience with containerization and orchestration tools such as ECS and Kubernetes.
  • Experience working with fault-tolerant services and developing highly available systems.
  • Exposure to full-stack monitoring and CI/CD pipelines.
  • Some knowledge of software engineering design patterns, agile development, and architecture principles.
  • Strong analytical and problem-solving skills.

  • Play a pivotal role in ensuring the performance, availability, and efficiency of cloud-based systems.
  • Design, implement, and manage robust cloud solutions.
  • Automate infrastructure and build developer-friendly platforms.
  • Optimize cloud resources and improve system observability.
  • Drive operational excellence across the organization.
  • Participate in a low-volume on-call rotation to maintain system reliability.

AWS, Docker, Python, Kubernetes, Go, CI/CD, Terraform

Posted 17 days ago

📍 United States, Canada

💸 108,000 - 163,900 USD per year

🔍 Insurance, Cybersecurity

  • 3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles.
  • Strong understanding of AWS services and best practices.
  • Hands-on experience with IaC tools like Terraform or CloudFormation.
  • Experience with containerization tools such as ECS or Kubernetes.
  • Exposure to full-stack monitoring and CI/CD pipelines.
  • Strong analytical and problem-solving skills.

  • Design, implement, and manage robust cloud solutions.
  • Work closely with cross-functional teams.
  • Isolate, trap, and respond to system failures.
  • Develop strategies for continuous monitoring and analysis.
  • Participate in a low-volume on-call rotation to maintain reliability.

AWS, Docker, Python, Java, Kafka, Kubernetes, Go, CI/CD, Terraform

Posted 18 days ago

📍 United States, Canada

🔍 Active Insurance, Digital Risk Management

  • 3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles.
  • Strong understanding of AWS services and best practices for building scalable and secure infrastructure.
  • Experience with IaC tools like Terraform, CloudFormation, or CDK.
  • Hands-on experience with containerization and orchestration tools such as ECS or Kubernetes.
  • Experience with fault-tolerant services and highly available systems.
  • Understanding of CI/CD pipelines and security auditability.

  • Play a pivotal role in ensuring performance, availability, and efficiency of cloud-based systems.
  • Design, implement, and manage robust cloud solutions.
  • Automate infrastructure and build developer-friendly platforms.
  • Participate in a low-volume on-call rotation to maintain system reliability.
  • Develop strategies for continuous monitoring and minimize downtime.

AWS, Python, Kafka, Kubernetes, Go, CI/CD, Terraform

Posted 18 days ago

📍 Canada

🧭 Full-Time

🔍 Observability and data management

🏒 Company: Cribl

👥 251-500

💰 $150,000,000 Series D over 2 years ago

Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise-scale continuous delivery environments.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible.
  • Knowledge of cloud platforms (prefer AWS and Azure, GCP is nice to have) and container + orchestration technologies.
  • Extensive experience designing and implementing observability platforms based on open-source tools like Grafana, Prometheus, and OpenSearch.
  • Experience mentoring engineers and acting as Subject Matter Expert in areas of Monitoring and Observability.
  • Experience with native monitoring services in AWS, Azure and other popular Cloud Platforms.
  • Background in Linux Systems Engineering.
  • Experience with Incident response tools, e.g., PagerDuty, FireHydrant.
  • Experience with sustainable incident response in a blameless environment.
  • Comfortable with a high level of autonomy and working with a distributed team.

  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
  • Design observability systems for different types of applications, using Cribl products and other open-source tools.
  • Seek out the cause of errors and instability in production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Lead efforts enabling shift-left monitoring in the organization.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

AWS, Docker, Node.js, GCP, JavaScript, TypeScript, Azure, Grafana, Prometheus, Linux, Terraform

Posted 28 days ago

📍 Canada

🔍 Financial Technology and HR

🏒 Company: ZayZoon

👥 51-100

💰 $11,042,809 Series B 10 months ago

Financial Services, FinTech, Software

  • 5+ years infrastructure experience.
  • 2+ years AWS experience including certification and deployment of production applications.
  • Proficiency with Infrastructure as Code (IaC), specifically CloudFormation.
  • Experience with containerization technologies such as Docker, ECS, and ECR.
  • Experience analyzing and addressing performance issues using observability platforms like Datadog, New Relic, and OpenTelemetry (OTel).
  • Strong SQL and data analysis skills with problem-solving focus.

  • Develop and maintain infrastructure-as-code CloudFormation templates, emphasizing serverless resources.
  • Instrument and analyze daily metrics of both infrastructure performance and applications using AWS tools and third-party platforms.
  • Manage deployment pipelines including blue/green deployment and auto-scaling.
  • Maintain resource dependencies, especially databases, including updates and downtime planning.
  • Project costs and implement cost-saving programs in AWS.
  • Collaborate with risk and security teams to ensure compliance.
  • Work closely with app and data engineers on shared metrics and load testing.
  • Participate in the agile development process.

AWS, Docker, SQL, Data Analysis, ETL, CI/CD

Posted 29 days ago

📍 Canada, Chile

🔍 Technology

🏒 Company: Launchpad Technologies

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
  • Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or similar roles.
  • Familiarity with monitoring tools and systems.
  • Proficient in scripting languages such as Python, Bash, or Ruby.
  • Experience with infrastructure automation tools such as Terraform, Ansible, or Chef.
  • Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
  • Strong knowledge of cloud platforms such as AWS, GCP, or Azure.
  • Excellent troubleshooting and analytical skills.
  • Strong communication skills and the ability to work effectively within a team.

  • Develop, maintain, and improve automated deployment, certification, and validation pipelines.
  • Define, implement, and monitor service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs).
  • Lead efforts to optimize, improve, and maintain the reliability and performance of the SaaS platform.
  • Manage third-party services and technologies used to support the SRE discipline.
  • Collaborate with senior management and the engineering team to lead SRE initiatives and provide updates.
  • Define and implement an observability framework to provide insights into system performance and behavior.
  • Implement proactive monitors and alerts to ensure system reliability and performance meet customer expectations.
  • Own operational incident management, providing support to related teams and individuals during incident resolution.
  • Identify and implement best practices for system reliability, security, scalability, and performance.
  • Participate in on-call rotations for system support, troubleshooting, and resolution.
  • Conduct post-mortem reviews of incidents, identify root cause, and implement remediation steps.
  • Develop and maintain documentation for systems, processes, and procedures.

AWS, Docker, Leadership, Python, Bash, GCP, Kubernetes, Ruby, Azure, Communication Skills, Analytical Skills, DevOps, Terraform, Documentation, Compliance, Troubleshooting

Posted about 2 months ago

📍 US and Canada

🧭 Full-Time

💸 150,000 - 200,000 USD per year

🔍 Healthcare

🏒 Company: Synthesis Health

👥 51-100

💰 Seed about 2 years ago

Medical, Wellness, Health Care

  • Bachelor's degree or Diploma in computer science, engineering, mathematics, or related field.
  • At least one year of experience as a Python developer transitioning to an SRE role.
  • Five years of experience in software development as a DevOps and/or SRE.
  • Two years of experience in an SRE role with Kubernetes, preferably GKE.
  • Experience using ArgoCD for rollouts and deployments.
  • One year of experience with a service mesh like Istio in a GKE environment.
  • Proficiency in scripting languages like Python and automation tools like Terraform.
  • Solid understanding of security best practices for pipelines and cloud environments.
  • Familiarity with compliance standards like SOC 2, HIPAA.
  • Strong expertise in CI/CD pipeline management.

  • Design and implement automated application deployment processes.
  • Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
  • Manage development, testing, staging, pre-production and production environments.
  • Automate repetitive deployment tasks to improve productivity.
  • Select, develop, and monitor CI/CD systems.
  • Oversee software automation across GCP.
  • Containerize services to optimize resources and deployment speed.
  • Manage and optimize cloud infrastructure for cost and performance.
  • Ensure compliance with security standards and maintain disaster recovery plans.
  • Collaborate with cross-functional teams to improve software delivery.

Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

Posted about 2 months ago

📍 EMEA, APAC, AMER

🔍 DevSecOps

🏒 Company: GitLab

👥 1001-5000

💰 $268,000,000 Series E over 5 years ago

🫂 Last layoff almost 2 years ago

Developer Tools, DevOps, Open Source, SaaS, Cloud Security

  • Advanced datastore platform management experience, preferably using Postgres at scale.
  • Advanced Cloud Infrastructure management, preferably using GCP.
  • Advanced experience with Linux.
  • Solid experience with infrastructure and database automation using Terraform.
  • Experience with orchestration tools like Chef and/or Ansible.
  • Experience implementing monitoring at scale using Prometheus and Grafana.
  • Ability to promote GitLab's CREDIT values in work.
  • Superior verbal and written communication skills.
  • Comfortable working asynchronously across timezones.

  • Build: Automate operational tasks like package updates and configuration changes.
  • Maintain: Develop systems for reliable maintenance tasks like library upgrades.
  • Plan: Create monitoring systems to predict capacity needs.
  • Respond: Address user emergencies and support requests.
  • Enhance: Update security measures for GitLab's infrastructure.
  • Partner: Collaborate with internal teams on compliance assessments and improvements.
  • Collaborate: Work with software teams to resolve architectural issues.

PostgreSQL, Software Development, GCP, Grafana, Postgres, Prometheus, Communication Skills, Collaboration

Posted 3 months ago

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

  • Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
  • Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
  • Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
  • Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
  • Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
  • Enthusiasm for mentoring and sponsoring less-experienced engineers.

  • Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
  • Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
  • Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
  • Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
  • Participate in and continuously improve on-call and incident response processes.

AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React

Posted 4 months ago

📍 North America

🧭 Full-Time

🔍 Incident Management Platform

🏒 Company: Rootly

👥 11-50

💰 $12,000,000 Series A over 1 year ago

Developer Tools, Developer Platform, Productivity Tools, SaaS, Information Technology, Software

  • You have 5+ years of experience in an SRE or Infrastructure Engineering role.
  • You have 5+ years of experience writing software as a SWE or in a software-heavy SRE role.
  • You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
  • You’ve supported web or RPC services at a significant scale.
  • You have experience solving infrastructure problems by writing software.
  • You have a big-picture perspective on systems and tools.
  • You can collaborate with other Engineering teams to understand their systems and help to improve them.

  • Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
  • Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
  • Build tools to support our processes.
  • Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
  • Work with other teams around Engineering to understand their systems and their challenges at the code level and identify improvements in Rootly Infrastructure to improve the services they own (contribute code where possible).

AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD

Posted 4 months ago