Senior Site Reliability Engineer

Posted 2024-11-17

💎 Seniority level: Senior

📍 Location: Canada

🔍 Industry: Software Supply Chain Management

🏢 Company: FOSSA

🗣️ Languages: English

🪄 Skills: AWS, Docker, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Terraform

Requirements:

Strong, demonstrated experience as a technical lead designing, building, and maintaining scalable infrastructure and tooling.
Strong knowledge of at least one cloud platform and maintaining managed services (we use AWS).
Strong experience implementing Infrastructure as Code using Terraform, Helm, and Kubernetes.
Experience building and maintaining build pipelines, deploying new services, and familiarity with CI/CD tools such as Buildkite, CircleCI, and GitHub Actions.
Experience with logging and monitoring tools such as Datadog, StatsD, Prometheus, and Grafana.
Experience with packaging and deploying services using Docker on Linux.
Ability to break down complex problems, troubleshoot, drive towards a solution, and communicate it with the team and stakeholders.
Willingness to accept feedback and incorporate it into work.
Experience with source control tooling and processes, including branching, merging, and rebasing (we use Git).
Willingness to take part in an on-call rotation.

Responsibilities:

Scale cloud infrastructure to meet increasing demand.
Assist development teams in deploying new services.
Ensure platform security and adherence to best practices.
Improve development tools, CI/CD pipelines, monitoring, and release processes.
Help teams use Helm and Kubernetes, and shape best practices.
Build access control and secret management solutions.
Maintain deployments for on-premise customers.

Related Jobs

🔥 Senior Site Reliability Engineer, Data Engineering

Posted 2024-12-03

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

💸 109047 - 169455 USD per year

🔍 Nonprofit Organization, Technology

🏢 Company: Wikimedia Foundation

🔧 Requirements

At least two years of experience in an SRE/Operations/DevOps role as part of a team.
Experience supporting high availability distributed production systems.
Experience with database administration and support.
Knowledge of configuration management and orchestration tools (e.g., Puppet, Ansible).
Familiarity with observability infrastructure (monitoring, metrics, logging).
Proficient in shell and scripting languages (e.g., Python, Go, Bash, Ruby).
Understanding of Linux/Unix fundamentals and debugging skills.
Excellent written and verbal communication skills.
BS or MS degree in Computer Science or equivalent work experience.

💡 Responsibilities

Deployment, configuration, and maintenance of distributed data systems for the data and analytics platform.
Implement data quality monitoring to alert the team of possible data issues.
Collaborate with Fundraising to integrate data from various self-hosted and third-party sources.
Provide engineering support during high-traffic campaigns.
Document internal systems and processes.
Ensure compliance with relevant regulations, such as Donor Privacy Policy, GDPR, and PCI DSS.
Manage users and permissions for data access control.
Advise on best practices for data input and streamline processes.

🪄 Skills: Python, Bash, Ruby, Data engineering, Go, Communication Skills, Collaboration, Linux, DevOps, Documentation, Compliance

🔥 Senior Site Reliability Engineer

Posted 2024-11-24

📍 US and Canada

🧭 Full-Time

💸 150000 - 200000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health

🔧 Requirements

Bachelor's degree or Diploma in computer science, engineering, mathematics, or related field.
At least one year of experience as a Python developer transitioning to an SRE role.
Five years of experience in software development as a DevOps and/or SRE.
Two years of experience in an SRE role with Kubernetes, preferably GKE.
Experience using ArgoCD for rollouts and deployments.
One year of experience with a service mesh such as Istio in a GKE environment.
Proficiency in scripting languages like Python and automation tools like Terraform.
Solid understanding of security best practices for pipelines and cloud environments.
Familiarity with compliance standards like SOC 2, HIPAA.
Strong expertise in CI/CD pipeline management.

💡 Responsibilities

Design and implement automated application deployment processes.
Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
Manage development, testing, staging, pre-production and production environments.
Automate repetitive deployment tasks to improve productivity.
Select, develop, and monitor CI/CD systems.
Oversee software automation across GCP.
Containerize services to optimize resources and deployment speed.
Manage and optimize cloud infrastructure for cost and performance.
Ensure compliance with security standards and maintain disaster recovery plans.
Collaborate with cross-functional teams to improve software delivery.

🪄 Skills: Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

🔥 Senior Site Reliability Engineer - US/Canada

Posted 2024-11-09

📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

🔧 Requirements

5+ years of experience with production environments running Linux.
3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
Familiarity with big data technologies such as Spark and/or Flink.
Passion for automating tasks through coding and scripting.
Experience with algorithms, data structures, complexity analysis, and software design.
Proficient coding skills in Python, Java, and Bash.

💡 Responsibilities

Design, implement, and maintain release automation pipelines to streamline the deployment process.
Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
Collaborate with engineering teams to enhance system performance and manage capacity planning.

🪄 Skills: Linux

🔥 Senior Site Reliability Engineer

Posted 2024-10-16

📍 Canada

🔍 Software development for small businesses

🏢 Company: Jobber

🔧 Requirements

Demonstrated expertise in providing systems support within a cloud environment, preferably AWS and its various services.
Experience with IaC using Terraform.
Experience optimizing and improving continuous deployment performance.
Ability to juggle multiple projects and incident management.
Relevant experience in programming languages such as Ruby, Python, or Bash.
Deep passion for learning, reliability, automation, orchestration, and continuous improvement.
Strong commitment to problem-solving and keen interest in technology.
Exceptional interpersonal skills for collaboration in high-pressure situations.

💡 Responsibilities

Collaborate on the design, implementation, operation, and maintenance of AWS infrastructure.
Leverage Infrastructure-as-Code principles.
Develop and maintain local tooling for development and observability tools, including deployment tools and interfaces into AWS, CircleCI, and other infrastructure components.
Participate in on-call rotation and contribute to enhancing the on-call experience for the team.

🪄 Skills: AWS, Python, Bash, Ruby, Ruby on Rails, React, Collaboration, Problem Solving, Terraform

🔥 Senior Site Reliability Engineer

Posted 2024-09-19

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

🔧 Requirements

Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
Enthusiasm for mentoring and sponsoring less-experienced engineers.

💡 Responsibilities

Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
Participate in and continuously improve on-call and incident response processes.

🪄 Skills: AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React

🔥 Senior Site Reliability Engineer

Posted 2024-09-13

📍 North America

🧭 Full-Time

🔍 Incident Management Platform

🏢 Company: Rootly

🔧 Requirements

You have 5+ years of experience in an SRE or Infrastructure Engineering role.
You have 5+ years of experience writing software as a SWE or in a software-heavy SRE role.
You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
You’ve supported web or RPC services at significant scale.
You have experience solving infrastructure problems by writing software.
You have a big-picture perspective on systems and tools.
You can collaborate with other Engineering teams to understand their systems and help to improve them.

💡 Responsibilities

Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
Build tools to support our processes.
Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
Work with other teams around Engineering to understand their systems and their challenges at the code level and identify improvements in Rootly Infrastructure to improve the services they own (contribute code where possible).

🪄 Skills: AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD

🔥 Senior Site Reliability Engineer, Databases

Posted 2024-08-28

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, Technology

🔧 Requirements

Proficient in automation, programming, and scripting.
Experience with open source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.) as well as modern observability infrastructure (Prometheus, Grafana, Logstash/Kibana, Icinga/Nagios, etc.).
Advanced knowledge of Linux and IO/data storage concepts, internals, and troubleshooting.
Experience remotely managing both bare-metal servers and virtualized environments.
5+ years experience in an SRE/Operations/DevOps role as part of a team.
Experience with high traffic and highly available website architectures and operations.
Strong English language skills.
Ability to work independently in a fast-paced environment as an effective part of a globally distributed team, including the use of ticket tracking systems and asynchronous communication tools.
B.Sc. or M.Sc. in Computer Science or equivalent work experience.

💡 Responsibilities

Operation, maintenance, troubleshooting, and automation of relational database systems in production and staging environments.
Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution.
Improving observability (alerting, metrics, monitoring) of database infrastructure.
Multi-datacenter systems design, capacity and infrastructure planning.
Taking part in incident response, diagnosis, and follow-up on system outages or alerts across Wikimedia's production infrastructure, and participating in an on-call rotation.

🪄 Skills: SQL, Kibana, C (Programming language), Cassandra, Grafana, Prometheus, Redis