Senior Site Reliability Engineer

Posted 2024-09-13

💎 Seniority level: Senior, 5+ years

📍 Location: North America

🔍 Industry: Incident Management Platform

🏢 Company: Rootly

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD

Requirements:
  • You have 5+ years of experience in an SRE or Infrastructure Engineering role.
  • You have 5+ years of experience writing software as a SWE or in a software-heavy SRE role.
  • You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
  • You’ve supported web or RPC services at significant scale.
  • You have experience solving infrastructure problems by writing software.
  • You have a big-picture perspective on systems and tools.
  • You can collaborate with other Engineering teams to understand their systems and help to improve them.
Responsibilities:
  • Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
  • Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
  • Build tools to support our processes.
  • Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
  • Work with teams across Engineering to understand their systems and code-level challenges, identify improvements to Rootly infrastructure that benefit the services they own, and contribute code where possible.

Related Jobs

📍 US and Canada

🧭 Full-Time

💸 150000 - 200000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health

Requirements:
  • Bachelor's degree or Diploma in computer science, engineering, mathematics, or related field.
  • At least one year of experience as a Python developer transitioning to an SRE role.
  • Five years of experience in software development as a DevOps and/or SRE.
  • Two years of experience in an SRE role with Kubernetes, preferably GKE.
  • Experience using ArgoCD for rollouts and deployments.
  • One year experience with service mesh like Istio in a GKE environment.
  • Proficiency in scripting languages like Python and automation tools like Terraform.
  • Solid understanding of security best practices for pipelines and cloud environments.
  • Familiarity with compliance standards like SOC 2, HIPAA.
  • Strong expertise in CI/CD pipeline management.

Responsibilities:
  • Design and implement automated application deployment processes.
  • Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
  • Manage development, testing, staging, pre-production and production environments.
  • Automate repetitive deployment tasks to improve productivity.
  • Select, develop, and monitor CI/CD systems.
  • Oversee software automation across GCP.
  • Containerize services to optimize resources and deployment speed.
  • Manage and optimize cloud infrastructure for cost and performance.
  • Ensure compliance with security standards and maintain disaster recovery plans.
  • Collaborate with cross-functional teams to improve software delivery.

🪄 Skills: Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer Service, DevOps, Terraform, Organizational Skills, Documentation, Compliance

Posted 2024-11-24

📍 United States

🧭 Full-Time

🔍 Legal technology

🏢 Company: Ramp Talent

Requirements:
  • Curiosity, willingness to learn, and passion for continuous improvement.
  • Proficiency in all skills expected of an SRE II.
  • Bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience.
  • A minimum of 8 years of experience in hands-on technical roles.
  • A minimum of 2 years of Site Reliability Engineering experience.
  • Experience building autonomous systems that manage software operational details without human intervention.

Responsibilities:
  • Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products.
  • Being the voice of Reliability on your team throughout the SDLC.
  • Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events.
  • Improving the CI/CD pipeline.
  • Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time.
  • Identifying and fixing gaps in the availability of systems.
  • Improving and defending the security of software and systems.
  • Documenting and diagramming processes, procedures, and best practices.
  • Finding, learning, improving, or creating new tools that are reliable, usable, and helpful.
  • Mentoring, training, and reviewing more junior engineers.
  • Participating in an on-call rotation for 24/7 production reliability support.

🪄 Skills: Leadership, CI/CD, Mentoring

Posted 2024-11-21

📍 Canada

🔍 Software Supply Chain Management

🏢 Company: FOSSA

Requirements:
  • Strong, demonstrated experience as a technical lead designing, building, and maintaining scalable infrastructure and tooling.
  • Strong knowledge of at least one cloud platform and maintaining managed services (we use AWS).
  • Strong experience implementing Infrastructure as Code using Terraform, Helm, and Kubernetes.
  • Experience building and maintaining build pipelines, deploying new services, and familiarity with CI/CD tools such as Buildkite, CircleCI, and GitHub Actions.
  • Experience with logging and monitoring tools such as Datadog, StatsD, Prometheus, and Grafana.
  • Experience with packaging and deploying services using Docker on Linux.
  • Ability to break down complex problems, troubleshoot, drive towards a solution, and communicate it with the team and stakeholders.
  • Willingness to accept feedback and incorporate it into work.
  • Experience with source control tooling and processes, including branching, merging, and rebasing (we use git).
  • Willingness to take part in an on-call rotation.

Responsibilities:
  • Scale cloud infrastructure to meet increasing demand.
  • Assist development teams in deploying new services.
  • Ensure platform security and adherence to best practices.
  • Improve development tools, CI/CD pipelines, monitoring, and release processes.
  • Help teams use Helm and Kubernetes, and shape best practices.
  • Build access control and secret management solutions.
  • Maintain deployments for on-premises customers.

🪄 Skills: AWS, Docker, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Terraform

Posted 2024-11-17

📍 U.S.

🧭 Full-Time

💸 140000 - 160000 USD per year

🔍 Cybersecurity / Open source software

Requirements:
  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise with multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency with at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi).
  • Proficiency using source control such as Git.
  • Ability to maintain discretion and handle sensitive information.
  • Staying current with trends and new technologies.
  • Collaborative and adaptable mindset.
  • Excellent communication skills.
  • Strong problem-solving skills.

Responsibilities:
  • Take ownership of the Bitwarden cloud infrastructure, focusing on quality.
  • Evaluate infrastructure regularly, making recommendations for reliability, security, availability, scalability, and cost management.
  • Implement site reliability tools and observability systems.
  • Respond to outages and participate in a 24x7 support strategy.
  • Contribute to architectural designs and engineering operations at scale.
  • Engage in code reviews and spread technical knowledge.
  • Contribute to incident management processes.
  • Collaborate with teams to refine priorities and deliverables.
  • Align SLIs, SLOs, and SLAs with product owners.
  • Identify opportunities for new initiatives.
  • Influence the SDLC as Bitwarden scales.
  • Mentor team members.

🪄 Skills: Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Posted 2024-11-16

📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

Requirements:
  • 5+ years of experience with production environments running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.

Responsibilities:
  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

🪄 Skills: Linux

Posted 2024-11-09

📍 United States

🧭 Full-Time

💸 150000 - 230000 USD per year

🔍 Public safety technology

🏢 Company: Axon

Requirements:
  • This position involves handling of classified federal data; under federal regulations, it is open to US Citizens only.
  • 10+ years of applicable experience.
  • Experience managing cloud platforms such as Azure, AWS, or similar.
  • Experience operating in Kubernetes platforms like AKS, EKS, or similar.
  • Experience using managed languages such as Python, Go, C#, Java, or similar.
  • Experience utilizing CI/CD platforms to automate provisioning infrastructure, software builds, tests, and releases.
  • Experience using observability tools such as APM, logging, and metrics to assist with debugging issues.
  • Experience designing tooling to simplify the operational management of SaaS/PaaS systems.
  • Familiarity with building flexible and testable Infrastructure as Code modules.
  • Empathy to support the needs of software engineers.

Responsibilities:
  • Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, and securely.
  • Exemplify cloud-native site reliability best practices.
  • Write code that is performant, maintainable, clear, and concise.
  • Employ strong problem-solving skills to debug problems in cloud-native distributed systems.
  • Influence and educate the engineering organization to adopt new and improved architectural patterns.
  • Provide robust documentation for use by engineers to promote self-service.
  • Take calculated risks, champion new ideas, and cultivate your craft.

🪄 Skills: AWS, Python, Java, Kubernetes, C#, Azure, Go, CI/CD

Posted 2024-11-07

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

Requirements:
  • Proficiency in programming languages such as Python, Go, and JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.

Responsibilities:
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

🪄 Skills: AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 2024-11-07

📍 CA, CO, CT, FL, GA, IL, IN, KY, MA, MI, MN, NC, NJ, NY, OH, OR, PA, SC, TN, TX, UT, VA, WA, WI

💸 145000 - 175000 USD per year

🔍 Benefits and employee experience

🏢 Company: Jellyvision

Requirements:
  • Demonstrated experience with cloud computing platforms, particularly AWS.
  • Proficient in programming languages including Ruby, Python, and JavaScript.
  • Experienced with configuration management tools such as Ansible, Packer, CloudFormation, and a strong emphasis on Terraform.
  • Skilled in container technologies and orchestration tools like Docker, ECS, and Kubernetes.
  • Experience with continuous integration tools such as GitLab, GitHub, and Jenkins.
  • Knowledge of best practices for monitoring and alerting to ensure system reliability.
  • Exceptional communication skills with various stakeholders.
  • Strong data-driven decision-making capabilities.

Responsibilities:
  • Design applications by advising development teams on best practices and architecting solutions for optimal performance.
  • Optimize CI/CD pipelines through strategic guidance, minimizing manual tasks, and enhancing operational efficiency.
  • Monitor systems by efficiently resolving alerts, participating in on-call rotations, and supporting application management.
  • Mentor team members by providing guidance, seeking continuous learning opportunities, and giving constructive feedback.

🪄 Skills: AWS, Docker, Python, Cloud Computing, JavaScript, Jenkins, Kubernetes, Ruby, Communication Skills, Collaboration, CI/CD

Posted 2024-10-21

📍 USA

💸 160000 - 195000 USD per year

🔍 Healthcare

🏢 Company: Garner Health

Requirements:
  • 5+ years of experience delivering software solutions.
  • 4+ years of hands-on production work with cloud infrastructure, containers, monitoring, and alerting.
  • 3+ years working in a security-conscious environment.
  • Expertise and experience leading cloud-first/only projects, preferably on AWS.
  • Expertise with Terraform.
  • Experience with Kubernetes.
  • Experience with Go and Python, particularly utilizing Kubernetes APIs.

Responsibilities:
  • Architect, operate, improve, and secure the platform the Garner Health app runs on.
  • Boost developer productivity.
  • Build systems to a high engineering standard and ensure others adhere to these standards.
  • Research and advocate for improved techniques, processes, and designs.
  • Collaborate with teammates on strategic platform initiatives.
  • Support the Garner platform in production.
  • Ensure security in production according to regulatory requirements.
  • Partner with stakeholders to maintain product availability and performance.

🪄 Skills: AWS, Python, Kubernetes, Go, Terraform

Posted 2024-10-21

📍 Canada

🔍 Software development for small businesses

🏢 Company: Jobber

Requirements:
  • Demonstrated expertise in providing systems support within a cloud environment, preferably AWS and its various services.
  • Experience with IaC using Terraform.
  • Experience optimizing and improving continuous deployment performance.
  • Ability to juggle multiple projects and incident management.
  • Relevant experience in programming languages such as Ruby, Python, or Bash.
  • Deep passion for learning, reliability, automation, orchestration, and continuous improvement.
  • Strong commitment to problem-solving and keen interest in technology.
  • Exceptional interpersonal skills for collaboration in high-pressure situations.

Responsibilities:
  • Collaborate on the design, implementation, operation, and maintenance of AWS infrastructure.
  • Leverage Infrastructure-as-Code principles.
  • Develop and maintain local development tooling and observability tools, including deployment tools and interfaces to AWS, CircleCI, and other infrastructure components.
  • Participate in on-call rotation and contribute to enhancing the on-call experience for the team.

🪄 Skills: AWS, Python, Bash, Ruby, Ruby on Rails, React, Collaboration, Problem Solving, Terraform

Posted 2024-10-16