Site Reliability Engineer

Posted 2024-10-21

💎 Seniority level: Senior

📍 Location: Canada, United States

🔍 Industry: Cyber Security

🏢 Company: BeyondTrust

🪄 Skills: AWS, Leadership, Amazon RDS, AWS EKS, Cloud Computing, Elasticsearch, Amazon Web Services, CI/CD

Requirements:
  • Experience in designing and building enterprise-ready cloud-native platforms, with a passion for researching and managing solutions.
  • High standards and a commitment to continuously improving products, services, and processes.
  • Ability to simplify complexity and empower development teams.
  • Decision-making based on data with a focus on balancing speed and risk.
  • Understanding of the importance of observability and metric dashboards.
  • Technical familiarity with AWS Cloud Resources (S3, EC2, EKS, RDS, etc.), Service Mesh (Istio), Infrastructure as Code (Terraform, AWS CDK), and Continuous Delivery tools (ArgoCD, GitHub Actions).
Responsibilities:
  • Define a platform for engineering teams to utilize automated, self-service, scalable, efficient, observable, and reliable infrastructure services as a product.
  • Design long-term technical solutions and cross-team mechanisms to achieve reliability goals.
  • Provide expert technical guidance and feedback during engineering design reviews using observability tools.
  • Deliver common, reusable tools, capabilities, and interfaces to the cloud platform solution.
  • Collaborate with SREs and senior engineers on best practices.
  • Align and help drive execution of the Platform Infrastructure team's strategy.
  • Reduce toil through automation.

Related Jobs

📍 United States

🧭 Full-Time

🔍 Legal technology

🏢 Company: Ramp Talent

  • Curiosity, willingness to learn, and passion for continuous improvement.
  • Proficiency in all skills expected of an SRE II.
  • Bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience.
  • A minimum of 8 years of experience in hands-on technical roles.
  • A minimum of 2 years of Site Reliability Engineering experience.
  • Experience building autonomous systems that manage software operational details without human intervention.

  • Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products.
  • Being the voice of Reliability on your team throughout the SDLC.
  • Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events.
  • Improving the CI/CD pipeline.
  • Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time.
  • Identifying and fixing gaps in the availability of systems.
  • Improving and defending the security of software and systems.
  • Documenting and diagramming processes, procedures, and best practices.
  • Finding, learning, improving, or creating new tools that are reliable, usable, and helpful.
  • Mentoring, training, and reviewing more junior engineers.
  • Participating in an on-call rotation for 24/7 production reliability support.

Leadership, CI/CD, Mentoring

Posted 2024-11-21

📍 United States

🧭 Full-Time

💸 102600 - 120323 USD per year

🔍 Recycling technology

🏢 Company: AMP Sortation

  • Strong technical communication skills for ticket escalations.
  • Strong interpersonal skills for communicating with individuals impacted by downtime.
  • Experience troubleshooting Linux systems.
  • Demonstrated coding experience in C++ or Rust.
  • Desire to learn professional software engineering practices.
  • Proficiency in managing tasks under sprint or kanban methodology.
  • Passion for green technology and emissions reduction.

  • Triage and respond to tickets during core working hours.
  • Troubleshoot operating system, hardware, networking, and application issues.
  • Maintain documentation for engineering support.
  • Define improvements to the Jira ticketing system.
  • Develop and support AMP's observability stack.

C++, Jira, Grafana, Prometheus, Rust, Communication Skills, Linux, Documentation

Posted 2024-11-21

📍 USA

💸 170000 - 190000 USD per year

🔍 Email Security

🏢 Company: Valimail

  • 5+ years of experience building and maintaining highly available relational databases.
  • Work collaboratively with cross-functional teams.
  • Value team success over individual success.
  • Put industry and engineering best practices into practice and promote them to others.
  • Passion for reliable, scalable, and performant datastores, with a strong sense of ownership.
  • Experience building and supporting highly performant and highly reliable datastores.
  • Deep experience working with Postgres.
  • Expertise in database fundamentals, SQL, and PL/pgSQL (or an equivalent).
  • Experience with NoSQL datastores and caching solutions.
  • Working knowledge of AWS or Azure cloud providers.
  • Experience with Infrastructure-as-Code tools, such as Terraform.

  • Evangelizing standard methodologies for building and operating highly reliable data storage systems.
  • Serving as the subject matter expert in datastore design and performance.
  • Building and supporting Valimail's mission-critical datastores.
  • Conducting timely postmortems of production datastore incidents.
  • Collaboratively designing systems with other engineers to meet reliability, scalability, and performance requirements.
  • Providing assistance to teams working with datastores.
  • Automating routine database tasks.
  • Participating in on-call rotation and incident response.
  • Upgrading data storage systems as necessary.

AWS, SQL, Azure, Postgres, NoSQL, Terraform

Posted 2024-11-20

📍 Canada

🔍 Software Supply Chain Management

🏢 Company: FOSSA

  • Strong, demonstrated experience as a technical lead designing, building, and maintaining scalable infrastructure and tooling.
  • Strong knowledge of at least one cloud platform and maintaining managed services (we use AWS).
  • Strong experience implementing Infrastructure as Code using Terraform, Helm, and Kubernetes.
  • Experience building and maintaining build pipelines, deploying new services, and familiarity with CI/CD tools such as Buildkite, CircleCI, and GitHub Actions.
  • Experience with logging and monitoring tools such as Datadog, Statsd, Prometheus, Grafana.
  • Experience with packaging and deploying services using Docker on Linux.
  • Ability to break down complex problems, troubleshoot, drive towards a solution, and communicate it with the team and stakeholders.
  • Willingness to accept feedback and incorporate it into work.
  • Experience with source control tooling and processes, including branching, merging, and rebasing (we use git).
  • Willingness to take part in an on-call rotation.

  • Scale cloud infrastructure to meet increasing demand.
  • Assist development teams in deploying new services.
  • Ensure platform security and adherence to best practices.
  • Improve development tools, CI/CD pipelines, monitoring, and release processes.
  • Help teams use Helm and Kubernetes, and shape best practices.
  • Build access control and secret management solutions.
  • Maintain deployments for on-premise customers.

AWS, Docker, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Terraform

Posted 2024-11-17

📍 U.S.

🧭 Full-Time

💸 140000 - 160000 USD per year

🔍 Cybersecurity / Open source software

  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise with multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency with at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi).
  • Proficiency using source control such as Git.
  • Ability to maintain discretion and handle sensitive information.
  • Staying current with trends and new technologies.
  • Collaborative and adaptable mindset.
  • Excellent communication skills.
  • Strong problem-solving skills.

  • Take ownership of the Bitwarden cloud infrastructure, focusing on quality.
  • Evaluate infrastructure regularly, making recommendations for reliability, security, availability, scalability, and cost management.
  • Implement site reliability tools and observability systems.
  • Respond to outages and participate in a 24x7 support strategy.
  • Contribute to architectural designs and engineering operations at scale.
  • Engage in code reviews and spread technical knowledge.
  • Contribute to incident management processes.
  • Collaborate with teams to refine priorities and deliverables.
  • Align SLIs, SLOs, and SLAs with product owners.
  • Identify opportunities for new initiatives.
  • Influence the SDLC as Bitwarden scales.
  • Mentor team members.

Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Posted 2024-11-16

📍 Canada

🧭 Full-Time

🔍 InsurTech

  • Extensive experience in infrastructure security, monitoring, release engineering, and developer tooling.
  • Ability to coach and mentor less experienced professionals.
  • Demonstrated leadership skills in guiding teams and improving capabilities.

  • Work with the Engineering Department to develop and provide infrastructure security, monitoring, release engineering, and developer tooling based on group-level and department-level requirements.
  • Provide guidance and leadership to DevOps chapter representatives from teams across the Engineering Department.
  • Suggest, plan, guide, and assist with the development and implementation of infrastructure to support goals.
  • Coach and mentor lower-level professionals.
  • Assist the Engineering Leadership Team in continuously improving craft capabilities.

Leadership, Mentoring, DevOps, Coaching

Posted 2024-11-15

📍 United States

🧭 Full-Time

💸 204000 - 281000 USD per year

🔍 Cybersecurity

🏢 Company: SentinelOne

  • Extensive SRE Experience: Proven experience in architecting and implementing SRE solutions at scale within a microservices or distributed systems environment.
  • 15+ years of progressive professional experience, with 5+ years of recent experience supporting enterprise SaaS environments.
  • Technical Expertise: Deep knowledge of incident management, alert correlation, automated triage, and SLO frameworks.
  • Proficiency in one or more programming languages (e.g., Python, Go, Java) with experience in automation and scripting.
  • Experience with machine learning and data analytics for real-time alert systems.
  • Expertise in cloud platforms (e.g., AWS, GCP, Azure) and container orchestration (e.g., Kubernetes).
  • Ability to make critical architectural decisions focused on business impact and system performance.

  • Design and guide the implementation of end-to-end alert correlation, auto-triage, and auto-remediation frameworks for a microservices SaaS architecture.
  • Ensure solutions align with business priorities and customer impact goals.
  • Define, implement, and monitor SLOs in collaboration with product and engineering teams.
  • Establish reliability standards to drive accountability around service performance.
  • Partner with software engineers, SREs, and data scientists to implement monitoring, alerting, and SLO solutions.
  • Lead initiatives promoting best practices across SentinelOne engineering.
  • Mentor engineers and contribute to a culture of reliability engineering excellence.

AWS, Leadership, Python, Data Analysis, GCP, Java, Kubernetes, Machine Learning, Azure, Go, Collaboration, Terraform, Microservices

Posted 2024-11-15

📍 United States

💸 192000 - 288000 USD per year

🔍 Frontend Cloud and web services

🏢 Company: Vercel

  • At least 3 years of experience in an SRE role, or at least 5 years of experience in an adjacent role (e.g., platform engineering), operating in a scaled environment.
  • Firm grasp of the SRE philosophy and mindset, with practical experience working on or directly with SRE teams that have proactively engaged in system design and improvement.
  • Strong sense of accountability and commitment to problem-solving, backed by curiosity to dig deep and identify root causes.
  • Willingness to proactively engage with development teams to influence the course of software design and operational practices.
  • Capability to manage risk, make decisions, and exhibit sound judgment.
  • Demonstrated ability to plan and deliver long-term projects.
  • Familiarity with networking protocols and application serving.
  • Experience deploying and operating systems on AWS infrastructure at scale.
  • Bonus: Experience working with Terraform, Kubernetes, Golang, and/or Lua.

  • Ensure that our products are built for reliability and scale by engaging in the end-to-end design, development, and deployment of new software.
  • Drive continuous risk mitigation and reduction through direct involvement in incident management, blameless postmortems, and follow-ups.
  • Drive measurable improvements to the reliability, performance, and efficiency of our production systems through instrumentation, analysis, and implementation of engineering improvements.
  • Devise repeatable, low-toil operational practices through the development of automated systems for software delivery, system failover, and capacity management.

AWS, Problem Solving

Posted 2024-11-13

📍 United States

🧭 Full-Time

💸 147100 - 207600 USD per year

🔍 Cloud Infrastructure and Software Engineering

🏢 Company: HashiCorp

  • Professional experience designing or operating disaster recovery processes in a distributed cloud environment.
  • Professional experience with incident management in cloud environments.
  • Enjoy working on various scopes spanning software engineering, cloud infrastructure, and SRE.
  • Experience contributing to efficiency improvements of software at scale.
  • Experience collaborating cross-functionally to deliver engineering culture change.
  • Worked on infrastructure teams in customer-centric and agile organizations with empathy and compassion.
  • Worked with SaaS or other managed software offerings.
  • Experience in one or more of the major public clouds.

  • Utilize software engineering experience to solve problems and build automation for incident lifecycle management.
  • Coordinate disaster recovery processes and identify strategic process improvements.
  • Drive incident management capabilities and culture.
  • Participate in incident command on-call rotation.
  • Support incident management tooling.
  • Build technical skills and relationships within a team of engineers and SREs.
  • Learn, teach, and collaborate cross-functionally.

Agile, Product Development, Strategy, Communication Skills, Collaboration

Posted 2024-11-12

📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

  • 5+ years of experience with production environments running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.

  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

Linux

Posted 2024-11-09