Site Reliability Engineer

Posted 2024-10-21

💎 Seniority level: Senior

📍 Location: Canada, United States

🔍 Industry: Cyber Security

🏢 Company: BeyondTrust

🪄 Skills: AWS, Leadership, Amazon RDS, AWS EKS, Cloud Computing, Elasticsearch, Amazon Web Services, CI/CD

Requirements:

Experience in designing and building enterprise-ready cloud-native platforms, with a passion for researching and managing solutions.
High standards with continuous improvement towards high-quality products, services, and processes.
Ability to simplify complexity and empower development teams.
Decision-making based on data with a focus on balancing speed and risk.
Understanding of the importance of observability and metric dashboards.
Technical familiarity with AWS Cloud Resources (S3, EC2, EKS, RDS, etc.), Service Mesh (Istio), Infrastructure as Code (Terraform, AWS CDK), and Continuous Delivery tools (ArgoCD, GitHub Actions).

Responsibilities:

Define a platform for engineering teams to utilize automated, self-service, scalable, efficient, observable, and reliable infrastructure services as a product.
Design long-term technical solutions and cross-team mechanisms to achieve reliability goals.
Provide expert technical guidance and feedback during engineering design reviews using observability tools.
Deliver common, reusable tools, capabilities, and interfaces to the cloud platform solution.
Collaborate with SREs and senior engineers on best practices.
Align and help drive execution of the Platform Infrastructure team’s strategy.
Reduce toil through automation.

Related Jobs

🔥 Senior Site Reliability Engineer

Posted 2024-11-21

📍 United States

🧭 Full-Time

🔍 Legal technology

🏢 Company: Ramp Talent

🔧 Requirements

Curiosity, willingness to learn, and passion for continuous improvement.
Proficiency in all skills expected of SRE IIs.
Bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience.
A minimum of 8 years of experience in hands-on technical roles.
A minimum of 2 years of Site Reliability Engineering experience.
Experience building autonomous systems that manage software operational details without human intervention.

💡 Responsibilities

Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products.
Being the voice of Reliability on your team throughout the SDLC.
Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events.
Improving the CI/CD pipeline.
Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time.
Identifying and fixing gaps in the availability of systems.
Improving and defending the security of software and systems.
Documenting and diagramming processes, procedures, and best practices.
Finding, learning, improving, or creating new tools that are reliable, usable, and helpful.
Mentoring, training, and reviewing more junior engineers.
Participating in an on-call rotation for 24/7 production reliability support.

🪄 Skills: Leadership, CI/CD, Mentoring

🔥 Site Reliability Engineer - Embedded

Posted 2024-11-21

📍 United States

🧭 Full-Time

💸 102600 - 120323 USD per year

🔍 Recycling technology

🏢 Company: AMP Sortation

🔧 Requirements

Strong technical communication skills for ticket escalations.
Strong interpersonal skills for communicating with individuals impacted by downtime.
Experience troubleshooting Linux systems.
Demonstrated coding experience in C++ or Rust.
Desire to learn professional software engineering practices.
Proficiency in managing tasks under sprint or kanban methodology.
Passion for green technology and emissions reduction.

💡 Responsibilities

Triage and respond to tickets during core working hours.
Troubleshoot operating system, hardware, networking, and application issues.
Maintain documentation for engineering support.
Define improvements to the Jira ticketing system.
Develop and support AMP's observability stack.

🪄 Skills: C++, Jira, Grafana, Prometheus, Rust, Communication Skills, Linux, Documentation

🔥 Senior Site Reliability Engineer, Databases

Posted 2024-11-20

📍 USA

💸 170000 - 190000 USD per year

🔍 Email Security

🏢 Company: Valimail

🔧 Requirements

5+ years experience building and maintaining highly available relational databases.
Work collaboratively with cross-functional teams.
Value team success over individual success.
Put industry and engineering best practices into practice and promote them to others.
Passion for reliable, scalable, and performant datastores, with a strong sense of ownership.
Experience building and supporting highly performant and highly reliable datastores.
Deep experience working with Postgres.
Expert in database fundamentals, SQL, and PL/pgSQL (or other).
Experience with NoSQL datastores and caching solutions.
Working knowledge of AWS or Azure cloud providers.
Experience with Infrastructure-as-Code tools, such as Terraform.

💡 Responsibilities

Evangelizing standard methodologies for building and operating highly reliable data storage systems.
Serving as the subject matter expert in datastore design and performance.
Building and supporting Valimail’s mission-critical datastores.
Conducting timely postmortems of production datastore incidents.
Collaboratively designing systems with other engineers to meet reliability, scalability, and performance requirements.
Providing assistance to teams working with datastores.
Automating routine database tasks.
Participating in on-call rotation and incident response.
Upgrading data storage systems as necessary.

🪄 Skills: AWS, SQL, Azure, Postgres, NoSQL, Terraform

🔥 Senior Site Reliability Engineer

Posted 2024-11-17

📍 Canada

🔍 Software Supply Chain Management

🏢 Company: FOSSA

🔧 Requirements

Strong, demonstrated experience as a technical lead designing, building, and maintaining scalable infrastructure and tooling.
Strong knowledge of at least one cloud platform and maintaining managed services (we use AWS).
Strong experience implementing Infrastructure as Code using Terraform, Helm, and Kubernetes.
Experience building and maintaining build pipelines, deploying new services, and familiarity with CI/CD tools such as Buildkite, CircleCI, and GitHub Actions.
Experience with logging and monitoring tools such as Datadog, Statsd, Prometheus, Grafana.
Experience with packaging and deploying services using Docker on Linux.
Ability to break down complex problems, troubleshoot, drive towards a solution, and communicate it with the team and stakeholders.
Willingness to accept feedback and incorporate it into work.
Experience with source control tooling and processes, including branching, merging, and rebasing (we use git).
Willingness to take part in an on-call rotation.

💡 Responsibilities

Scale cloud infrastructure to meet increasing demand.
Assist development teams in deploying new services.
Ensure platform security and adherence to best practices.
Improve development tools, CI/CD pipelines, monitoring, and release processes.
Help teams use Helm and Kubernetes, and shape best practices.
Build access control and secret management solutions.
Maintain deployments for on-premise customers.

🪄 Skills: AWS, Docker, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 2024-11-16

📍 U.S.

🧭 Full-Time

💸 140000 - 160000 USD per year

🔍 Cybersecurity / Open source software

🔧 Requirements

Sense of curiosity, resourcefulness, and pragmatism.
Expertise with multi-region deployments in public cloud environments.
Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
Strong background in Reliability Engineering, DevOps, Software Engineering.
Fluency with at least one programming language, such as C#, Python, or Go.
Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi).
Proficiency using source control such as Git.
Ability to maintain discretion and handle sensitive information.
Staying current with trends and new technologies.
Collaborative and adaptable mindset.
Excellent communication skills.
Strong problem-solving skills.

💡 Responsibilities

Take ownership of the Bitwarden cloud infrastructure, focusing on quality.
Evaluate infrastructure regularly, making recommendations for reliability, security, availability, scalability, and cost management.
Implement site reliability tools and observability systems.
Respond to outages and participate in a 24x7 support strategy.
Contribute to architectural designs and engineering operations at scale.
Engage in code reviews and spread technical knowledge.
Contribute to incident management processes.
Collaborate with teams to refine priorities and deliverables.
Align SLIs, SLOs, and SLAs with product owners.
Identify opportunities for new initiatives.
Influence the SDLC as Bitwarden scales.
Mentor team members.

🪄 Skills: Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

🔥 Senior Site Reliability Engineer (Remote First)

Posted 2024-11-15

📍 Canada

🧭 Full-Time

🔍 InsurTech

🔧 Requirements

Extensive experience in infrastructure security, monitoring, release engineering, and developer tooling.
Ability to coach and mentor less experienced professionals.
Demonstrated leadership skills in guiding teams and improving capabilities.

💡 Responsibilities

Work with the Engineering Department to develop and provide infrastructure security, monitoring, release engineering, and developer tooling based on group-level and department-level requirements.
Provide guidance and leadership to DevOps chapter representatives from teams across the Engineering Department.
Suggest, plan, guide, and assist with the development and implementation of infrastructure to support goals.
Coach and mentor lower-level professionals.
Assist the Engineering Leadership Team in continuously improving craft capabilities.

🪄 Skills: Leadership, Mentoring, DevOps, Coaching

🔥 Principal Site Reliability Engineer

Posted 2024-11-15

📍 United States

🧭 Full-Time

💸 204000 - 281000 USD per year

🔍 Cybersecurity

🏢 Company: SentinelOne

🔧 Requirements

Extensive SRE Experience: Proven experience in architecting and implementing SRE solutions at scale within a microservices or distributed systems environment.
15+ years of progressive professional experience, with 5+ years of recent experience supporting enterprise SaaS environments.
Technical Expertise: Deep knowledge of incident management, alert correlation, automated triage, and SLO frameworks.
Proficiency in one or more programming languages (e.g., Python, Go, Java) with experience in automation and scripting.
Experience with machine learning and data analytics for real-time alert systems.
Expertise in cloud platforms (e.g., AWS, GCP, Azure) and container orchestration (e.g., Kubernetes).
Ability to make critical architectural decisions focused on business impact and system performance.

💡 Responsibilities

Design and guide the implementation of end-to-end alert correlation, auto-triage, and auto-remediation frameworks for a microservices SaaS architecture.
Ensure solutions align with business priorities and customer impact goals.
Define, implement, and monitor SLOs in collaboration with product and engineering teams.
Establish reliability standards to drive accountability around service performance.
Partner with software engineers, SREs, and data scientists to implement monitoring, alerting, and SLO solutions.
Lead initiatives promoting best practices across SentinelOne engineering.
Mentor engineers and contribute to a culture of reliability engineering excellence.

🪄 Skills: AWS, Leadership, Python, Data Analysis, GCP, Java, Kubernetes, Machine Learning, Azure, Go, Collaboration, Terraform, Microservices

🔥 Site Reliability Engineer, Edge

Posted 2024-11-13

📍 United States

💸 192000 - 288000 USD per year

🔍 Frontend Cloud and web services

🏢 Company: Vercel

🔧 Requirements

At least 3 years of experience in an SRE role, or at least 5 years of experience in an adjacent role (e.g., platform engineering), operating in a scaled environment.
Firm grasp of the SRE philosophy and mindset, with practical experience working on or directly with SRE teams that have proactively engaged in system design and improvement.
Strong sense of accountability and commitment to problem-solving, backed by curiosity to dig deep and identify root causes.
Willingness to proactively engage with development teams to influence the course of software design and operational practices.
Capability to manage risk, make decisions, and exhibit sound judgment.
Demonstrated ability to plan and deliver long-term projects.
Familiarity with networking protocols and application serving.
Experience deploying and operating systems on AWS infrastructure at scale.
Bonus: Experience working with Terraform, Kubernetes, Golang, and/or Lua.

💡 Responsibilities

Ensure that our products are built for reliability and scale by engaging in the end-to-end design, development, and deployment of new software.
Drive continuous risk mitigation and reduction through direct involvement in incident management, blameless postmortems, and follow-ups.
Drive measurable improvements to the reliability, performance, and efficiency of our production systems through instrumentation, analysis, and implementation of engineering improvements.
Devise repeatable, low-toil operational practices through the development of automated systems for software delivery, system failover, and capacity management.

🪄 Skills: AWS, Problem Solving

🔥 Sr. Site Reliability Engineer, Incident Excellence

Posted 2024-11-12

📍 United States

🧭 Full-Time

💸 147100 - 207600 USD per year

🔍 Cloud Infrastructure and Software Engineering

🏢 Company: HashiCorp

🔧 Requirements

Professional experience designing or operating disaster recovery processes in a distributed cloud environment.
Professional experience with incident management in cloud environments.
Enjoy working on various scopes spanning software engineering, cloud infrastructure, and SRE.
Experience contributing to efficiency improvements of software at scale.
Experience collaborating cross-functionally to deliver engineering culture change.
Worked on infrastructure teams in customer-centric and agile organizations with empathy and compassion.
Worked with SaaS or other managed software offerings.
Experience in one or more of the major public clouds.

💡 Responsibilities

Utilize software engineering experience to solve problems and build automation for incident lifecycle management.
Coordinate disaster recovery processes and identify strategic process improvements.
Drive incident management capabilities and culture.
Participate in incident command on-call rotation.
Support incident management tooling.
Build technical skills and relationships within a team of engineers and SREs.
Learn, teach, and collaborate cross-functionally.

🪄 Skills: Agile, Product Development, Strategy, Communication Skills, Collaboration

🔥 Senior Site Reliability Engineer - US/Canada

Posted 2024-11-09

📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

🔧 Requirements

5+ years of experience with production environments running Linux.
3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
Familiarity with big data technologies such as Spark and/or Flink.
Passion for automating tasks through coding and scripting.
Experience with algorithms, data structures, complexity analysis, and software design.
Proficient coding skills in Python, Java, and Bash.

💡 Responsibilities

Design, implement, and maintain release automation pipelines to streamline the deployment process.
Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
Collaborate with engineering teams to enhance system performance and manage capacity planning.

🪄 Skills: Linux
