Senior Site Reliability Engineer

Posted 2024-10-26

💎 Seniority level: Senior, 5+ years

💸 Salary: 120000 - 140000 USD per year

🔍 Industry: Healthcare

🏢 Company: Hone Health

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: Microservices

Requirements:
  • Proven SRE experience in a highly complex, mission-critical environment (5+ years).
  • Excellent problem-solving skills, with the ability to troubleshoot complex technical issues.
  • Extensive experience with Azure cloud technologies, including designing and optimizing complex architecture.
  • Deep understanding of cloud architecture, microservices, and containerization.
  • Strong understanding of cloud security principles, including security audits and compliance.
  • Proficiency in infrastructure as code and CI/CD pipelines.
  • Experience with disaster recovery planning and business continuity.
  • Strong interpersonal skills for effective collaboration across teams.
  • Azure certifications are a plus.
  • Background in managing EHR/EMR systems in healthcare is a strong plus.
Responsibilities:
  • Oversee the management, optimization, and scaling of EHR/EMR cloud infrastructure.
  • Serve as the primary custodian of the EHR/EMR platform ensuring high availability and data integrity.
  • Manage IT infrastructure including servers, networking, and endpoints.
  • Implement security practices and compliance measures to protect healthcare data.
  • Develop disaster recovery plans and maintain business continuity.
  • Collaborate with cross-functional teams to align technology with company goals.
  • Maintain documentation of technology operations processes and system architecture.
  • Implement monitoring tools to proactively detect issues and optimize performance.

Related Jobs

πŸ“ Czech Republic

πŸ” Software Infrastructure

NOT STATED

  • Solve complex system problems using automation.
  • Address storage issues with innovative software-defined solutions.
  • Manage network challenges effectively.
Posted 2024-11-22

πŸ” Commission management

NOT STATED

  • Operate across the engineering organization to support development teams with the needed tools and processes.
  • Ensure great service quality for paying customers and keep the business informed when issues arise.
  • Provide infrastructure, platform, reliability, and observability support to internal customers.
  • Invest in iterative efforts to refine work and deliver real-world results while improving processes.
Posted 2024-11-22

πŸ“ United States

🧭 Full-Time

πŸ” Legal technology

🏒 Company: Ramp Talent

  • Curiosity, willingness to learn, and passion for continuous improvement.
  • Proficiency in all skills expected of SRE IIs.
  • Bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience.
  • A minimum of 8 years of experience in hands-on technical roles.
  • A minimum of 2 years of Site Reliability Engineering experience.
  • Experience building autonomous systems that manage software operational details without human intervention.

  • Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products.
  • Being the voice of Reliability on your team throughout the SDLC.
  • Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events.
  • Improving the CI/CD pipeline.
  • Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time.
  • Identifying and fixing gaps in the availability of systems.
  • Improving and defending the security of software and systems.
  • Documenting and diagramming processes, procedures, and best practices.
  • Finding, learning, improving, or creating new tools that are reliable, usable, and helpful.
  • Mentoring and training more junior engineers and reviewing their work.
  • Participating in an on-call rotation for 24/7 production reliability support.

Skills: Leadership, CI/CD, Mentoring

Posted 2024-11-21

πŸ“ Poland

πŸ” Software

Posted 2024-11-21

🧭 Full-Time

🔍 Software / SaaS

  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
  • Experience in a cloud-based SaaS environment.
  • Strong expertise in designing and implementing disaster recovery solutions using industry-leading technologies and methodologies.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code (IaC) tools such as Terraform or CloudFormation.
  • Excellent communication skills with the ability to effectively collaborate with cross-functional teams and communicate technical concepts to non-technical stakeholders.

  • Design, implement, and maintain disaster recovery solutions for cloud-based SaaS environments.
  • Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
  • Conduct drills and exercises to test and validate the effectiveness of these plans.
  • Collaborate with engineering, operations, and security teams to identify and mitigate potential risks to system availability and data integrity.
  • Monitor system performance and health metrics; proactively identify areas for improvement.
  • Implement preventive measures to enhance system reliability and resilience.
  • Participate in incident response and post-incident reviews; analyze root causes of failures.
  • Implement corrective actions to prevent recurrence.
Posted 2024-11-21

🧭 Full-Time

🔍 Software Development

  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
  • Strong expertise in designing and implementing disaster recovery solutions using leading technologies.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code (IaC) tools like Terraform or CloudFormation.
  • Excellent communication skills for collaboration with cross-functional teams and non-technical stakeholders.

  • Design, implement, and maintain disaster recovery solutions for a cloud-based SaaS environment.
  • Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
  • Conduct drills and exercises to validate the effectiveness of disaster recovery plans.
  • Collaborate with engineering, operations, and security teams to identify and mitigate risks.
  • Proactively monitor system performance and health metrics, and implement preventive measures.
  • Participate in incident response and post-incident reviews to analyze root causes and implement corrective actions.
Posted 2024-11-20

πŸ“ Canada

πŸ” Software Supply Chain Management

🏒 Company: FOSSA

  • Strong, demonstrated experience as a technical lead designing, building, and maintaining scalable infrastructure and tooling.
  • Strong knowledge of at least one cloud platform and maintaining managed services (we use AWS).
  • Strong experience implementing Infrastructure as Code using Terraform, Helm, and Kubernetes.
  • Experience building and maintaining build pipelines, deploying new services, and familiarity with CI/CD tools such as Buildkite, CircleCI, and GitHub Actions.
  • Experience with logging and monitoring tools such as Datadog, StatsD, Prometheus, and Grafana.
  • Experience with packaging and deploying services using Docker on Linux.
  • Ability to break down complex problems, troubleshoot, drive towards a solution, and communicate it with the team and stakeholders.
  • Willingness to accept feedback and incorporate it into work.
  • Experience with source control tooling and processes, including branching, merging, and rebasing (we use git).
  • Willingness to take part in an on-call rotation.

  • Scale cloud infrastructure to meet increasing demand.
  • Assist development teams in deploying new services.
  • Ensure platform security and adherence to best practices.
  • Improve development tools, CI/CD pipelines, monitoring, and release processes.
  • Help teams use Helm and Kubernetes, and shape best practices.
  • Build access control and secret management solutions.
  • Maintain deployments for on-premise customers.

Skills: AWS, Docker, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Terraform

Posted 2024-11-17

πŸ“ U.S.

🧭 Full-Time

πŸ’Έ 140000 - 160000 USD per year

πŸ” Cybersecurity / Open source software

  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise with multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency with at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi).
  • Proficiency using source control such as Git.
  • Ability to maintain discretion and handle sensitive information.
  • Staying current with trends and new technologies.
  • Collaborative and adaptable mindset.
  • Excellent communication skills.
  • Strong problem-solving skills.

  • Take ownership of the Bitwarden cloud infrastructure, focusing on quality.
  • Evaluate infrastructure regularly, making recommendations for reliability, security, availability, scalability, and cost management.
  • Implement site reliability tools and observability systems.
  • Respond to outages and participate in a 24x7 support strategy.
  • Contribute to architectural designs and engineering operations at scale.
  • Engage in code reviews and spread technical knowledge.
  • Contribute to incident management processes.
  • Collaborate with teams to refine priorities and deliverables.
  • Align SLIs, SLOs, and SLAs with product owners.
  • Identify opportunities for new initiatives.
  • Influence the SDLC as Bitwarden scales.
  • Mentor team members.

Skills: Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Posted 2024-11-16

🧭 Contract

  • Minimum of 5-7 years of experience in Site Reliability Engineering or related fields.
  • Proven experience designing and implementing fault-tolerant, scalable systems.
  • Deep understanding of reliability methodologies like DFR, FMEA, and MTBF.
  • Proficiency with tools such as Datadog, PagerDuty, Marvin, and Backstage.
  • Strong coding skills in one or more programming languages relevant to SRE.
  • Exceptional analytical skills for complex issue investigation.
  • Willingness to learn new products and tools.
  • Excellent communication skills for a distributed team environment.

  • Identify and resolve complex bugs within the codebase.
  • Enhance system reliability, scalability, and performance through code maintenance.
  • Restart services and implement necessary code changes.
  • Investigate complex system issues and develop resolutions.
  • Design and build fault-tolerant, scalable systems for high availability.
  • Apply methodologies like DFR, FMEA, and MTBF.
  • Develop and maintain reliability standards and documentation.
Posted 2024-11-12

πŸ“ LATAM

πŸ” AI development tools

  • Leverage skills, knowledge, and adaptability to address complex infrastructure needs.
  • Provide high-quality solutions tailored to each enterprise customer's unique requirements.

  • Report to the Enterprise Engineering Manager.
  • Set up and maintain infrastructure standards.
  • Play a pivotal role in external and internal tool development.
  • Facilitate software deployment to enterprise customers.
  • Establish partnerships with enterprise customers to improve satisfaction.
  • Manage variances in infrastructure types and implement suitable solutions.

Skills: Leadership, Cloud Computing, Git, Kubernetes, Cross-functional Team Leadership, Communication Skills, Analytical Skills

Posted 2024-11-10