Sr HA-Systems Engineer

Posted 5 days ago

💎 Seniority level: Senior, 8+ years

📍 Location: United States, Canada, Costa Rica

💸 Salary: 103,050 - 179,700 USD per year

🔍 Industry: M&A and high-value transactions technology

🏢 Company: Datasite

⏳ Experience: 8+ years

🪄 Skills: AWS, Docker, Python, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Linux, Terraform, Microservices, Ansible

Requirements:
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 8+ years of experience in systems engineering, infrastructure architecture, or related fields.
  • Proven track record of designing and implementing highly available, fault-tolerant systems in cloud or on-prem environments.
  • Experience with distributed systems, microservices architecture, and high availability patterns.
  • Proficiency with cloud platforms (Azure, GCP, AWS) or on-prem data centers, and with cloud-native technologies.
  • Deep knowledge and understanding of Linux systems.
  • Experience using monitoring and observability tools (Prometheus, Grafana, Loki, etc.).
  • Strong coding/scripting skills in Python, Go, or Shell for automation.
  • Excellent problem-solving skills with a focus on resilience and scalability.
  • Strong communication skills with the ability to convey complex technical concepts to diverse stakeholders.
  • Ability to work independently and take ownership of projects from inception to deployment.
Responsibilities:
  • Architect and build highly available, fault-tolerant systems to support mission-critical applications.
  • Collaborate with cross-functional teams to design scalable, robust, and secure cloud-based solutions.
  • Develop strategies for disaster recovery, data replication, and failover processes.
  • Analyze system performance, identify bottlenecks, and implement optimizations to maximize uptime and performance.
  • Conduct load testing, capacity planning, and performance tuning to meet high availability requirements.
  • Utilize monitoring tools to proactively detect issues and minimize downtime.
  • Develop and maintain infrastructure as code (IaC) using tools like Terraform and Ansible.
  • Implement automation for deployments, scaling, and configuration management to reduce manual intervention and increase system reliability.
  • Lead incident response and root cause analysis for system outages, ensuring quick resolution and prevention of future incidents.
  • Build and maintain robust monitoring, alerting, and diagnostic systems for proactive issue identification.
  • Provide technical leadership, mentorship, and guidance to junior engineers.