Sr HA-Systems Engineer

Posted 5 days ago

💎 Seniority level: Senior, 8+ years

📍 Location: United States, Canada, Costa Rica

💸 Salary: 103,050 - 179,700 USD per year

🔍 Industry: M&A and high-value transactions technology

🏢 Company: Datasite

⏳ Experience: 8+ years

🪄 Skills: AWS, Docker, Python, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Linux, Terraform, Microservices, Ansible

Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).
8+ years of experience in systems engineering, infrastructure architecture, or related fields.
Proven track record of designing and implementing highly available, fault-tolerant systems in cloud or on-prem environments.
Experience with distributed systems, microservices architecture, and high availability patterns.
Proficiency with cloud platforms (Azure, GCP, AWS) or on-prem data centers, and with cloud-native technologies.
Deep knowledge and understanding of Linux systems.
Experience using monitoring and observability tools (Prometheus, Grafana, Loki, etc.).
Strong coding/scripting skills in Python, Go, or Shell for automation.
Excellent problem-solving skills with a focus on resilience and scalability.
Strong communication skills with the ability to convey complex technical concepts to diverse stakeholders.
Ability to work independently and take ownership of projects from inception to deployment.

Architect and build highly available, fault-tolerant systems to support mission-critical applications.
Collaborate with cross-functional teams to design scalable, robust, and secure cloud-based solutions.
Develop strategies for disaster recovery, data replication, and failover processes.
Analyze system performance, identify bottlenecks, and implement optimizations to sustain high uptime and performance.
Conduct load testing, capacity planning, and performance tuning to meet high availability requirements.
Utilize monitoring tools to proactively detect issues and minimize downtime.
Develop and maintain infrastructure as code (IaC) using tools like Terraform and Ansible.
Implement automation for deployments, scaling, and configuration management to reduce manual intervention and increase system reliability.
Lead incident response and root cause analysis for system outages, ensuring quick resolution and prevention of future incidents.
Build and maintain robust monitoring, alerting, and diagnostic systems for proactive issue identification.
Provide technical leadership, mentorship, and guidance to junior engineers.