Principal Engineer, Site Reliability Engineering - Observability

Posted 2024-11-23

💎 Seniority level: Principal

📍 Location: United States

💸 Salary: 204,000 - 281,000 USD per year

🔍 Industry: Cybersecurity

🏢 Company: SentinelOne

⏳ Experience: 15+ years of progressive professional experience, with 5+ years of recent experience supporting enterprise SaaS environments

🪄 Skills: AWS, Leadership, Python, Data Analysis, GCP, Java, Kubernetes, Machine Learning, Azure, Go, Collaboration, Terraform, Microservices

Requirements:
  • Proven experience in architecting and implementing SRE solutions at scale within a microservices or distributed systems environment.
  • 15+ years of progressive professional experience, with 5+ years of recent experience supporting enterprise SaaS environments.
  • Deep knowledge of incident management, alert correlation, automated triage, self-healing strategies, and SLO frameworks.
  • Strong understanding of observability platforms, including monitoring, logging, and tracing solutions.
  • Proficiency in one or more programming languages (e.g., Python, Go, Java), with experience in automation and scripting for incident management workflows.
  • Experience with machine learning, anomaly detection, or data analytics techniques for real-time alert correlation and triage systems.
  • Expertise in cloud platforms (e.g., AWS, GCP, Azure) and container orchestration (e.g., Kubernetes), with experience in infrastructure-as-code (e.g., Terraform).
  • Ability to make critical architectural decisions with a focus on business impact, reliability, and system performance.
Responsibilities:
  • Design and guide the implementation of end-to-end alert correlation, auto-triage, and auto-remediation frameworks that meet the needs of a microservices-based SaaS architecture.
  • Ensure solutions align with business priorities and customer impact goals.
  • Define, implement, and monitor SLOs in collaboration with product and engineering teams.
  • Establish reliability standards that meet business and customer expectations, driving accountability and transparency around service performance.
  • Partner with software engineers, SREs, and data scientists to implement and refine monitoring, alerting, alert correlation, auto-remediation, and SLO solutions.
  • Lead initiatives to promote best practices and knowledge sharing across all of SentinelOne engineering.
  • Mentor engineers and contribute to a culture of reliability engineering excellence through thought leadership and guidance on advanced SRE principles and practices.