Principal Site Reliability Engineer

Posted 1 day ago

💎 Seniority level: Principal, 10+ years

📍 Location: United States

💸 Salary: 240,000 - 400,000 USD per year

🔍 Industry: Software Development

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D about 3 years ago | Real Time, Big Data, Information Technology, Software

🗣️ Languages: English

⏳ Experience: 10+ years

🪄 Skills: AWS, Docker, Node.js, SQL, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments
  • 10+ years of experience in a DevOps or SRE role
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with IaC tools like Terraform (preferred) or similar
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
  • Deep understanding of SRE practices, such as SLOs, Error Budgets, PRRs, Problem Management
  • Comfortable with a high level of autonomy and working with a distributed team
Responsibilities:
  • Chart the future of Cribl’s observability and reliability systems and practices
  • Conceptualize and direct the evolution of our reliability metrics, programs, and processes based on the state of the art and industry best practices
  • Engage with Product and Engineering teams to improve service delivery and reliability across the entire software lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Uncover risks and seek out the sources of errors and instability in our production systems
  • Advocate for engineering-wide improvements in reliability and observability, and promote antifragility
  • Identify and drive down toil with creative innovation and automation
  • Participate in on-call

Related Jobs

📍 United States of America

💸 150,000 - 160,000 USD per year

🏢 Company: external-northamerica

  • Bachelor’s degree in Computer Science, Engineering, or related field
  • A minimum of 10 years of experience, including at least 5 years in the SRE field, with a proven track record of progressively increasing responsibilities
  • Demonstrated ability to work with cross-functional Development, QE and Operations teams
  • Strong understanding of and experience with automation tools and programming/scripting languages (e.g., PowerShell, Python, Bash) to deliver improvements at both small and large scale
  • Strong understanding of observability tools (e.g., Dynatrace, Datadog, New Relic) and best practices, to implement effective monitoring of SLIs, SLOs, and SLAs
  • Strong experience and understanding of software engineering, Infrastructure as Code (Ansible or Terraform) and build/deployment pipelines.
  • Strong troubleshooting skills coupled with making data-driven decisions during incidents, to improve time to detect and resolve issues.
  • Strong understanding of cloud computing platforms (Azure or Google Cloud) and cloud-native setups (AKS, serverless, etc.).
  • Contribute significantly to the reliability, scalability and availability of Bright Horizons' digital infrastructure by enforcing best practices of redundancy and resiliency across applications and infrastructure.
  • Implement robust infrastructure, application and digital-experience monitoring in our enterprise-wide APM tool Dynatrace. Proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
  • Drive troubleshooting of critical incidents through developing a deep and broad understanding of our enterprise architecture across all 7 OSI layers.
  • Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance the overall efficiency of the Product, Engineering and SRE teams.
  • In addition to owning the observability tools, create a roadmap to expand and consolidate them. The roadmap should provide a 360-degree view of cross-functional areas like SRE, DevOps, Application Support, Monitoring, Incident Management, Infrastructure and Enterprise Architecture.
  • Collaborate with the above cross-functional teams to drive a unified approach to site reliability that optimizes their work and improves time-to-market for all respective objectives.
  • Work closely with Infrastructure and Architecture teams to design and implement roadmaps for scaling server and serverless architecture using containers and IaC tools such as Ansible and Terraform.

AWS, Python, Bash, Cloud Computing, Kubernetes, Azure, CI/CD, DevOps, Terraform, Ansible, Scripting, Software Engineering

Posted 6 days ago

📍 United States

🧭 Full-Time

💸 225,900 - 319,000 USD per year

🔍 Software Development

🏢 Company: HashiCorp | 👥 1001-5000 | 💰 Secondary Market about 4 years ago | 🫂 Last layoff almost 2 years ago | Private Cloud, DevOps, Information Technology, Cyber Security, Software, Cloud Infrastructure

  • Has worked on a team of SREs or engineers to improve reliability
  • Loves data, analyzing data, and helping teams draw conclusions from data
  • Is an effective communicator, collaborator, and influencer with both engineering and product teams
  • Is able to identify pragmatic and ideal solutions blending customer feedback with technical foresight
  • Is a technical leader capable of crafting long-term strategy and vision
  • Drive an operational excellence mindset throughout our teams
  • Collaborate with existing SREs across our SaaS products and look for ways to unify reliability and operational behavior and features across SaaS and on-premises products
  • Work with product teams to drive reliability and operational excellence requirements that continuously improve our products
  • Help develop educational materials and shared tooling to level up reliability and operational thinking across our product teams
  • Participate in feature partnerships and reviews to ensure compliance with reliability and operational requirements
  • Participate in crucial decision-making related to various reliability programs

AWS, Backend Development, Leadership, Cloud Computing, Data Analysis, Java, Kubernetes, Zabbix, Go, Grafana, Prometheus, Communication Skills, Analytical Skills, Collaboration, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Data visualization, Strategic thinking, Data modeling, NodeJS, Scripting, Software Engineering, SaaS

Posted about 2 months ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Jama Software | 👥 251-500 | 💰 $200,000,000 Private almost 7 years ago | Developer Platform, Manufacturing, Enterprise Software, Collaboration, Software

  • 5+ years of experience with AWS components such as EC2 and CloudFormation
  • 5+ years of experience operating fault-tolerant, scalable applications
  • 4+ years of systems engineering/administration experience with web applications
  • Scripting and automation skills with languages like Python
  • Architect, build, and maintain highly available systems using AWS
  • Use Terraform for infrastructure as code
  • Document designs and implementations
  • Partner with SRE and Engineering teams for reliability best practices

AWS, Docker, PostgreSQL, Python, Bash, Java, MySQL, Nginx, REST API, Tomcat, Terraform

Posted 4 months ago