Principal Site Reliability Engineer

Posted 1 day ago

💎 Seniority level: Principal, 10+ years

📍 Location: United States

💸 Salary: 240,000 - 400,000 USD per year

🔍 Industry: Software Development

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D about 3 years ago | Real Time Big Data Information Technology Software

🗣️ Languages: English

⏳ Experience: 10+ years

🪄 Skills: AWS, Docker, Node.js, SQL, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

Requirements:

Extensive experience with enterprise scale continuous delivery environments
10+ years of experience in a DevOps or SRE role
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with IaC tools like Terraform (preferred) or similar
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM, observability, and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
Deep understanding of SRE practices, such as SLOs, Error Budgets, PRRs, Problem Management
Comfortable with a high level of autonomy and working with a distributed team

Responsibilities:

Chart the future of Cribl’s observability and reliability systems and practices
Conceptualize and direct the evolution of our reliability metrics, programs, and processes based on the state of the art and industry best practices
Engage with Product and Engineering teams to improve service delivery and reliability across the entire software lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Uncover risks and seek out the sources of errors and instability in our production systems
Advocate for engineering-wide improvements in reliability and observability, and promote antifragility
Identify and drive down toil with creative innovation and automation
Participate in on-call

Related Jobs

🔥 Principal Site Reliability Engineer - Remote

Posted 6 days ago

📍 United States of America

💸 150,000 - 160,000 USD per year

🏢 Company: external-northamerica

🔧 Requirements

Bachelor’s degree in Computer Science, Engineering, or related field
A minimum of 10 years of experience, including at least 5 years in the SRE field, with a proven track record of progressively increasing responsibilities
Demonstrated ability to work with cross-functional Development, QE and Operations teams
Strong understanding and experience in automation tools and programming/scripting languages (e.g., PowerShell, Python, Bash) to deliver improvements at a small and large scale.
Strong understanding of Observability tools (e.g., Dynatrace, Datadog, New Relic) and best practices, to implement effective monitoring of SLIs/SLOs/SLAs.
Strong experience and understanding of software engineering, Infrastructure as Code (Ansible or Terraform) and build/deployment pipelines.
Strong troubleshooting skills coupled with making data-driven decisions during incidents, to improve time to detect and resolve issues.
Strong understanding of cloud computing platforms (Azure or Google Cloud) and cloud-native setups (AKS, serverless, etc.).

💡 Responsibilities

Contribute significantly to the reliability, scalability and availability of Bright Horizons' digital infrastructure by enforcing best practices of redundancy and resiliency across applications and infrastructure.
Implement robust infrastructure, application and digital-experience monitoring in our enterprise-wide APM tool Dynatrace. Proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
Drive troubleshooting of critical incidents through developing a deep and broad understanding of our enterprise architecture across all 7 OSI layers.
Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance the overall efficiency of the Product, Engineering and SRE teams.
Own the Observability tools and create a roadmap to expand and consolidate them. The roadmap should provide a 360-degree view of cross-functional areas like SRE, DevOps, Application Support, Monitoring, Incident Management, Infrastructure and Enterprise Architecture.
Collaborate with the above cross-functional teams to drive a unified approach to site reliability that optimizes their work and improves time-to-market for all respective objectives.
Work closely with Infrastructure and Architecture teams to design and implement roadmaps for scaling server and serverless architecture using containers and IaC tools such as Ansible and Terraform.

🪄 Skills: AWS, Python, Bash, Cloud Computing, Kubernetes, Azure, CI/CD, DevOps, Terraform, Ansible, Scripting, Software Engineering

🔥 Principal Site Reliability Engineer

Posted about 2 months ago

📍 United States

🧭 Full-Time

💸 225,900 - 319,000 USD per year

🔍 Software Development

🏢 Company: HashiCorp | 👥 1001-5000 | 💰 Secondary Market about 4 years ago | 🫂 Last layoff almost 2 years ago | Private Cloud DevOps Information Technology Cyber Security Software Cloud Infrastructure

🔧 Requirements

Has worked on a team of SREs or engineers to improve reliability
Loves data, analyzing data, and helping teams draw conclusions from data
Is an effective communicator, collaborator, and influencer with both engineering and product teams
Is able to identify pragmatic and ideal solutions blending customer feedback with technical foresight
Is a technical leader capable of crafting long-term strategy and vision

💡 Responsibilities

Drive an operational excellence mindset throughout our teams
Collaborate with existing SREs across our SaaS products and look for ways to unify reliability and operational behavior and features across SaaS and on-premises products
Work with product teams to help drive reliability and operational excellence requirements to help continuously improve our products
Help develop educational materials and shared tooling to level up reliability and operational thinking across our product teams
Participate in feature partnerships and reviews to ensure compliance with reliability and operational requirements
Participate in crucial decision-making related to various reliability programs

🪄 Skills: AWS, Backend Development, Leadership, Cloud Computing, Data Analysis, Java, Kubernetes, Zabbix, Go, Grafana, Prometheus, Communication Skills, Analytical Skills, Collaboration, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Data visualization, Strategic thinking, Data modeling, NodeJS, Scripting, Software Engineering, SaaS

🔥 Principal Site Reliability Engineer

Posted 4 months ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Jama Software | 👥 251-500 | 💰 $200,000,000 Private almost 7 years ago | Developer Platform Manufacturing Enterprise Software Collaboration Software

🔧 Requirements

5+ years experience in AWS components like EC2, CloudFormation
5+ years experience operating fault-tolerant, scalable applications
4+ years of systems engineering/administration experience with web applications
Scripting and automation skills with languages like Python

💡 Responsibilities

Architect, build, and maintain highly available systems using AWS
Use Terraform for infrastructure as code
Document designs and implementations
Partner with SRE and Engineering teams for reliability best practices

🪄 Skills: AWS, Docker, PostgreSQL, Python, Bash, Java, MySQL, Nginx, REST API, Tomcat, Terraform
