Staff Site Reliability Engineer (SRE)

Posted about 2 months ago

💎 Seniority level: Staff, 8+ years

💸 Salary: 180,000 - 240,000 USD per year

🔍 Industry: IT and Security

⏳ Experience: 8+ years

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments
  • 8+ years of experience with a DevOps or SRE job title
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
  • Background in Linux Systems Engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless
  • Comfortable with a high level of autonomy and working with a distributed team
Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities
Related Jobs

📍 Canada

🧭 Full-Time

🔍 Observability and data management

🏢 Company: Cribl

👥 251-500

💰 $150,000,000 Series D over 2 years ago

Real Time, Big Data, Information Technology, Software

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible.
  • Knowledge of cloud platforms (AWS and Azure preferred; GCP is nice to have) and container and orchestration technologies.
  • Extensive experience designing and implementing observability platforms based on open-source tools like Grafana, Prometheus, and OpenSearch.
  • Experience mentoring engineers and acting as a Subject Matter Expert in monitoring and observability.
  • Experience with native monitoring services in AWS, Azure, and other popular cloud platforms.
  • Background in Linux Systems Engineering.
  • Experience with incident response tools, e.g., PagerDuty, FireHydrant.
  • Experience with sustainable incident response in a blameless environment.
  • Comfortable with a high level of autonomy and working with a distributed team.
Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
  • Design observability systems for different types of applications, using Cribl products and other open-source tools.
  • Seek out the cause of errors and instability in production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Lead efforts enabling shift-left monitoring in the organization.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

AWS, Docker, Node.js, GCP, JavaScript, TypeScript, Azure, Grafana, Prometheus, Linux, Terraform

Posted about 1 month ago

🧭 Full-Time

🔍 Observability and data management

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments.
  • Proficiency in JavaScript/Node.js/TypeScript development within Linux/Mac environments.
  • Experience with configuration management tools like Terraform, Puppet, Chef, or Ansible.
  • Knowledge of cloud platforms, primarily AWS and Azure, with GCP being a bonus.
  • Extensive experience in designing and implementing observability platforms using open-source tools like Grafana and Prometheus.
  • Experience mentoring engineers and serving as a Subject Matter Expert in monitoring and observability.
  • Familiarity with native monitoring services in AWS, Azure, and other cloud platforms.
  • Background in Linux Systems Engineering.
  • Experience with incident response tools such as PagerDuty or FireHydrant.
  • Comfortable working autonomously in a distributed team environment.
Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems focusing on availability, latency, and overall system health.
  • Design observability systems for various applications using Cribl products and open-source tools.
  • Identify the causes of errors and instabilities in production cloud services and drive improvements.
  • Work with product and platform teams to enhance systems for better reliability and resilience.
  • Lead the efforts for shift-left monitoring and reduce operational toil through innovation.
  • Participate in on-call responsibilities.
Posted about 1 month ago

🧭 Full-Time

💸 152,000 - 230,500 USD per year

🔍 IT and Security

🏢 Company: Cribl

👥 251-500

💰 $150,000,000 Series D over 2 years ago

Real Time, Big Data, Information Technology, Software

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments.
  • 5+ years of experience with a DevOps or SRE job title.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible.
  • Experience with sustainable incident response in a blameless environment.
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies.
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry.
  • Background in Linux Systems Engineering.
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless.
  • Comfortable with a high level of autonomy and working with a distributed team.
Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, DevOps

Posted 6 months ago
