Staff Site Reliability Engineer (SRE)

Posted about 2 months ago

💎 Seniority level: Staff, 8+ years

💸 Salary: 180,000 - 240,000 USD per year

🔍 Industry: IT and Security

⏳ Experience: 8+ years

Requirements:

Extensive experience with enterprise-scale continuous delivery environments

8+ years of experience with a DevOps or SRE job title

Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment

Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible

Experience with sustainable incident response in a blameless environment

Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies

Experience with APM and Observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.

Background in Linux Systems Engineering

Experience with incident response tools like PagerDuty, FireHydrant, Blameless, etc.

Comfortable with a high level of autonomy and working with a distributed team

Responsibilities:

Engage with teams and improve service delivery and reliability across their entire lifecycle

Measure and monitor all production systems with an eye towards availability, latency and overall system health

Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence

Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability

Help identify and drive down toil with creative innovation and automation

On-call responsibilities


Related Jobs


🔥 Sr Staff Site Reliability Engineer (SRE), Cloud

Posted about 1 month ago

📍 Canada

🧭 Full-Time

🔍 Observability and data management

🏢 Company: Cribl

👥 251-500

💰 $150,000,000 Series D over 2 years ago

Real Time, Big Data, Information Technology, Software

🔧 Requirements

Extensive experience with enterprise-scale continuous delivery environments.
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible.
Knowledge of cloud platforms (prefer AWS and Azure, GCP is nice to have) and container + orchestration technologies.
Extensive experience designing and implementing Observability platforms based on OpenSource tools like Grafana, Prometheus, OpenSearch.
Experience mentoring engineers and acting as Subject Matter Expert in areas of Monitoring and Observability.
Experience with native monitoring services in AWS, Azure and other popular Cloud Platforms.
Background in Linux Systems Engineering.
Experience with Incident response tools, e.g., PagerDuty, FireHydrant.
Experience with sustainable incident response in a blameless environment.
Comfortable with a high level of autonomy and working with a distributed team.

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle.
Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
Design observability systems for different types of applications, using Cribl products and other OpenSource tools.
Seek out the cause of errors and instability in production cloud services and drive teams towards better operational excellence.
Engage with product and platform teams to evolve systems by lobbying for changes that improve reliability, resilience, and observability.
Lead efforts enabling shift-left monitoring in the organization.
Help identify and drive down toil with creative innovation and automation.
On-call responsibilities.

AWS, Docker, Node.js, GCP, JavaScript, TypeScript, Azure, Grafana, Prometheus, Linux, Terraform


🔥 Sr Staff Site Reliability Engineer (SRE), Cloud

Posted about 1 month ago

🧭 Full-Time

🔍 Observability and data management

🔧 Requirements

Extensive experience with enterprise-scale continuous delivery environments.
Proficiency in JavaScript/Node.js/TypeScript development within Linux/Mac environments.
Experience with Configuration Management Tools like Terraform, Puppet, Chef, or Ansible.
Knowledge of cloud platforms, primarily AWS and Azure, with GCP being a bonus.
Extensive experience in designing and implementing observability platforms using OpenSource tools like Grafana and Prometheus.
Experience mentoring engineers and serving as a Subject Matter Expert in Monitoring and Observability.
Familiarity with native monitoring services in AWS, Azure, and other cloud platforms.
Background in Linux Systems Engineering.
Experience with incident response tools such as PagerDuty or FireHydrant.
Comfortable working autonomously in a distributed team environment.

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle.
Measure and monitor all production systems focusing on availability, latency, and overall system health.
Design observability systems for various applications using Cribl products and OpenSource tools.
Identify the causes of errors and instabilities in production cloud services and drive improvements.
Work with product and platform teams to enhance systems for better reliability and resilience.
Lead the efforts for shift-left monitoring and reduce operational toil through innovation.
Participate in on-call responsibilities.


🔥 Staff Site Reliability Engineer (SRE)

Posted 6 months ago

🧭 Full-Time

💸 152,000 - 230,500 USD per year

🔍 IT and Security

🏢 Company: Cribl

👥 251-500

💰 $150,000,000 Series D over 2 years ago

Real Time, Big Data, Information Technology, Software

🔧 Requirements

Extensive experience with enterprise-scale continuous delivery environments.
5+ years of experience with a DevOps or SRE job title.
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible.
Experience with sustainable incident response in a blameless environment.
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies.
Experience with APM and Observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
Background in Linux Systems Engineering.
Experience with incident response tools such as PagerDuty, FireHydrant, Blameless, etc.
Comfortable with a high level of autonomy and working with a distributed team.

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle.
Measure and monitor all production systems with an eye towards availability, latency and overall system health.
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence.
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability.
Help identify and drive down toil with creative innovation and automation.
On-call responsibilities.

AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, DevOps


