Staff Site Reliability Engineer

Posted 2 days ago

💎 Seniority level: Staff, 8+ years

💸 Salary: 145,000 - 175,000 USD per year

🔍 Industry: Entertainment

🗣️ Languages: English

⏳ Experience: 8+ years

Requirements:
  • Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
  • 8+ years' experience in Information Technology, with 5+ years in desktop/end user systems engineering and administration.
  • Comfortable with IT security and compliance best practices.
  • Ability to build effective cross-functional relationships.
  • Experience with automated workstation build methodologies, software packaging, and deployment systems.
  • System administration experience with Windows, Linux, and macOS.
  • Familiarity with automation tools and scripting languages such as Ansible, Bash, Perl, Python, and PowerShell.
  • Experience with Intune, Autopilot, SCCM, Nexthink, Active Directory, Jamf, and M365.
  • Knowledge of AWS WorkSpaces, VMware Horizon Cloud, Citrix Workspace, Microsoft W365, and/or Microsoft Azure AD.
  • Proficient in implementing and supporting MDM tools.
Responsibilities:
  • Design and operate global workplace solutions used by NBCUniversal employees and partners.
  • Manage device lifecycle and health analytics for corporate and personal devices.
  • Vet vendor solutions and execute initiatives as a technology expert.
  • Research new technologies and beta test products.
  • Document and train engineers and operations technicians on new processes.
  • Collaborate across multiple teams including Engineering, Operations, Network, and Security.

Related Jobs

💸 159,408 - 236,160 USD per year

🔍 Financial technology

🏢 Company: Stash; 👥 1-10; 💰 Seed over 9 years ago; Medical, Information Technology, Health Care

  • 8+ years of experience in site reliability engineering or a similar role.
  • Strong expertise in Kubernetes (K8s) and Amazon EKS.
  • Advanced skills in AWS, including setup, management, and optimization.
  • Proficiency in infrastructure as code, particularly Terraform and Terraform Cloud.
  • Solid programming skills in Python and/or Go.
  • Experience with system monitoring tools like Datadog and familiarity with logging and archiving practices.
  • Extensive experience with GitHub Actions for CI/CD pipelines.
  • Proven track record in designing and managing microservice architectures using Docker and containers.
  • Practical experience with Kafka.
  • Deep understanding of SLOs, SLIs, and SLAs, and their application in maintaining system reliability.
  • Experience working in PCI and other regulated environments.
  • Design, develop, and maintain scalable and resilient cloud infrastructure using AWS.
  • Implement and oversee monitoring systems to ensure optimal performance and rapid response to issues.
  • Automate deployment pipelines and manage CI/CD processes using tools like GitHub Actions.
  • Make high-impact architectural decisions to improve system efficiency and reduce downtime.
  • Collaborate with engineering teams to innovate and enhance deployment and operational capabilities.
  • Develop and manage microservices architectures using Docker and containerization technologies.
Posted 3 days ago

๐Ÿ“ Germany, Italy, Netherlands, Portugal, Romania, Spain, UK

๐Ÿ” Corporate wellness

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 22 days ago

๐Ÿ“ Brazil

๐Ÿ” Corporate wellness

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues.

AWS, Python, Kafka, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 22 days ago

๐Ÿ“ Brazil

๐Ÿ” Corporate wellness

๐Ÿข Company: Wellhub

  • Proven technical experience with AWS cloud services and Kubernetes.
  • Deep knowledge of Kubernetes and related ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese.
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and cloud-native automation.
  • Build tools for engineering teams to manage their cloud resources autonomously.
  • Ensure security and compliance by delivering secure products and implementing DevSecOps.
  • Improve observability, reliability, and cost awareness.
  • Support other engineering teams in using products and tools.
  • Build and maintain CI/CD tools and services.
  • Maintain highly available and reliable Kubernetes clusters.
  • Contribute to product documentation.
  • Participate in defining standards, guidelines and best practices.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 23 days ago

๐Ÿ“ Canada

๐Ÿงญ Full-Time

๐Ÿ” Observability and data management

๐Ÿข Company: Cribl๐Ÿ‘ฅ 251-500๐Ÿ’ฐ $150,000,000 Series D over 2 years agoReal TimeBig DataInformation TechnologySoftware

  • Extensive experience with enterprise-scale continuous delivery environments.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible.
  • Knowledge of cloud platforms (prefer AWS and Azure, GCP is nice to have) and container + orchestration technologies.
  • Extensive experience designing and implementing observability platforms based on open-source tools like Grafana, Prometheus, and OpenSearch.
  • Experience mentoring engineers and acting as Subject Matter Expert in areas of Monitoring and Observability.
  • Experience with native monitoring services in AWS, Azure and other popular Cloud Platforms.
  • Background in Linux Systems Engineering.
  • Experience with Incident response tools, e.g., PagerDuty, FireHydrant.
  • Experience with sustainable incident response in a blameless environment.
  • Comfortable with a high level of autonomy and working with a distributed team.
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
  • Design observability systems for different types of applications, using Cribl products and other open-source tools.
  • Seek out the cause of errors and instability in production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Lead efforts enabling shift-left monitoring in the organization.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

AWS, Docker, Node.js, GCP, JavaScript, TypeScript, Azure, Grafana, Prometheus, Linux, Terraform

Posted about 1 month ago

🧭 Full-Time

🔍 Observability and data management

  • Extensive experience with enterprise-scale continuous delivery environments.
  • Proficiency in JavaScript/Node.js/TypeScript development within Linux/Mac environments.
  • Experience with Configuration Management Tools like Terraform, Puppet, Chef, or Ansible.
  • Knowledge of cloud platforms, primarily AWS and Azure, with GCP being a bonus.
  • Extensive experience in designing and implementing observability platforms using open-source tools like Grafana and Prometheus.
  • Experience mentoring engineers and serving as a Subject Matter Expert in Monitoring and Observability.
  • Familiarity with native monitoring services in AWS, Azure, and other cloud platforms.
  • Background in Linux Systems Engineering.
  • Experience with incident response tools such as PagerDuty or FireHydrant.
  • Comfortable working autonomously in a distributed team environment.
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems focusing on availability, latency, and overall system health.
  • Design observability systems for various applications using Cribl products and open-source tools.
  • Identify the causes of errors and instabilities in production cloud services and drive improvements.
  • Work with product and platform teams to enhance systems for better reliability and resilience.
  • Lead the efforts for shift-left monitoring and reduce operational toil through innovation.
  • Participate in on-call responsibilities.
Posted about 1 month ago

🧭 Full-Time

💸 180,000 - 240,000 USD per year

🔍 IT and Security

  • Extensive experience with enterprise-scale continuous delivery environments
  • 8+ years of experience with a DevOps or SRE job title
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
  • Background in Linux Systems Engineering
  • Experience with incident-response tools like PagerDuty, FireHydrant, Blameless, etc.
  • Comfortable with a high level of autonomy and working with a distributed team
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities
Posted about 2 months ago

๐Ÿ“ Virginia, USA

๐Ÿงญ Full-Time

๐Ÿ’ธ 136500.0 - 195000.0 USD per year

๐Ÿ” Cybersecurity, Cloud Security

๐Ÿข Company: Zscaler

  • Over 5 years of Site Reliability Engineering experience in both Operations and Engineering environments.
  • Extensive experience with High/Moderate FedRAMP authorization levels and monthly monitoring, including vulnerability scanning, evaluation, patching, and reporting.
  • Proficiency in Linux administration, network troubleshooting, and automation tools like Ansible and Terraform for infrastructure as code.
  • Skilled in Python coding, with knowledge of container-based architectures (AWS ECS, Kubernetes), virtualization, cloud services, web security, and networking protocols (HTTP, SSL/TLS, DNS, SQL).
  • Oversee operational tasks for FedRAMP cloud products, including deployments, on-call duties, and incident management.
  • Participate in regular deployment sync meetings and operational hand-offs.
  • Manage all cloud infrastructure components such as AWS GovCloud, private cloud environments, containers, and VMs.
  • Develop operations documentation, handle escalations, and implement measures to prevent recurring incidents while contributing to DevOps best practices.

AWS, Python, Kubernetes, Linux, Terraform, Ansible

Posted about 2 months ago

๐Ÿ“ Poland

๐Ÿข Company: neptune.ai๐Ÿ‘ฅ 51-100๐Ÿ’ฐ $8,000,000 Series A almost 3 years agoInternetArtificial Intelligence (AI)AnalyticsInformation TechnologySoftware

  • 6+ years in SRE, DevOps, or related roles.
  • Strong experience managing and optimizing Kubernetes clusters.
  • Proven expertise in designing and implementing automation solutions, including Terraform and Helm.
  • Strong programming skills in Shell and Python.
  • Extensive experience with Linux system administration and network management.
  • Expertise in managing distributed computing systems.
  • Fluency in English with solid communication skills.
  • Own the site reliability process and systems through design, implementation, deployment, and maintenance.
  • Ensure scalability, resilience, and performance of solutions across SaaS and client-hosted environments.
  • Design and implement automation workflows to streamline operations.
  • Ensure security and compliance of infrastructure and processes.
  • Collaborate with cross-functional teams on requirements and solutions.
  • Document architecture and operational procedures.
  • Participate in on-call rotations for incident management.

Python, ElasticSearch, GCP, JVM, Kafka, Kotlin, Kubernetes, Microsoft Azure, MySQL, Azure, Clickhouse, Redis, Rust, Communication Skills, Collaboration, CI/CD, Linux, DevOps, Terraform, Documentation, Compliance

Posted 3 months ago

๐Ÿ“ USA

๐Ÿงญ Full-Time

๐Ÿ’ธ 211650 - 249000 USD per year

๐Ÿ” Cryptocurrency and blockchain technology

๐Ÿข Company: Coinbase Careers Page๐Ÿ‘ฅ 1000-5000

  • 7+ years of experience in software engineering.
  • Experience in designing, building, scaling, and maintaining production services.
  • Ability to write high-quality, well-tested code.
  • Passion for open financial systems.
  • Strong technical skills for system design and coding.
  • Excellent written and verbal communication skills.
  • Strong skills in observability, debugging, and performance tuning.
  • Strong interpersonal skills for collaboration with engineers of all levels.
  • Demonstrated critical thinking skills under pressure.
  • Willingness to understand and improve any layer of the stack.
  • On-call availability for issue resolution.
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Build automation and improve systems to eliminate toil and operations work.
  • Collaborate with core infrastructure team for performance tuning and optimization of cloud deployments.
  • Work with product teams to reduce service disruptions and automate incident responses.
  • Proactively find and analyze reliability issues, implementing software solutions for improvements.
  • Educate and mentor the engineering team on reliability as a core value.
  • Write high-quality, well-tested code.
  • Debug complex technical problems and enhance system deployability.
  • Review feature designs across the company.
  • Ensure security, operational integrity, and architectural clarity of designs.
  • Integrate with third-party vendors through pipelines.
  • Participate in on-call support for urgent issues.

Blockchain, Communication Skills, Software Engineering, Debugging

Posted 4 months ago

Related Articles

Posted 5 months ago

Insights into the evolving landscape of remote work in 2024 reveal the importance of certifications and continuous learning. This article breaks down emerging trends and sought-after certifications, and provides practical solutions for enhancing your employability and expertise. What skills will be essential for remote job seekers, and how can you navigate this dynamic market to secure your dream role?

Posted 6 months ago

Explore the challenges and strategies of maintaining work-life balance while working remotely. Learn about unique aspects of remote work, associated challenges, historical context, and effective strategies to separate work and personal life.

Posted 6 months ago

Google is gearing up to expand its remote job listings, promising more opportunities across various departments and regions. Find out how this move can benefit job seekers and impact the market.

Posted 6 months ago

Learn about the importance of pre-onboarding preparation for remote employees, including checklist creation, documentation, tools and equipment setup, communication plans, and feedback strategies. Discover how proactive pre-onboarding can enhance job performance, increase retention rates, and foster a sense of belonging from day one.

Posted 6 months ago

The article explores the current statistics for remote work in 2024, covering the percentage of the global workforce working remotely, growth trends, popular industries and job roles, geographic distribution of remote workers, demographic trends, work models comparison, job satisfaction, and productivity insights.