Staff Site Reliability Engineer

Posted 3 days ago

💎 Seniority level: Staff, 8+ years

💸 Salary: 159,408 - 236,160 USD per year

🔍 Industry: Financial technology

🏢 Company: Stash | 👥 1-10 | 💰 Seed over 9 years ago | Medical, Information Technology, Health Care

⏳ Experience: 8+ years

Requirements:
  • 8+ years of experience in site reliability engineering or a similar role.
  • Strong expertise in Kubernetes (K8s) and Amazon EKS.
  • Advanced skills in AWS, including setup, management, and optimization.
  • Proficiency in infrastructure as code, particularly Terraform and Terraform Cloud.
  • Solid programming skills in Python and/or Go.
  • Experience with system monitoring tools like Datadog and familiarity with logging and archiving practices.
  • Extensive experience with GitHub Actions for CI/CD pipelines.
  • Proven track record in designing and managing microservice architectures using Docker and containers.
  • Practical experience with Kafka.
  • Deep understanding of SLOs, SLIs, and SLAs, and their application in maintaining system reliability.
  • Experience working in PCI and other regulated environments.
Responsibilities:
  • Design, develop, and maintain scalable and resilient cloud infrastructure using AWS.
  • Implement and oversee monitoring systems to ensure optimal performance and rapid response to issues.
  • Automate deployment pipelines and manage CI/CD processes using tools like GitHub Actions.
  • Make high-impact architectural decisions to improve system efficiency and reduce downtime.
  • Collaborate with engineering teams to innovate and enhance deployment and operational capabilities.
  • Develop and manage microservices architectures using Docker and containerization technologies.

Related Jobs

🧭 Full-Time

💸 145,000 - 175,000 USD per year

🔍 Entertainment

  • Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).
  • 8+ years’ experience in Information Technology, with 5+ years in desktop/end user systems engineering and administration.
  • Comfortable with IT security and compliance best practices.
  • Ability to build effective cross-functional relationships.
  • Experience with automated workstation build methodologies, software packaging, and deployment systems.
  • System administration experience with Windows, Linux, and macOS.
  • Familiarity with automation and scripting languages such as Ansible, Bash, Perl, Python, and PowerShell.
  • Experience with Intune, AutoPilot, SCCM, Nexthink, Active Directory, Jamf, and M365.
  • Knowledge of AWS Workspace, VMWare Horizon Cloud, Citrix Workspace, Microsoft W365, and/or Microsoft AzureAD.
  • Proficient in implementing and supporting MDM tools.
  • Design and operate global workplace solutions used by NBCUniversal employees and partners.
  • Manage device lifecycle and health analytics for corporate and personal devices.
  • Vet vendor solutions and execute initiatives as a technology expert.
  • Research new technologies and beta test products.
  • Document and train engineers and operations technicians on new processes.
  • Collaborate across multiple teams including Engineering, Operations, Network, and Security.
Posted 2 days ago

📍 USA

🧭 Full-Time

💸 147,000 - 289,000 USD per year

🔍 Software and data

🏢 Company: MongoDB | 👥 1001-5000 | 💰 Post-IPO Equity almost 7 years ago | Database, Open Source, Cloud Computing, SaaS, Software

  • 10+ years of experience working on software and operating distributed systems.
  • Deep expertise in networking fundamentals and understanding of the internet, including TCP/IP, DNS, TLS/mTLS, BGP.
  • Familiarity with modern cloud-based infrastructure, network design primitives of AWS, Azure, or GCP.
  • Strong knowledge of service mesh and load-balancing concepts.
  • Participate in the development of a reliable and resilient multi-cloud globally-connected network for MongoDB's services.
  • Collaborate with service-owning teams to address technical issues and provide guidance on best practices for service-to-service connectivity.
  • Participate in a 24/7 on-call rotation to resolve issues related to network architecture, ensuring high availability.
Posted 13 days ago

📍 Germany, Italy, Netherlands, Portugal, Romania, Spain, UK

🔍 Corporate wellness

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 22 days ago

📍 Brazil

🔍 Corporate wellness

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues.

AWS, Python, Kafka, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 22 days ago

📍 Brazil

🔍 Corporate wellness

🏢 Company: Wellhub

  • Proven technical experience with AWS cloud services and Kubernetes.
  • Deep knowledge of Kubernetes and related ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese.
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and cloud-native automation.
  • Build tools for engineering teams to manage their cloud resources autonomously.
  • Ensure security and compliance by delivering secure products and implementing DevSecOps.
  • Improve observability, reliability, and cost awareness.
  • Support other engineering teams in using products and tools.
  • Build and maintain CI/CD tools and services.
  • Maintain highly available and reliable Kubernetes clusters.
  • Contribute to product documentation.
  • Participate in defining standards, guidelines and best practices.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 22 days ago

📍 Canada

🧭 Full-Time

🔍 Observability and data management

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D over 2 years ago | Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise-scale continuous delivery environments.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible.
  • Knowledge of cloud platforms (prefer AWS and Azure, GCP is nice to have) and container + orchestration technologies.
  • Extensive experience designing and implementing Observability platforms based on OpenSource tools like Grafana, Prometheus, OpenSearch.
  • Experience mentoring engineers and acting as Subject Matter Expert in areas of Monitoring and Observability.
  • Experience with native monitoring services in AWS, Azure and other popular Cloud Platforms.
  • Background in Linux Systems Engineering.
  • Experience with Incident response tools, e.g., PagerDuty, FireHydrant.
  • Experience with sustainable incident response in a blameless environment.
  • Comfortable with a high level of autonomy and working with a distributed team.
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
  • Design observability systems for different types of applications, using Cribl products and other OpenSource tools.
  • Seek out the cause of errors and instability in production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Lead efforts enabling shift-left monitoring in the organization.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

AWS, Docker, Node.js, GCP, JavaScript, TypeScript, Azure, Grafana, Prometheus, Linux, Terraform

Posted about 1 month ago

🧭 Full-Time

🔍 Observability and data management

  • Extensive experience with enterprise-scale continuous delivery environments.
  • Proficiency in JavaScript/Node.js/TypeScript development within Linux/Mac environments.
  • Experience with Configuration Management Tools like Terraform, Puppet, Chef, or Ansible.
  • Knowledge of cloud platforms, primarily AWS and Azure, with GCP being a bonus.
  • Extensive experience in designing and implementing observability platforms using OpenSource tools like Grafana and Prometheus.
  • Experience mentoring engineers and serving as a Subject Matter Expert in Monitoring and Observability.
  • Familiarity with native monitoring services in AWS, Azure, and other cloud platforms.
  • Background in Linux Systems Engineering.
  • Experience with incident response tools such as PagerDuty or FireHydrant.
  • Comfortable working autonomously in a distributed team environment.
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems focusing on availability, latency, and overall system health.
  • Design observability systems for various applications using Cribl products and OpenSource tools.
  • Identify the causes of errors and instabilities in production cloud services and drive improvements.
  • Work with product and platform teams to enhance systems for better reliability and resilience.
  • Lead the efforts for shift-left monitoring and reduce operational toil through innovation.
  • Participate in on-call responsibilities.
Posted about 1 month ago

🧭 Full-Time

💸 180,000 - 240,000 USD per year

🔍 IT and Security

  • Extensive experience with enterprise scale continuous delivery environments
  • 8+ years of experience with a DevOps or SRE job title
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
  • Experience with APM and Observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
  • Background in Linux Systems Engineering
  • Experience with incident response tools like PagerDuty, FireHydrant, Blameless, etc.
  • Comfortable with a high level of autonomy and working with a distributed team
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities
Posted about 2 months ago

📍 Virginia, USA

🧭 Full-Time

💸 136,500 - 195,000 USD per year

🔍 Cybersecurity, Cloud Security

🏢 Company: Zscaler

  • Over 5 years of Site Reliability Engineering experience in both Operations and Engineering environments.
  • Extensive experience with High/Moderate FedRAMP authorization levels and monthly monitoring, including vulnerability scanning, evaluation, patching, and reporting.
  • Proficiency in Linux administration, network troubleshooting, and automation tools like Ansible and Terraform for infrastructure as code.
  • Skilled in Python coding, with knowledge of container-based architectures (AWS ECS, Kubernetes), virtualization, cloud services, web security, and networking protocols (HTTP, SSL/TLS, DNS, SQL).
  • Oversee operational tasks for FedRAMP cloud products, including deployments, on-call duties, and incident management.
  • Participate in regular deployment sync meetings and operational hand-offs.
  • Manage all cloud infrastructure components such as AWS GovCloud, private cloud environments, containers, and VMs.
  • Develop operations documentation, handle escalations, and implement measures to prevent recurring incidents while contributing to DevOps best practices.

AWS, Python, Kubernetes, Linux, Terraform, Ansible

Posted about 2 months ago

📍 Poland

🏢 Company: neptune.ai | 👥 51-100 | 💰 $8,000,000 Series A almost 3 years ago | Internet, Artificial Intelligence (AI), Analytics, Information Technology, Software

  • 6+ years in SRE, DevOps, or related roles.
  • Strong experience managing and optimizing Kubernetes clusters.
  • Proven expertise in designing and implementing automation solutions, including Terraform and Helm.
  • Strong programming skills in Shell and Python.
  • Extensive experience with Linux system administration and network management.
  • Expertise in managing distributed computing systems.
  • Fluency in English with solid communication skills.
  • Own the site reliability process and systems through design, implementation, deployment, and maintenance.
  • Ensure scalability, resilience, and performance of solutions across SaaS and client-hosted environments.
  • Design and implement automation workflows to streamline operations.
  • Ensure security and compliance of infrastructure and processes.
  • Collaborate with cross-functional teams on requirements and solutions.
  • Document architecture and operational procedures.
  • Participate in on-call rotations for incident management.

Python, ElasticSearch, GCP, JVM, Kafka, Kotlin, Kubernetes, Microsoft Azure, MySQL, Azure, Clickhouse, Redis, Rust, Communication Skills, Collaboration, CI/CD, Linux, DevOps, Terraform, Documentation, Compliance

Posted 3 months ago