Staff Site Reliability Engineer

Posted about 1 month ago

💎 Seniority level: Staff, 8+ years

💸 Salary: 159,408 - 236,160 USD per year

🔍 Industry: Software Development

🏢 Company: Stash | 👥 1-10 | 💰 Seed, over 9 years ago | Medical, Information Technology, Health Care

🗣️ Languages: English

⏳ Experience: 8+ years

Requirements:
  • 8+ years of experience in site reliability engineering or a similar role.
  • Strong expertise in Kubernetes (K8s) and Amazon EKS.
  • Advanced skills in AWS, including setup, management, and optimization.
  • Proficiency in infrastructure as code, particularly Terraform and Terraform Cloud.
  • Solid programming skills in Python and/or Go.
  • Experience with system monitoring tools like Datadog and familiarity with logging and archiving practices.
  • Extensive experience with GitHub Actions for CI/CD pipelines.
  • Proven track record in designing and managing microservice architectures using Docker and containers.
  • Practical experience with Kafka.
  • Deep understanding of SLOs, SLIs, and SLAs, and their application in maintaining system reliability.
  • Experience working in PCI and other regulated environments.
Responsibilities:
  • Design, develop, and maintain scalable and resilient cloud infrastructure using AWS.
  • Implement and oversee monitoring systems to ensure optimal performance and rapid response to issues.
  • Automate deployment pipelines and manage CI/CD processes using tools like GitHub Actions.
  • Make high-impact architectural decisions to improve system efficiency and reduce downtime.
  • Collaborate with engineering teams to innovate and enhance deployment and operational capabilities.
  • Develop and manage microservices architectures using Docker and containerization technologies.

Related Jobs

🔥 Staff Site Reliability Engineer
Posted about 1 month ago

📍 United States, Canada

🧭 Full-Time

💸 145,000 - 175,000 USD per year

🔍 Information Technology

Requirements:
  • 8+ years of experience in Information Technology
  • 5+ years in desktop systems engineering and administration
  • Proficient in Ansible, Bash, Perl, Python, and PowerShell
  • Experience with Intune, AutoPilot, SCCM, and Active Directory
  • Familiarity with AWS Workspace and Azure AD
Responsibilities:
  • Vet vendor solutions
  • Design and execute initiatives
  • Research and document new technologies
  • Provide training on new products and processes

AWS, Python, Android, Bash, Ansible

🔥 Staff Site Reliability Engineer
Posted about 2 months ago

📍 Germany, Italy, Netherlands, Portugal, Romania, Spain, UK

🔍 Corporate wellness

Requirements:
  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.
Responsibilities:
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern set of CI/CD tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
  • Participate in the definition of standards, RFCs (Requests for Comments), guidelines, and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

🔥 Staff Site Reliability Engineer
Posted about 2 months ago

📍 Brazil

🔍 Corporate wellness

🏢 Company: Wellhub

Requirements:
  • Proven technical experience with AWS cloud services and Kubernetes.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese.
Responsibilities:
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and cloud-native automation.
  • Build tools for engineering teams to manage their cloud resources autonomously.
  • Ensure security and compliance by delivering secure products and implementing DevSecOps.
  • Improve observability, reliability, and cost awareness.
  • Support other engineering teams in using the products and tools.
  • Build and maintain CI/CD tools and services.
  • Maintain highly available and reliable Kubernetes clusters.
  • Contribute to product documentation.
  • Participate in defining standards, guidelines, and best practices.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD


📍 Canada

🧭 Full-Time

🔍 Observability and data management

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D, almost 3 years ago | Real Time, Big Data, Information Technology, Software

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with configuration management tools like Terraform (preferred) or Puppet, Chef, Ansible.
  • Knowledge of cloud platforms (AWS and Azure preferred; GCP is nice to have) and container and orchestration technologies.
  • Extensive experience designing and implementing observability platforms based on open-source tools like Grafana, Prometheus, OpenSearch.
  • Experience mentoring engineers and acting as a subject matter expert in monitoring and observability.
  • Experience with native monitoring services in AWS, Azure, and other popular cloud platforms.
  • Background in Linux systems engineering.
  • Experience with incident response tools, e.g., PagerDuty, FireHydrant.
  • Experience with sustainable incident response in a blameless environment.
  • Comfortable with a high level of autonomy and working with a distributed team.
Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
  • Design observability systems for different types of applications, using Cribl products and other open-source tools.
  • Seek out the cause of errors and instability in production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Lead efforts enabling shift-left monitoring in the organization.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

AWS, Docker, Node.js, GCP, JavaScript, TypeScript, Azure, Grafana, Prometheus, Linux, Terraform

Posted 3 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Algolia | 👥 501-1000 | 💰 $150,000,000 Series D, over 3 years ago | Semantic Search, Search Engine, Cloud Computing, Vertical Search

Requirements:
  • Strong knowledge of Golang and Python
  • Experience designing API management and Kubernetes architecture
  • Experience with distributed systems
  • Experience with CI/CD setup and architecture
  • Knowledge of public cloud providers (GCP, AWS, Azure)
  • Excellent communication and organization skills
Responsibilities:
  • Design and deploy cloud-native API management
  • Spearhead the design of a robust CI/CD toolchain
  • Lead development of observability standards
  • Drive the evolution of a Kubernetes-based architecture
  • Provide guidance and mentorship to SRE team members
  • Establish and enforce engineering processes
  • Collaborate with senior leadership on cloud infrastructure

AWS, Python, GCP, Kubernetes, Microsoft Azure, CI/CD

Posted 5 months ago

Related Articles

Posted 10 days ago

Why remote work is such a nice opportunity?

Why is remote work so nice? Let's try to see!

Posted 7 months ago

Insights into the evolving landscape of remote work in 2024 reveal the importance of certifications and continuous learning. This article breaks down emerging trends, sought-after certifications, and provides practical solutions for enhancing your employability and expertise. What skills will be essential for remote job seekers, and how can you navigate this dynamic market to secure your dream role?

Posted 7 months ago

Explore the challenges and strategies of maintaining work-life balance while working remotely. Learn about unique aspects of remote work, associated challenges, historical context, and effective strategies to separate work and personal life.

Posted 7 months ago

Google is gearing up to expand its remote job listings, promising more opportunities across various departments and regions. Find out how this move can benefit job seekers and impact the market.

Posted 7 months ago

Learn about the importance of pre-onboarding preparation for remote employees, including checklist creation, documentation, tools and equipment setup, communication plans, and feedback strategies. Discover how proactive pre-onboarding can enhance job performance, increase retention rates, and foster a sense of belonging from day one.