Staff Site Reliability Engineer

Posted about 1 month ago

💎 Seniority level: Staff, 8+ years

💸 Salary: 159,408 - 236,160 USD per year

🔍 Industry: Software Development

🏢 Company: Stash | 👥 1-10 | 💰 Seed over 9 years ago | Medical, Information Technology, Health Care

🗣️ Languages: English

⏳ Experience: 8+ years

Requirements:

8+ years of experience in site reliability engineering or a similar role.

Strong expertise in Kubernetes (K8s) and Amazon EKS.

Advanced skills in AWS, including setup, management, and optimization.

Proficiency in infrastructure as code, particularly Terraform and Terraform Cloud.

Solid programming skills in Python and/or Go.

Experience with system monitoring tools like Datadog and familiarity with logging and archiving practices.

Extensive experience with GitHub Actions for CI/CD pipelines.

Proven track record in designing and managing microservice architectures using Docker and containers.

Practical experience with Kafka.

Deep understanding of SLOs, SLIs, and SLAs, and their application in maintaining system reliability.

Experience working in PCI and other regulated environments.

Responsibilities:

Design, develop, and maintain scalable and resilient cloud infrastructure using AWS.

Implement and oversee monitoring systems to ensure optimal performance and rapid response to issues.

Automate deployment pipelines and manage CI/CD processes using tools like GitHub Actions.

Make high-impact architectural decisions to improve system efficiency and reduce downtime.

Collaborate with engineering teams to innovate and enhance deployment and operational capabilities.

Develop and manage microservices architectures using Docker and containerization technologies.

Related Jobs

🔥 Staff Site Reliability Engineer

Posted about 1 month ago

📍 United States, Canada

🧭 Full-Time

💸 145,000 - 175,000 USD per year

🔍 Information Technology

🔧 Requirements

8+ years experience in Information Technology
5+ years in desktop systems engineering and administration
Proficient in Ansible, Bash, Perl, Python, and PowerShell
Experience with Intune, Autopilot, SCCM, and Active Directory
Familiarity with AWS WorkSpaces and Azure AD

💡 Responsibilities

Vet vendor solutions
Design and execute initiatives
Research and document new technologies
Provide training on new products and processes

AWS, Python, Android, Bash, Ansible

🔥 Staff Site Reliability Engineer

Posted about 2 months ago

📍 Germany, Italy, Netherlands, Portugal, Romania, Spain, UK

🔍 Corporate wellness

🔧 Requirements

Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
Deep knowledge of Kubernetes and its ecosystem.
Solid knowledge of observability systems.
Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
Ability to write software for production environments.
Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
Collaboration and learning-driven mindset.
CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
AWS Certifications.
Excellent communication skills in both English and Portuguese, both verbally and in writing.

💡 Responsibilities

Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
Improve observability, reliability, and cost awareness.
Support engineering teams in using the products and tools.
Build and maintain a modern CI/CD set of tools and services.
Keep all the Kubernetes clusters highly available and reliable.
Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
Live the mission: inspire and empower others by genuinely caring for your own well-being and your colleagues.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

🔥 Staff Site Reliability Engineer

Posted about 2 months ago

📍 Brazil

🔍 Corporate wellness

🏢 Company: Wellhub

🔧 Requirements

Proven technical experience with AWS cloud services and Kubernetes.
Deep knowledge of Kubernetes and related ecosystem.
Solid knowledge of observability systems.
Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
Ability to write software for production environments.
Excellent analytical and problem-solving skills.
Collaboration and learning-driven mindset.
CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
AWS Certifications.
Excellent communication skills in both English and Portuguese.

💡 Responsibilities

Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
Develop and evolve Kubernetes operators and cloud-native automation.
Build tools for engineering teams to manage their cloud resources autonomously.
Ensure security and compliance by delivering secure products and implementing DevSecOps.
Improve observability, reliability, and cost awareness.
Support other engineering teams in using products and tools.
Build and maintain CI/CD tools and services.
Maintain highly available and reliable Kubernetes clusters.
Contribute to product documentation.
Participate in defining standards, guidelines and best practices.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

🔥 Sr Staff Site Reliability Engineer (SRE), Cloud

Posted 3 months ago

📍 Canada

🧭 Full-Time

🔍 Observability and data management

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D almost 3 years ago | Real Time, Big Data, Information Technology, Software

🔧 Requirements

Extensive experience with enterprise-scale continuous delivery environments.
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible.
Knowledge of cloud platforms (AWS and Azure preferred; GCP is nice to have) and container and orchestration technologies.
Extensive experience designing and implementing observability platforms based on open-source tools such as Grafana, Prometheus, and OpenSearch.
Experience mentoring engineers and acting as a subject matter expert in monitoring and observability.
Experience with native monitoring services in AWS, Azure, and other popular cloud platforms.
Background in Linux Systems Engineering.
Experience with incident response tools, e.g. PagerDuty, FireHydrant.
Experience with sustainable incident response in a blameless environment.
Comfortable with a high level of autonomy and working with a distributed team.

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle.
Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
Design observability systems for different types of applications, using Cribl products and other open-source tools.
Seek out the cause of errors and instability in production cloud services and drive teams towards better operational excellence.
Engage with product and platform teams to evolve systems by lobbying for changes that improve reliability, resilience, and observability.
Lead efforts enabling shift-left monitoring in the organization.
Help identify and drive down toil with creative innovation and automation.
On-call responsibilities.

AWS, Docker, Node.js, GCP, JavaScript, TypeScript, Azure, Grafana, Prometheus, Linux, Terraform

🔥 Staff Site Reliability Engineer, PaaS

Posted 5 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Algolia | 👥 501-1000 | 💰 $150,000,000 Series D over 3 years ago | Semantic Search, Search Engine, Cloud Computing, Vertical Search

🔧 Requirements

Strong knowledge of Golang and Python
Experience designing API Management and Kubernetes architecture
Experience with distributed systems
Experience with CI/CD setup and architecture
Knowledge of Public Cloud Providers (GCP, AWS, Azure)
Excellent communication and organization skills

💡 Responsibilities

Design and deploy a cloud-native API management layer
Spearhead the design of a robust CI/CD toolchain
Lead development of observability standards
Drive the evolution of a Kubernetes-based architecture
Provide guidance and mentorship to SRE team members
Establish and enforce engineering processes
Collaborate with senior leadership on cloud infrastructure

AWS, Python, GCP, Kubernetes, Microsoft Azure, CI/CD

Related Articles

Why remote work is such a nice opportunity?

Posted 10 days ago

Why is remote work so nice? Let's try to see!

Remote Job Certifications and Courses to Boost Your Career

Posted 7 months ago

Insights into the evolving landscape of remote work in 2024 reveal the importance of certifications and continuous learning. This article breaks down emerging trends and sought-after certifications, and provides practical solutions for enhancing your employability and expertise. What skills will be essential for remote job seekers, and how can you navigate this dynamic market to secure your dream role?

How to Balance Work and Life While Working Remotely

Posted 7 months ago

Explore the challenges and strategies of maintaining work-life balance while working remotely. Learn about unique aspects of remote work, associated challenges, historical context, and effective strategies to separate work and personal life.

Weekly Digest: Remote Jobs News and Trends (August 11 - August 18, 2024)

Posted 7 months ago

Google is gearing up to expand its remote job listings, promising more opportunities across various departments and regions. Find out how this move can benefit job seekers and impact the market.

How to Onboard Remote Employees Successfully

Posted 7 months ago

Learn about the importance of pre-onboarding preparation for remote employees, including checklist creation, documentation, tools and equipment setup, communication plans, and feedback strategies. Discover how proactive pre-onboarding can enhance job performance, increase retention rates, and foster a sense of belonging from day one.
