Staff Site Reliability Engineer

Posted 2024-10-18

💎 Seniority level: Staff

💸 Salary: 120,000 - 135,000 USD per year

🔍 Industry: Cloud Computing

🏢 Company: Vultr

🗣️ Languages: English

⏳ Experience: 3+ years in SRE role, 2+ years with Kubernetes, 2+ years with Grafana

Requirements:
  • 3+ years of experience in a hands-on Site Reliability Engineer role delivering distributed architectures.
  • 2+ years of experience maintaining Kubernetes clusters in highly available environments.
  • 2+ years of hands-on experience with the modern Grafana stack, including Mimir, Loki, and Tempo.
  • Comfortable working with complex CI/CD pipelines (GitLab/Jenkins), configuration management (Puppet/Salt), and IaC solutions such as Terraform.
  • Experience with observability pipelines or OpenTelemetry is a plus.
  • Strong background in performance optimization for web stacks and programming experience in Python, Golang, or PHP.
Responsibilities:
  • Collaborate with cross-functional teams to craft and implement a modern observability stack and refine incident-handling processes.
  • Design and contribute to cloud solutions for high-performance computing and AI workloads.
  • Enhance system resilience through thoughtful software improvements and automation.
  • Develop documentation for junior SREs to manage recurring issues confidently.
  • Identify scalable solutions for technical challenges and innovate within the stack.

Related Jobs

📍 Poland

🏢 Company: neptune.ai

Requirements:
  • 6+ years in SRE, DevOps, or related roles.
  • Strong experience managing and optimizing Kubernetes clusters.
  • Proven expertise in designing and implementing automation solutions, including Terraform and Helm.
  • Strong programming skills in Shell and Python.
  • Extensive experience with Linux system administration and network management.
  • Expertise in managing distributed computing systems.
  • Fluency in English with solid communication skills.

Responsibilities:
  • Own the site reliability process and systems through design, implementation, deployment, and maintenance.
  • Ensure scalability, resilience, and performance of solutions across SaaS and client-hosted environments.
  • Design and implement automation workflows to streamline operations.
  • Ensure security and compliance of infrastructure and processes.
  • Collaborate with cross-functional teams on requirements and solutions.
  • Document architecture and operational procedures.
  • Participate in on-call rotations for incident management.

Python, Elasticsearch, GCP, JVM, Kafka, Kotlin, Kubernetes, Microsoft Azure, MySQL, Clickhouse, Redis, Rust, Communication Skills, Collaboration, CI/CD, Linux, DevOps, Terraform, Documentation, Compliance

Posted 2024-11-16

🧭 Full-Time

🔍 Cannabis industry

🏢 Company: Weedmaps

Requirements:
  • 10+ years of Site Reliability/DevOps and/or Software Engineering experience, including 3+ years of SaaS architectural experience.
  • Deep technical knowledge spanning system architecture, frontend, backend, mobile, and DevOps.
  • Proven ability to give effective presentations to management, executives, and cross-functional teams.
  • Broad knowledge of distributed systems design and performance testing patterns for microservices.
  • Experience identifying operational toil and producing automations to remedy toil.

Responsibilities:
  • Collaborate with leadership and engineering teams to design and build technical solutions across the Weedmaps ecosystem.
  • Work in a service-oriented architecture spanning multiple domains and technology stacks.
  • Drive the critical path toward completing major company initiatives.
  • Mentor and train engineers across Weedmaps Engineering.
  • Develop architectures that are secure, scalable, robust, modular, API-centric, and global.
  • Write proposals with backing documentation including business case, architecture/solution diagrams, and development timeline.
Posted 2024-11-16

📍 CA, CO, CT, FL, GA, HI, IL, IN, IA, MD, MA, MI, MO, NJ, NM, NY, NC, OH, PA, TN, TX, UT, VA, WA

🧭 Full-Time

💸 135,520 - 178,060 USD per year

🔍 Non-profit mental health support

🏢 Company: Crisis Text Line

Requirements:
  • Bachelor's degree in Computer Science, Engineering, or related field; Master's preferred.
  • Proven experience as a Staff SRE or in a similar role.
  • Experience maintaining the reliability of online SaaS/PaaS services.
  • Proficiency in AWS and infrastructure as code (Terraform, CloudFormation).
  • Strong scripting skills (Python) and knowledge of containerization (Docker, Kubernetes).
  • Experience in CI/CD pipelines and observability tools (GitHub Actions, Datadog).
  • Understanding of network protocols and security principles.

Responsibilities:
  • Helping lead and mentor a team of 5 SREs.
  • Designing, implementing, and maintaining AWS infrastructure.
  • Collaborating with developers for performance optimization.
  • Developing monitoring, logging, and alerting systems.
  • Automating repetitive tasks to improve efficiency.
  • Responding to incidents to minimize downtime.
  • Supporting diversity on the engineering team.
  • Communicating expectations and progress clearly.
  • Providing mentorship and promoting technical best practices.
  • Participating in retrospectives to improve processes.
  • Conducting regular security audits.

AWS, Docker, GraphQL, PHP, Python, GCP, Kubernetes, Azure, Data Structures, Go, Next.js, Communication Skills, Collaboration, CI/CD, DevOps, Terraform, Compliance

Posted 2024-11-09

📍 USA

🧭 Full-Time

💸 211,650 - 249,000 USD per year

🔍 Cryptocurrency and blockchain technology

🏢 Company: Coinbase

Requirements:
  • 7+ years of experience in software engineering.
  • Experience in designing, building, scaling, and maintaining production services.
  • Ability to write high-quality, well-tested code.
  • Passion for open financial systems.
  • Strong technical skills for system design and coding.
  • Excellent written and verbal communication skills.
  • Strong skills in observability, debugging, and performance tuning.
  • Strong interpersonal skills for collaboration with engineers of all levels.
  • Demonstrated critical thinking skills under pressure.
  • Willingness to understand and improve any layer of the stack.
  • On-call availability for issue resolution.

Responsibilities:
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Build automation and improve systems to eliminate toil and operations work.
  • Collaborate with core infrastructure team for performance tuning and optimization of cloud deployments.
  • Work with product teams to reduce service disruptions and automate incident responses.
  • Proactively find and analyze reliability issues, implementing software solutions for improvements.
  • Educate and mentor the engineering team on reliability as a core value.
  • Write high-quality, well-tested code.
  • Debug complex technical problems and enhance system deployability.
  • Review feature designs across the company.
  • Ensure security, operational integrity, and architectural clarity of designs.
  • Integrate with third-party vendors through pipelines.
  • Participate in on-call support for urgent issues.

Blockchain, Communication Skills

Posted 2024-10-16

📍 Poland

🔍 IT and Security

🏢 Company: Cribl

👥 251-500 employees

💰 $150.0m Series D on 2022-05-24

Real Time, Big Data, Information Technology, Software

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with sustainable incident response in a blameless environment.
  • Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible.
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies.
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry.
  • Background in Linux Systems Engineering.
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless.
  • Comfortable with a high level of autonomy and working with a distributed team.

Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health.
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, Linux, Terraform

Posted 2024-10-03

📍 United States of America

🧭 Full-Time

💸 $176,400 - $201,600 per year

🔍 Family history and personal DNA testing

Requirements:
  • 7+ years of experience in site reliability.
  • 5+ years software development experience.
  • 7+ years cloud automation experience using Go, Python, Bash.
  • 5+ years debugging Node.js, Java, and a variety of DB technologies.
  • 5+ years of experience working with AWS Cloud, including services, CLI, SDKs, and AWS Console.
  • 7+ years using cloud APM and logging tools, such as New Relic, Prometheus, and AWS monitoring.
  • 5+ years experience in auto scaling, resilience, fault tolerance, AWS infrastructure, cloud networking, and container management.
  • 5+ years experience analyzing production within a cloud environment.
  • 5+ years of Terraform or CloudFormation experience for infrastructure management with CI/CD pipelines.

Responsibilities:
  • Own site reliability for a product vertical in collaboration with engineering.
  • Define and ensure SLO / SLI and error budgets remain in compliance with standards.
  • Develop improved monitoring, auto scaling and resiliency patterns and capabilities.
  • Debug complex issues across multiple services in AWS, including outfacing infrastructure.
  • Collaborate on and develop cloud automation and new best practices in support of the vertical and the organization.
  • Train, mentor, and support others in AWS, infrastructure, and cloud best practices.
  • Serve as a member of the Site Reliability Engineering team, which reports to the Site Reliability and Performance organization.

AWS, Node.js, Python, Software Development, Bash, Java, Go, Prometheus, Collaboration, CI/CD

Posted 2024-09-20

🧭 Full-Time

💸 144,000 - 278,000 USD per year

🔍 IT and Security

🏢 Company: Cribl

👥 251-500 employees

💰 $150.0m Series D on 2022-05-24

Real Time, Big Data, Information Technology, Software

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments.
  • 8+ years of experience with a DevOps or SRE job title.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible.
  • Experience with sustainable incident response in a blameless environment.
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies.
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry.
  • Background in Linux Systems Engineering.
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless.
  • Comfortable with a high level of autonomy and working with a distributed team.

Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health.
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, Linux, DevOps, Terraform

Posted 2024-08-28

🧭 Full-Time

💸 152,000 - 230,500 USD per year

🔍 Data observability

🏢 Company: Cribl

👥 251-500 employees

💰 $150.0m Series D on 2022-05-24

Real Time, Big Data, Information Technology, Software

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments.
  • 5+ years of experience with a DevOps or SRE job title.
  • Development experience with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with Configuration Management Tools like Terraform, Puppet, Chef, or Ansible.
  • Experience with sustainable incident response in a blameless environment.
  • Knowledge of cloud platforms, preferably AWS, and container orchestration technologies.
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, and Grafana/Kibana.
  • Background in Linux Systems Engineering.
  • Experience with incident response tools like PagerDuty and FireHydrant.
  • Comfortable working autonomously and with a distributed team.

Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with a focus on availability, latency, and overall system health.
  • Seek out the causes of errors and instability in production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to lobby for changes that improve reliability, resilience, and observability.
  • Identify and drive down toil with creative innovation and automation.
  • Participate in on-call responsibilities.

AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, DevOps

Posted 2024-08-10