Staff Site Reliability Engineer

Posted 2024-11-09

💎 Seniority level: Staff

📍 Location: CA, CO, CT, FL, GA, HI, IL, IN, IA, MD, MA, MI, MO, NJ, NM, NY, NC, OH, PA, TN, TX, UT, VA, WA

💸 Salary: 135520 - 178060 USD per year

🔍 Industry: Non-profit mental health support

🏢 Company: Crisis Text Line

🗣️ Languages: English

⏳ Experience: Proven experience as a Staff SRE or in a similar SRE role.

🪄 Skills: AWS, Docker, GraphQL, PHP, Python, GCP, Kubernetes, Azure, Data Structures, Go, Next.js, Communication Skills, Collaboration, CI/CD, DevOps, Terraform, Compliance

Requirements:
  • Bachelor's degree in Computer Science, Engineering, or related field; Master’s preferred.
  • Proven experience as a Staff SRE or in a similar role.
  • Experience maintaining the reliability of online SaaS/PaaS services.
  • Proficiency in AWS and infrastructure as code (Terraform, CloudFormation).
  • Strong scripting skills (Python) and knowledge of containerization (Docker, Kubernetes).
  • Experience with CI/CD pipelines (GitHub Actions) and observability tools (Datadog).
  • Understanding of network protocols and security principles.
Responsibilities:
  • Helping to lead and mentor a team of 5 SREs.
  • Designing, implementing, and maintaining AWS infrastructure.
  • Collaborating with developers for performance optimization.
  • Developing monitoring, logging, and alerting systems.
  • Automating repetitive tasks to improve efficiency.
  • Responding to incidents to minimize downtime.
  • Supporting diversity on the engineering team.
  • Communicating expectations and progress clearly.
  • Providing mentorship and promoting technical best practices.
  • Participating in retrospectives to improve processes.
  • Conducting regular security audits.

Related Jobs

📍 USA

🧭 Full-Time

💸 211650 - 249000 USD per year

🔍 Cryptocurrency and blockchain technology

🏢 Company: Coinbase

Requirements:
  • 7+ years of experience in software engineering.
  • Experience in designing, building, scaling, and maintaining production services.
  • Ability to write high-quality, well-tested code.
  • Passion for open financial systems.
  • Strong technical skills for system design and coding.
  • Excellent written and verbal communication skills.
  • Strong skills in observability, debugging, and performance tuning.
  • Strong interpersonal skills for collaboration with engineers of all levels.
  • Demonstrated critical thinking skills under pressure.
  • Willingness to understand and improve any layer of the stack.
  • On-call availability for issue resolution.

Responsibilities:
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Build automation and improve systems to eliminate toil and operations work.
  • Collaborate with the core infrastructure team on performance tuning and optimization of cloud deployments.
  • Work with product teams to reduce service disruptions and automate incident responses.
  • Proactively find and analyze reliability issues, implementing software solutions for improvements.
  • Educate and mentor the engineering team on reliability as a core value.
  • Write high-quality, well-tested code.
  • Debug complex technical problems and enhance system deployability.
  • Review feature designs across the company.
  • Ensure security, operational integrity, and architectural clarity of designs.
  • Integrate with third-party vendors through pipelines.
  • Participate in on-call support for urgent issues.

🪄 Skills: Blockchain, Communication Skills

Posted 2024-10-16

📍 United States of America

🧭 Full-Time

💸 $176,400 - $201,600 per year

🔍 Family history and personal DNA testing

Requirements:
  • 7+ years of experience in site reliability.
  • 5+ years software development experience.
  • 7+ years cloud automation experience using Go, Python, Bash.
  • 5+ years debugging Node.js, Java, and a variety of DB technologies.
  • 5+ years of experience working with AWS Cloud, including services, CLI, SDKs, and AWS Console.
  • 7+ years using cloud APM and logging tools, such as New Relic, Prometheus, and AWS monitoring.
  • 5+ years experience in auto scaling, resilience, fault tolerance, AWS infrastructure, cloud networking, and container management.
  • 5+ years experience analyzing production within a cloud environment.
  • 5+ years of Terraform or CloudFormation experience for infrastructure management with CI/CD pipelines.

Responsibilities:
  • Own site reliability for a product vertical in collaboration with engineering.
  • Define SLOs, SLIs, and error budgets, and ensure they remain in compliance with standards.
  • Develop improved monitoring, auto scaling and resiliency patterns and capabilities.
  • Debug complex issues across multiple services in AWS, including externally facing infrastructure.
  • Collaborate on and develop cloud automation and new best practices in support of the vertical and the organization.
  • Train, mentor, and support engineers in AWS, infrastructure, and cloud best practices.
  • Serve as a member of the Site Reliability Engineering team, which reports to the Site Reliability and Performance organization.

🪄 Skills: AWS, Node.js, Python, Software Development, Bash, Java, Go, Prometheus, Collaboration, CI/CD

Posted 2024-09-20