
Senior Site Reliability Engineer

Posted about 2 months ago


💎 Seniority level: Senior

📍 Location: United States

💸 Salary: 127,000 - 249,000 USD per year

🔍 Industry: Database and Cloud Services

🏢 Company: MongoDB

👥 1001-5000

💰 Post-IPO Equity almost 7 years ago

Database, Open Source, Cloud Computing, SaaS, Software

🗣️ Languages: English

🪄 Skills: Linux

Requirements:
  • Experience running a mission-critical service at scale.
  • Understanding of information security issues.
  • Prior experience with critical production systems in a Linux environment.
  • Proficiency in at least one modern programming language, beyond basic scripting.
  • Solid understanding of web and network protocols and standards (HTTP, TLS, DNS, etc.).
  • Bachelor’s degree in Computer Science or equivalent experience.
  • Experience writing automation tools and eagerness to automate.
Responsibilities:
  • Design and build the infrastructure for a global cloud service that comprises hundreds of thousands of MongoDB clusters.
  • Implement and troubleshoot automation and monitoring of global services spanning several cloud providers.
  • Optimize infrastructure performance from application level to firmware.
  • Participate in a weekly on-call rotation.
  • Improve infrastructure capabilities, focusing on cost, simplicity, and maintainability.

Related Jobs


📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

Requirements:
  • 5+ years of experience with production environments running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.

Responsibilities:
  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

🪄 Skills: Linux

Posted 2 months ago

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

Requirements:
  • At least 5 years of software engineering experience.
  • Strong understanding of data structures and algorithms related to performance and reliability.
  • Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
  • Strong skills around observability, debugging, and performance tuning.
  • Ability to debug complex systems and willingness to understand and improve any layer of the stack.
  • Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (DataDog, Graphite, Grafana, Prometheus).
  • Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
  • Strong communication skills and ability to explain technical concepts clearly.
  • Demonstrated critical thinking under pressure.

Responsibilities:
  • Build automation and improve systems to eliminate toil and operations work.
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Collaborate with the core infrastructure team to tune the performance of cloud deployments and optimize them.
  • Collaborate with product teams to reduce service disruptions and automate incident response.
  • Proactively find and analyze reliability problems and design software for improvements.
  • Facilitate incident response and conduct root cause analyses and blameless retrospectives.
  • Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

🪄 Skills: Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform

Posted 2 months ago