Senior Site Reliability Engineer (Platform)

Posted 7 months ago

💎 Seniority level: Senior, 5+ years

📍 Location: USA

💸 Salary: 180,625 - 212,000 USD per year

🔍 Industry: Cryptocurrency and Financial Technology

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: Communication Skills, Software Engineering, Debugging

Improve observability, reliability and availability by defining and measuring key metrics.
Build automation and improve systems to eliminate toil and operations work.
Collaborate with core infrastructure team on performance tuning and optimization.
Work with product teams to reduce service disruptions and automate incident response.
Proactively analyze reliability issues and implement improvements.
Educate and mentor engineering team on reliability practices.
Write high quality, well tested code.
Debug complex technical problems.
Review feature designs for reliability.
Ensure security, safety, scale, and operational integrity.
Build integration pipelines with third-party vendors.
Participate in on-call support for urgent issues.

🧭 Full-Time

🔧 Requirements

At least 5 years of software engineering experience.
Strong understanding of data structures and algorithms related to performance and reliability.
Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
Strong skills around observability, debugging, and performance tuning.
Ability to debug complex systems and willingness to understand and improve any layer of the stack.
Experience with containers and container orchestration (Docker, ECS, EKS) and with monitoring tools (Datadog, Graphite, Grafana, Prometheus).
Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
Strong communication skills and ability to explain technical concepts clearly.
Demonstrated critical thinking under pressure.

💡 Responsibilities

Build automation and improve systems to eliminate toil and operations work.
Improve observability, reliability, and availability by defining and measuring key metrics.
Collaborate with the core infrastructure team on performance tuning and optimization of cloud deployments.
Collaborate with product teams to reduce service disruptions and automate incident response.
Proactively find and analyze reliability problems and design software for improvements.
Facilitate incident response, conduct root cause analysis, and run blameless retrospectives.
Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform
