
Senior Site Reliability Engineer - Midnight

Posted about 14 hours ago


💎 Seniority level: Senior, 7+ years

📍 Location: United States

🔍 Industry: Blockchain

🏢 Company: IO Global

⏳ Experience: 7+ years

🪄 Skills: AWS, Docker, Python, Agile, Blockchain, Cloud Computing, JavaScript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

Requirements:
  • 7+ years of experience in SRE, DevOps, or a related role.
  • Understanding of SRE best practices, architectures, and methods.
  • Good knowledge of resiliency patterns and cloud security.
  • Strong programming proficiency in Python, Golang, or JavaScript.
  • Demonstrated experience with AWS and modern cloud architectures.
  • Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and Argo CD.
  • Hands-on experience with Kubernetes/EKS and GitOps methodologies.
  • Proven track record with monitoring tools such as Prometheus and OpenTelemetry, plus familiarity with the LGTM stack (Loki, Grafana, Tempo, Mimir) or comparable tools.
  • Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
  • Ability to engage in technical discussions and take part in the decision-making process.
  • Strong problem-solving skills and the capability to work on complex systems.
  • Experience working in an Agile environment.
  • Experience working with a distributed team.
  • Strong communication and collaboration abilities to work seamlessly across different teams.
  • A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.
Responsibilities:
  • Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
  • Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
  • Leverage GitOps principles to automate deployments and manage container orchestration.
  • Implement and manage CI/CD pipelines that deliver seamless, high-quality deployments: find and remove bottlenecks, improve performance, and work alongside teams to refine feedback loops and automate away toil.
  • Develop automation tools and scripts to improve operational efficiency.
  • Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
  • Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
  • Collaborate with development teams to define and implement SLOs/SLIs.
  • Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
  • Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
  • Evaluate and adopt new technologies to keep our systems at the cutting edge; blockchain experience is a distinct advantage.
  • Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
  • Balance effective delivery of goals against a measurably high standard for the results. Always apply a layer of polish and due diligence when delivering.