Senior Site Reliability Engineer - Midnight
Posted about 14 hours ago
💎 Seniority level: Senior, 7+ years
📍 Location: United States
🔍 Industry: Blockchain
🏢 Company: IO Global
⏳ Experience: 7+ years
🪄 Skills: AWS, Docker, Python, Agile, Blockchain, Cloud Computing, JavaScript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting
Requirements:
- 7+ years of experience in SRE, DevOps, or a related role.
- Understanding of SRE best practices, architectures, and methods.
- Good knowledge of resiliency patterns and cloud security.
- Strong programming proficiency in Python, Golang, or JavaScript.
- Demonstrated experience with AWS and modern cloud architectures.
- Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD.
- Hands-on experience with Kubernetes/EKS and GitOps methodologies.
- Proven track record with monitoring tools such as Prometheus and OpenTelemetry, plus familiarity with the LGTM stack or comparable tools.
- Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
- Ability to engage in technical discussions and take part in the decision-making process.
- Strong problem-solving skills and the ability to work on complex systems.
- Experience working in an Agile environment.
- Experience working with a distributed team.
- Strong communication and collaboration abilities to work seamlessly across different teams.
- A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.
Responsibilities:
- Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
- Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
- Leverage GitOps principles to automate deployments and manage container orchestration.
- Implement and manage CI/CD pipelines that ensure seamless, high-quality deployments: find and remove bottlenecks, improve performance, and work alongside teams to refine feedback loops and automate toil away.
- Develop automation tools and scripts to improve operational efficiency.
- Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
- Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
- Collaborate with dev teams to define and implement SLOs/SLIs.
- Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
- Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
- Evaluate and adopt new technologies to keep our systems at the cutting edge. Blockchain experience is a special advantage.
- Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
- Balance effective delivery of goals with a measurably high standard for the results. Always apply a layer of polish and due diligence when delivering.