Experience writing efficient code in programming languages such as Python, Golang, Java, or Rust.
Experience in developing software applications and tools from scratch.
Experience in designing and managing complex systems with a focus on cost, performance, scalability, and resilience.
Expertise in operating Linux-based systems, including troubleshooting and monitoring.
Experience managing large-scale infrastructure on bare metal, public and private clouds (AWS, GCP, Azure), and container platforms (Kubernetes, Docker).
Understanding of various protocols across the stack (HTTP, DNS, DHCP, etc.).
Experience with Infrastructure as Code (IaC) tools such as Terraform or Pulumi, and configuration management tools (Ansible, Puppet, Chef).
Experience with one or more CI/CD solutions (Jenkins, Argo CD, etc.).
Experience implementing monitoring and logging solutions (Prometheus, Grafana, etc.).
Past experience leading a team is a plus.
Strong communication skills.
Responsibilities:
Design, build, and refactor software components to enhance availability, resilience, performance, and efficiency.
Participate in on-call rotation to respond to infrastructure incidents.
Proactively address infrastructure bugs and bottlenecks.
Define appropriate SLIs and SLOs based on system needs.
Reduce noisy alerts and improve incident management processes.
Identify and resolve design bottlenecks.
Mentor new hires on tools and infrastructure.
Address code complexity and software bugs.
Support team members with code issues and participate in code reviews.