Senior Site Reliability Engineer (Remote - US)

Posted about 1 month ago

💎 Seniority level: Senior, 5+ years

📍 Location: United States

🔍 Industry: Software Development

🏢 Company: Jobgether 👥 11-50 💰 $1,493,585 Seed over 2 years ago, Internet

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, CI/CD, Terraform, Networking, Scripting, Confluence

Requirements:
  • Minimum of 5 years of experience in SRE, DevOps, or Infrastructure Engineering, demonstrating strong ownership and problem-solving skills.
  • Proficiency in Kubernetes, Helm, and networking security practices.
  • In-depth experience with AWS services such as RDS, Aurora, VPC, EKS, EC2, and IAM.
  • Expertise in PostgreSQL administration, including performance tuning and high availability management within AWS.
  • Familiarity with CI/CD tools like GitHub Actions and ArgoCD, with a focus on automation and security best practices.
  • Strong understanding of, and hands-on experience with, Infrastructure as Code (IaC) using Crossplane and Terraform.
  • Experience in observability and monitoring with Datadog.
  • Proficiency in Python and Bash scripting for system automation and management.
  • Strong communication skills and the ability to collaborate effectively across engineering teams and document processes in Confluence.
Responsibilities:
  • Own initiatives related to system reliability and scalability, identifying potential issues and implementing proactive solutions to prevent them.
  • Participate in on-call rotations, responding to incidents, performing root cause analysis, and driving long-term fixes.
  • Design, deploy, and manage Kubernetes clusters, utilizing tools like Helm charts, Cilium, and Karpenter to optimize both performance and cost.
  • Architect and maintain AWS infrastructure, focusing on RDS/Aurora PostgreSQL, networking, and scaling best practices.
  • Automate infrastructure provisioning using tools like Crossplane and Terraform to maintain consistency and scalability.
  • Enhance observability by improving Datadog-based monitoring, and drive proactive detection and resolution of system issues.
  • Conduct post-incident reviews and document lessons learned, driving improvements into long-term system practices.