Senior Site Reliability Engineer (Remote - US)

Posted about 1 month ago

💎 Seniority level: Senior, 5+ years

📍 Location: United States

🔍 Industry: Software Development

🏢 Company: Jobgether | 👥 11-50 | 💰 $1,493,585 Seed over 2 years ago | Internet

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, CI/CD, Terraform, Networking, Scripting, Confluence

Minimum of 5 years of experience in SRE, DevOps, or Infrastructure Engineering, demonstrating strong ownership and problem-solving skills.
Proficiency in Kubernetes, Helm, and networking security practices.
In-depth experience with AWS services such as RDS, Aurora, VPC, EKS, EC2, and IAM.
Expertise in PostgreSQL administration, including performance tuning and high availability management within AWS.
Familiarity with CI/CD tools like GitHub Actions and ArgoCD, with a focus on automation and security best practices.
Strong understanding of, and hands-on experience with, Infrastructure as Code (IaC) using Crossplane and Terraform.
Experience in observability and monitoring with Datadog.
Proficiency in Python and Bash scripting for system automation and management.
Strong communication skills and the ability to collaborate effectively across engineering teams and document processes in Confluence.

Own initiatives related to system reliability and scalability, identifying potential issues and implementing proactive solutions to prevent them.
Participate in on-call rotations, responding to incidents, performing root cause analysis, and driving long-term fixes.
Design, deploy, and manage Kubernetes clusters, utilizing tools like Helm charts, Cilium, and Karpenter to optimize both performance and cost.
Architect and maintain AWS infrastructure, focusing on RDS/Aurora PostgreSQL, networking, and scaling best practices.
Automate infrastructure provisioning using tools like Crossplane and Terraform to maintain consistency and scalability.
Enhance observability by improving Datadog-based monitoring, and drive proactive detection and resolution of system issues.
Conduct post-incident reviews and document lessons learned, driving improvements into long-term system practices.