📍 Canada
🔍 Software Development
🏢 Company: Jobgether (Internet)
👥 11-50 employees
💰 $1,493,585 Seed round, almost 2 years ago
Requirements:
- 4+ years of experience in Site Reliability Engineering or a similar role with a strong focus on cloud infrastructure.
- Expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt.
- Deep knowledge of AWS cloud services and best practices for scalable and secure architectures.
- Hands-on experience with Confluent Cloud and Kafka for distributed data streaming.
- Strong experience with Redis for caching and RDS for data storage.
- Proficiency with OpenSearch, Elasticsearch, or ChaosSearch for search and analytics.
- Advanced knowledge of monitoring and alerting tools such as Prometheus, Grafana, Alertmanager, and Opsgenie.
- Experience with LaunchDarkly for feature flag management.
- Extensive experience managing Kubernetes clusters, including Helm for package management, ArgoCD for deployments, and Istio for service mesh configurations.
- Familiarity with Kustomize for Kubernetes resource configuration.
- Strong problem-solving skills and ability to troubleshoot complex systems in production environments.
- Excellent communication and collaboration skills within agile teams.
Responsibilities:
- Design, build, and maintain highly scalable cloud infrastructure using Terraform and Terragrunt for automated resource provisioning.
- Manage and optimize AWS cloud environments, ensuring security, cost efficiency, and high availability.
- Oversee data streaming platforms using Confluent Cloud and Kafka, ensuring reliable data pipelines.
- Deploy and manage Redis instances for caching and real-time data processing.
- Implement and maintain monitoring and alerting solutions using Prometheus, Grafana, Alertmanager, and Opsgenie.
- Enable feature flag management and controlled rollouts with LaunchDarkly.
- Manage Kubernetes clusters, utilizing Helm, ArgoCD, Istio, and Kustomize for continuous deployment and infrastructure-as-code practices.
- Collaborate with development teams to integrate new services into the infrastructure seamlessly.
- Troubleshoot complex system issues to maintain high availability and performance.
- Continuously improve automation tools, processes, and methodologies to enhance system scalability.
Skills: AWS, Amazon RDS, Kafka, Kubernetes, Grafana, Prometheus, Redis, CI/CD, Problem Solving, Terraform
Posted 7 days ago