Site Reliability Engineer

Posted over 1 year ago

💸 Salary: $140,000 to $165,000 USD

🔍 Industry: Education technology

🗣️ Languages: English

🪄 Skills: AWS

Requirements:
  • Expertise in building and maintaining infrastructure on AWS.
  • Understanding of event-driven architecture.
  • Experience with AWS Lambda and Kinesis.
  • Ability to analyze and synthesize data.
  • Strong attention to detail.
  • Excellent communication skills.
  • Ability to work iteratively and multitask.

Responsibilities:
  • Own the infrastructure for recommender systems and application data services.
  • Ensure SLAs are met.
  • Build systems for end-to-end delivery of data and functionality.
  • Propose and implement features to improve products.
  • Evangelize data and ML capabilities.

Related Jobs

📍 Germany, Italy, Netherlands, Portugal, Romania, Spain, UK

🔍 Corporate wellness

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.

  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and your colleagues.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 7 days ago

📍 Brazil

🔍 Corporate wellness

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.

  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and your colleagues.

AWS, Python, Kafka, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 7 days ago

📍 Brazil

🔍 Corporate wellness

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.

  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation.
  • Participate in the definition of standards, RFCs, guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and your colleagues.

AWS, Python, Kubernetes, Go, Grafana, Prometheus, CI/CD

Posted 7 days ago

📍 Portugal, Brazil

🔍 Wellness

🏢 Company: Wellhub

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.

  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to product documentation.
  • Participate in the definition of standards, RFCs, guidelines, and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and your colleagues.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 7 days ago

📍 Brazil

🔍 Corporate wellness

🏢 Company: Wellhub

  • Proven technical experience with AWS cloud services and Kubernetes.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese.

  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and cloud-native automation.
  • Build tools for engineering teams to manage their cloud resources autonomously.
  • Ensure security and compliance by delivering secure products and implementing DevSecOps.
  • Improve observability, reliability, and cost awareness.
  • Support other engineering teams in product and tools usage.
  • Build and maintain CI/CD tools and services.
  • Maintain highly available and reliable Kubernetes clusters.
  • Contribute to product documentation.
  • Participate in defining standards, guidelines and best practices.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 7 days ago

📍 Japan

🧭 Full-Time

🔍 FinTech

🏢 Company: PayPay

👥 1001-5000

💰 $67,399,518 over 4 years ago

Internet, Mobile Payments, Finance, FinTech

  • 5+ years of experience as a Site Reliability Engineer or Tech Lead.
  • 5+ years of experience in AWS and EKS.
  • Experience designing and operating large-scale observability with VictoriaMetrics.
  • Senior-level understanding of AWS cloud architecture.
  • Proficient in programming languages such as Python, Go, or Rust.
  • Strong troubleshooting and problem-solving skills.
  • Passion for improving observability practices and driving innovation.

  • Lead the engineering team in the architecture, implementation, and optimization of the platform.
  • Drive continuous improvement of observability tools and practices.
  • Collaborate with other engineering teams on reliability and performance.
  • Automate incident response to proactively address customer impact.
  • Mentor the engineering team to develop its technical skills.
  • Communicate complex technical concepts to stakeholders.

AWS, Python, AWS EKS, Cloud Computing, ClickHouse, Go, Rust, Troubleshooting

Posted 7 days ago

📍 United States

🧭 Full-Time

💸 $125,000 - $135,000 USD per year

🔍 Transaction and compliance software for state and local governments

🏢 Company: GovOS

  • At least 2 years of experience managing, troubleshooting, and optimizing Linux and Windows environments.
  • Demonstrated ability in various programming and scripting languages (e.g., Python, Bash, PowerShell).
  • Hands-on experience in designing, deploying, and maintaining cloud infrastructure in AWS or Azure.
  • In-depth knowledge of container technologies (Docker, Podman) and orchestration platforms (Kubernetes, ECS, AKS).
  • Strong expertise in version control systems (e.g., Git) and configuration management tools (e.g., Ansible, Terraform, Chef).
  • Experience in administering and optimizing databases (e.g., MySQL, PostgreSQL, MongoDB).

  • Enhance Developer Workflows: Design and implement productive developer workflows, focusing on automation.
  • Continuous Integration & Deployment: Build, optimize, and maintain CI/CD pipelines.
  • Environment Design: Collaborate on the design of production and non-production environments for scalability and reliability.
  • Data Security: Develop processes to protect customer data, aligning with best practices.
  • Incident Management: Participate in on-call rotations for system reliability.
  • Team Contribution: Support the team with other duties and initiatives.

AWS, Docker, PostgreSQL, Python, Bash, Git, Jenkins, Kubernetes, MongoDB, MySQL, Azure, Linux, Terraform, Ansible

Posted 12 days ago

📍 Finland, Sweden, Germany, Denmark, Estonia

🧭 Full-Time

  • Proven experience in Software Engineering, SRE, or a similar role with a focus on observability, reliability, and scaling large systems.
  • Experience with OpenTelemetry, which is a key foundation for much of the infrastructure and tooling the team is converging on.
  • Strong foundation in computer science principles and engineering fundamentals.
  • Proficient in development, particularly in Go (preferred) or Python, with experience building automation tools and software for large-scale, distributed systems.
  • Hands-on experience with observability tooling such as DataDog, Prometheus, Mimir, Elasticsearch, Grafana, Jaeger, and tracing systems.
  • Expertise in cloud platforms like AWS, GCP, or Azure, with experience managing cloud infrastructure using Kubernetes and containers (Docker).
  • Deep knowledge of building and maintaining reliable, high-performance, and scalable distributed systems.
  • Solid understanding of SRE principles, incident response, and designing fault-tolerant architectures.
  • Experience with infrastructure-as-code tools like Terraform or Ansible for managing cloud environments.
  • Familiarity with CI/CD pipelines, automated testing, and continuous delivery practices.
  • Strong analytical and problem-solving skills, with experience troubleshooting complex distributed systems.
  • Excellent communication and collaboration skills, with the ability to work cross-functionally to enhance platform observability and reliability.
  • Experience working directly with development teams, with a willingness to dive into application code for observability-related topics.
  • Solid experience with Docker and Kubernetes, coupled with a strong foundation in Unix systems and networking concepts.

  • Be responsible for building and improving our observability platform and tooling, used by all Wolt engineers.
  • Contribute to initiatives focused on architecting, building, and maintaining our observability stack to efficiently handle increasing telemetry data with greater reliability.
  • Champion observability best practices, guiding and supporting other Woltians in this space.
  • Take ownership of key initiatives to improve the quality, efficiency, and reliability of our observability stack.
  • Apply expertise in SRE culture and practices to ensure observability has a meaningful impact on the business.
  • Participate in the on-call rotation to address incidents and outages, resolving reliability issues efficiently.
  • Help standardize observability resources by building tools and documentation that enhance productivity and developer experience.
  • Triage and resolve production issues within the observability scope.
  • Contribute to open-source efforts by sharing some of our internal tools with the broader community.

AWS, Docker, Python, Elasticsearch, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Terraform, Ansible

Posted 12 days ago

📍 United States, Canada

🧭 Full-Time

💸 $108,000 - $163,900 USD per year

🔍 Active Insurance, Digital Risk Management

🏢 Company: Coalition, Inc.

  • 3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles in a full stack engineering environment.
  • Strong understanding of AWS services (e.g., EC2, S3, RDS, Lambda, VPC).
  • Hands-on experience with IaC tools like Terraform, CloudFormation, or CDK.
  • Experience with containerization and orchestration tools such as ECS and Kubernetes.
  • Experience working with fault-tolerant services and developing highly available systems.
  • Exposure to full-stack monitoring and CI/CD pipelines.
  • Some knowledge of software engineering design patterns, agile development, and architecture principles.
  • Strong analytical and problem-solving skills.

  • Play a pivotal role in ensuring the performance, availability, and efficiency of cloud-based systems.
  • Design, implement, and manage robust cloud solutions.
  • Automate infrastructure and build developer-friendly platforms.
  • Optimize cloud resources and improve system observability.
  • Drive operational excellence across the organization.
  • Participate in a low-volume on-call rotation to maintain system reliability.

AWS, Docker, Python, Kubernetes, Go, CI/CD, Terraform

Posted 15 days ago

📍 United States, Canada

💸 $108,000 - $163,900 USD per year

🔍 Insurance, Cybersecurity

  • 3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles.
  • Strong understanding of AWS services and best practices.
  • Hands-on experience with IaC tools like Terraform or CloudFormation.
  • Experience with containerization tools such as ECS or Kubernetes.
  • Exposure to full-stack monitoring and CI/CD pipelines.
  • Strong analytical and problem-solving skills.

  • Design, implement, and manage robust cloud solutions.
  • Work closely with cross-functional teams.
  • Isolate, trap, and respond to system failures.
  • Develop strategies for continuous monitoring and analysis.
  • Participate in a low-volume on-call rotation to maintain reliability.

AWS, Docker, Python, Java, Kafka, Kubernetes, Go, CI/CD, Terraform

Posted 15 days ago