Site Reliability Engineer

Posted 4 months ago

💎 Seniority level: Middle, 4+ years

📍 Location: Europe, South Africa, Egypt, Latin America

🔍 Industry: Online Gaming

🗣️ Languages: English

⏳ Experience: 4+ years

🪄 Skills: AWS, Docker, Python, Kubernetes, Grafana, Prometheus

Requirements:
  • 4+ years of experience in SRE or DevOps
  • Veteran in AWS technologies
  • Experience deploying into new regions
  • Managed multiple Kubernetes clusters
Responsibilities:
  • Plan and securely deploy into new regions
  • Improve all aspects of AWS infrastructure
  • Monitor all releases for smooth operations
  • Manage multiple K8s clusters
  • Research and implement new technology

Related Jobs

📍 France

🧭 Full-Time

💸 50,000 - 75,000 EUR per year

🔍 Speech AI

🏒 Company: Gladia | 👥 11-50 | Digital Marketing, SEO, E-Commerce, Brand Marketing, Apps, Information Technology, Web Design

Requirements:
  • At least 5 years of experience working on a rapidly growing product, with a strong focus on scalability and well-tested solutions
  • Strong experience with PromQL, OpenTelemetry, and self-hosted stacks
  • Proficiency with Kubernetes and containerization
  • Experience with CI/CD processes (GitHub, test-driven development, etc.)
  • Knowledge of at least one programming language (Python, Go, etc.)
  • Knowledge of databases (PostgreSQL, Patroni)
  • Experience with UNIX/Linux operating systems
  • Networking knowledge (DNS, OSI model, HTTP/HTTPS, SSL/TLS)
Responsibilities:
  • Create and maintain hybrid Kubernetes clusters
  • Implement and manage the observability stack (CNCF landscape)
  • Prepare deployments for production
  • Optimize infrastructure and tool scaling to keep costs low
  • Support developers in implementing observability
  • Document technical procedures and policies

Docker, PostgreSQL, Python, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Networking, Ansible

Posted 3 days ago

📍 Poland

🔍 Software Development

Requirements:
  • Extensive experience with enterprise scale continuous delivery environments
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with sustainable incident response in a blameless environment
  • Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
  • Background in Linux Systems Engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, Blameless, etc.
Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities

AWS, Docker, Node.js, Python, Bash, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Algorithms, Data Structures, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Ansible, Scripting, Software Engineering, Debugging

Posted 3 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏒 Company: GoDaddy | 👥 5001-10000 | 💰 $800,000,000 Post-IPO Equity about 3 years ago | 🫂 Last layoff over 1 year ago | Web Hosting, Domain Registrar, Web Development, Online Portals

Requirements:
  • A track record of delivering capabilities that build customer value and business impact.
  • Knowledge of principles for building performant and quality REST APIs.
  • Experience with testing code; care and feeding of both on-premises and cloud compute systems; Docker and other container-related technologies; Python or similar languages; and HashiCorp Vault or similar tooling.
Responsibilities:
  • Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
  • Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
  • Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
  • Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

Posted 8 days ago

📍 Latin America

🧭 Full-Time

🔍 Game Development

🏒 Company: Argus Labs

Requirements:
  • 4+ years of DevOps experience.
  • Ability to design and implement highly available and reliable systems.
  • Proven experience with Linux, Docker, and cloud technologies such as AWS, GCP, and Azure.
  • Extensive experience setting up and maintaining database infrastructure, including Postgres, Terraform, NoSQL, MongoDB, and DocumentDB.
  • Experience with DevOps tools such as Terraform, Ansible, Kubernetes, Redis, Jenkins, etc.
  • Knowledge of CI/CD.
  • Excellent communication and time management skills.
Responsibilities:
  • Work closely with stakeholders company-wide to provide services that enhance the user experience for the development team, as well as our end-users.
  • Design and build operational infrastructure to support games, automating where possible.
  • Spearhead company-wide security culture and architecture to keep our platform secure.
  • Own delivery, scalability, and reliability of our backend infrastructure.
  • Advise and collaborate with the rest of the engineering team to ensure we are building safe, secure, and reliable products.

AWS, Docker, PostgreSQL, Cloud Computing, GCP, Jenkins, Kubernetes, MongoDB, Azure, Redis, NoSQL, CI/CD, Linux, DevOps, Terraform, Ansible

Posted 16 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏒 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed 8 months ago | Data Management, SaaS, Application Performance Management

Requirements:
  • Experience in cloud infrastructure management
  • Knowledge of CI/CD processes
  • Experience with incident management
Responsibilities:
  • Design, build, and maintain scalable and secure cloud infrastructure as code
  • Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
  • Enable cost transparency and optimize infrastructure spending
  • Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
  • Build and maintain robust CI/CD pipelines that accelerate time from code to customer
  • Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
  • Lead and continuously improve our Incident Management process
  • Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
  • Develop and maintain incident response playbooks and post-mortem practices

AWS, Docker, CI/CD, Linux, Terraform

Posted 18 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏒 Company: Sanity | 👥 51-200 | 💰 Corporate over 2 years ago | Software Development

Requirements:
  • Proven experience with SRE/DevOps tools, processes, and culture.
  • Proficient in programming languages like Python, Go, and TypeScript.
  • 5+ years of experience participating in an SRE on-call rotation.
  • Analytical mindset for designing, diagnosing, and optimizing infrastructure.
  • Skilled in managing scalable, highly available, cloud-based applications.
  • Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
  • Strong database management skills, particularly with PostgreSQL.
  • Experience with infrastructure as code, using tools like Terraform.
  • Proficient in building and maintaining CI/CD pipelines.
  • Familiarity with observability tools like Prometheus and similar stacks.
  • Calm and clear-headed in incident and outage situations, with a thoughtful communication style for high-pressure environments.
  • Open-minded yet discerning when it comes to exploring new technologies.
Responsibilities:
  • Plan and implement a global platform for delivering our software as a service.
  • Diagnose and troubleshoot complex distributed systems.
  • Ensure observability and analyze the behavior of our stack.
  • Handle orchestration, deployment, monitoring, and automation.
  • Participate in our on-call rotation.

PostgreSQL, Python, Cloud Computing, Elasticsearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

Posted 18 days ago

📍 Americas

🧭 Full-Time

💸 160,000 - 180,000 USD per year

🔍 Software Development

🏒 Company: Customer.io | 👥 251-500 | 💰 Series A about 3 years ago | Digital Media, SaaS, Product Search, Software

Requirements:
  • 7+ years of professional experience as a Site Reliability Engineer, with proven experience leading large complex projects affecting production SaaS environments.
  • Professional experience with relational database systems, managing the servers and tuning performance, particularly MySQL.
  • Proven experience managing scale, reliability, and performance challenges for distributed applications on cloud infrastructure (Google Cloud Platform is advantageous), covering both managed and self-hosted solutions.
  • Proven ability to build cloud infrastructure using Terraform and develop operational tooling in various languages including Golang and Bash.
  • Deep knowledge of UNIX environments and modern collaborative development practices.
  • Excellent communication skills, both verbal and written, with a collaborative mindset to make informed, empathetic decisions.
  • Ability to work autonomously in your timezone, advancing tasks and projects with minimal guidance.
  • Demonstrated ability to influence product direction and contribute technical insights that help drive business value.
  • A strong focus on proactively identifying and resolving issues in production environments.
  • A self-starter who thrives in both synchronous and asynchronous work environments.
Responsibilities:
  • Architect and maintain critical infrastructure to enable Customer.io to scale and handle real-time processing of billions of messages.
  • Strategically plan and implement infrastructure growth to meet evolving demands and repeatability.
  • Streamline and automate processes for efficiency and reliability, removing manual toil.
  • Participate in on-call rotations to swiftly address availability incidents and support technical engineers with customer-related issues.
  • Develop observability to ensure comprehensive monitoring and effective alerting of infrastructure and applications.
  • Troubleshoot and resolve production issues across various services and stack levels.
  • Contribute to a collaborative and supportive team environment, fostering individual, professional, and team growth.
  • Engage in continuous learning and knowledge sharing through code reviews, pair programming, and team collaborations to refine best practices.

Backend Development, SQL, Bash, Cloud Computing, GCP, Kubernetes, MySQL, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, SaaS

Posted 29 days ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote employment solutions

🏒 Company: Remote - Referral Board

Requirements:
  • Significant and demonstrated experience as a Senior Site Reliability Engineer.
  • Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
  • Knowledge of CI/CD tools, with a preference for GitLab CI.
  • Experience with a back-end programming language such as Elixir, Clojure, Java, Node.js, or Python.
  • Experience in a programming language used for developing SRE tooling, like Go or Python.
  • Experience running and configuring Linux systems in non-cloud environments.
  • Security knowledge from both defensive and offensive perspectives.
  • Excellent communication and interpersonal skills.
Responsibilities:
  • Managing and improving existing infrastructure.
  • Helping build the next generation of the platform using tools like Kubernetes, Terraform, and Docker.
  • Streamlining and automating deployment processes.
  • Working closely with the Security team to address potential threats and patches.
  • Supporting engineers and product teams to enhance scalability, stability, and reliability.

AWS, Python, Kubernetes, Go, Linux, Terraform

Posted about 1 month ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote Employment and Compliance Solutions

🏒 Company: Remote | 👥 1001-5000 | 💰 $300,000,000 Series C almost 3 years ago | 🫂 Last layoff over 2 years ago | Human Resources Services

Requirements:
  • Significant and demonstrated experience as a Senior Site Reliability Engineer, which includes architecting, implementing, and maintaining a Platform for other teams.
  • Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
  • Knowledge of CI/CD tools (GitLab CI is preferred).
  • Experience with a back-end programming language (Elixir, Clojure, Java, Node.js, Python, etc.).
  • Experience with a programming language for SRE tooling (Go, Python).
  • Experience running and configuring Linux systems in a non-cloud environment.
  • Security knowledge from both defensive and offensive perspectives.
  • Excellent communication and interpersonal skills.
Responsibilities:
  • Managing and improving our existing infrastructure.
  • Helping build the next generation of our platform using tools like Kubernetes, Terraform, and Docker.
  • Streamlining and automating deployment processes.
  • Working closely with the Security team to address potential threats and patches.
  • Supporting engineers and product teams to improve overall scalability, stability, and reliability.

AWS, Python, Kubernetes, Go, CI/CD, Linux, Terraform

Posted about 1 month ago

📍 Argentina, Brazil

🧭 Full-Time

💸 65,000 - 90,000 USD per year

🔍 Cybersecurity

🏒 Company: SecurityScorecard | 👥 251-500 | 💰 $180,000,000 Series E almost 4 years ago | Security, Risk Management, Cyber Security, Software

Requirements:
  • Proven experience as an SRE, DevOps Engineer, or similar role
  • Strong background in CI/CD tools (Jenkins, GitHub Actions, etc.)
  • Experience with cloud platforms (AWS, GCP, Azure) and container orchestration (Docker, Kubernetes)
  • Proficiency with infrastructure as code tools (Terraform, Ansible)
  • Experience with automated testing frameworks (Selenium, JUnit)
  • Knowledge of scripting languages (Python, Bash)
  • Familiarity with monitoring and observability tools (Prometheus, Grafana)
Responsibilities:
  • Design, implement, and maintain CI/CD pipelines
  • Enhance infrastructure as code practices
  • Optimize deployment rollbacks and improve incident response
  • Develop automated testing strategies
  • Collaborate with developers for application reliability
  • Build monitoring and alerting solutions
  • Drive improvements in observability and metrics collection
  • Participate in on-call rotations

AWS, Docker, Python, Bash, JUnit, Kubernetes, Grafana, Prometheus, Selenium, CI/CD, Terraform

Posted about 1 month ago