Site Reliability Engineer

Posted 4 months ago

💎 Seniority level: Middle, 4+ years

📍 Location: Europe, South Africa, Egypt, Latin America

🔍 Industry: Online Gaming

🗣️ Languages: English

⏳ Experience: 4+ years

🪄 Skills: AWS, Docker, Python, Kubernetes, Grafana, Prometheus

Requirements:

4+ years experience in SRE or DevOps
Veteran in AWS technologies
Experience deploying into new regions
Managed multiple Kubernetes clusters

Responsibilities:

Plan and securely deploy into new regions
Improve all aspects of AWS infrastructure
Monitor all releases for smooth operations
Manage multiple K8s clusters
Research and implement new technology

Related Jobs

🔥 Site Reliability Engineer

Posted 3 days ago

📍 France

🧭 Full-Time

💸 50,000 - 75,000 EUR per year

🔍 Speech AI

🏢 Company: Gladia · 👥 11-50 · Digital Marketing SEO E-Commerce Brand Marketing Apps Information Technology Web Design

🔧 Requirements

At least 5 years of experience working on a rapidly growing product, with a strong focus on scalability and well-tested solutions
Strong experience with PromQL, OpenTelemetry, and self-hosted stacks
Proficiency with Kubernetes and containerization
Experience with CI/CD processes (GitHub, test-driven development, etc.)
Knowledge of at least one programming language (Python, Go, etc.)
Knowledge of databases (PostgreSQL, Patroni)
Experience with UNIX/Linux operating systems
Networking knowledge (DNS, OSI model, HTTP/HTTPS, SSL/TLS)

💡 Responsibilities

Create and maintain hybrid Kubernetes clusters
Implement and manage the observability stack (CNCF landscape)
Prepare deployments for production
Optimize infrastructure and tool scaling to keep costs low
Support developers in implementing observability
Document technical procedures and policies

🪄 Skills: Docker, PostgreSQL, Python, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Networking, Ansible

🔥 Senior Site Reliability Engineer (SRE) - Poland

Posted 3 days ago

📍 Poland

🔍 Software Development

🔧 Requirements

Extensive experience with enterprise scale continuous delivery environments
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with sustainable incident response in a blameless environment
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM, observability, and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
Background in Linux Systems Engineering
Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
On-call responsibilities

🪄 Skills: AWS, Docker, Node.js, Python, Bash, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Algorithms, Data Structures, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Ansible, Scripting, Software Engineering, Debugging

🔥 Senior Site Reliability Engineer - Linux

Posted 8 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy · 👥 5001-10000 · 💰 $800,000,000 Post-IPO Equity about 3 years ago · 🫂 Last layoff over 1 year ago · Web Hosting Domain Registrar Web Development Online Portals

🔧 Requirements

A track record of delivering capabilities that build customer value and business impact.
Knowledge of principles for building performant and quality REST APIs.
Experience with testing code, care and feeding of both on-premises and cloud compute systems, Docker and other container-related technologies, Python or similar languages, and HashiCorp Vault or similar tooling.

💡 Responsibilities

Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

🪄 Skills: Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

🔥 Site Reliability Engineer (LATAM)

Posted 16 days ago

📍 Latin America

🧭 Full-Time

🔍 Game Development

🏢 Company: Argus Labs

🔧 Requirements

4+ years of DevOps experience.
Ability to design and implement highly available and reliable systems.
Proven experience with Linux, Docker, and cloud technologies such as AWS, GCP, and Azure.
Extensive experience setting up and maintaining database infrastructure, including Postgres, Terraform, NoSQL, MongoDB, and DocumentDB.
Experience with DevOps tools such as Terraform, Ansible, Kubernetes, Redis, Jenkins, etc.
Knowledge of CI/CD.
Excellent communication and time management skills.

💡 Responsibilities

Work closely with stakeholders company-wide to provide services that enhance the user experience for the development team, as well as our end-users.
Design and build operational infrastructure to support games, automating where possible.
Spearhead company-wide security culture and architecture to keep our platform secure.
Own delivery, scalability, and reliability of our backend infrastructure.
Advise and collaborate with the rest of the engineering team to ensure we are building safe, secure, and reliable products.

🪄 Skills: AWS, Docker, PostgreSQL, Cloud Computing, GCP, Jenkins, Kubernetes, MongoDB, Azure, Redis, NoSQL, CI/CD, Linux, DevOps, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted 18 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert · 👥 11-50 · 💰 $20,149,993 Seed 8 months ago · Data Management SaaS Application Performance Management

🔧 Requirements

Experience in cloud infrastructure management
Knowledge of CI/CD processes
Experience with incident management

💡 Responsibilities

Design, build, and maintain scalable and secure cloud infrastructure as code
Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
Enable cost transparency and optimize infrastructure spending
Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
Build and maintain robust CI/CD pipelines that accelerate time from code to customer
Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
Lead and continuously improve our Incident Management process
Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
Develop and maintain incident response playbooks and post-mortem practices

🪄 Skills: AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 18 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity · 👥 51-200 · 💰 Corporate over 2 years ago · Software Development

🔧 Requirements

Proven experience with SRE/DevOps tools, processes, and culture.
Proficient in programming languages like Python, Go, and TypeScript.
5+ years of experience participating in an SRE on-call rotation.
Analytical mindset for designing, diagnosing, and optimizing infrastructure.
Skilled in managing scalable, highly available, cloud-based applications.
Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
Strong database management skills, particularly with PostgreSQL.
Experience with infrastructure as code, using tools like Terraform.
Proficient in building and maintaining CI/CD pipelines.
Familiarity with observability tools like Prometheus and similar stacks.
Calm and clear-headed in incident and outage situations, with a thoughtful communication style for high-pressure environments.
Open-minded yet discerning when it comes to exploring new technologies.

💡 Responsibilities

Plan and implement a global platform for delivering our software as a service.
Diagnose and troubleshoot complex distributed systems.
Ensure observability and analyze the behavior of our stack.
Handle orchestration, deployment, monitoring, and automation.
Participate in our on-call rotation.

🪄 Skills: PostgreSQL, Python, Cloud Computing, ElasticSearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

🔥 Senior Site Reliability Engineer - Americas

Posted 29 days ago

📍 Americas

🧭 Full-Time

💸 160,000 - 180,000 USD per year

🔍 Software Development

🏢 Company: Customer.io · 👥 251-500 · 💰 Series A about 3 years ago · Digital Media SaaS Product Search Software

🔧 Requirements

7+ years of professional experience as a Site Reliability Engineer, with proven experience leading large complex projects affecting production SaaS environments.
Professional experience with relational database systems, managing the servers and tuning performance, particularly MySQL.
Proven experience handling scale, reliability, and performance challenges when running distributed applications on cloud infrastructure (Google Cloud Platform is advantageous), using both managed and self-hosted solutions.
Proven ability to build cloud infrastructure using Terraform and develop operational tooling in various languages including Golang and Bash.
Deep knowledge of UNIX environments and modern collaborative development practices.
Excellent communication skills, both verbal and written, with a collaborative mindset to make informed, empathetic decisions.
Ability to work autonomously in your timezone, advancing tasks and projects with minimal guidance.
Demonstrated ability to influence product direction and contribute technical insights that help drive business value.
A strong focus on proactive identification and resolving issues in production environments.
A self-starter who thrives in both synchronous and asynchronous work environments.

💡 Responsibilities

Architect and maintain critical infrastructure to enable Customer.io to scale and handle real-time processing of billions of messages.
Strategically plan and implement infrastructure growth to meet evolving demands and ensure repeatability.
Streamline and automate processes for efficiency and reliability, removing manual toil.
Participate in on-call rotations to swiftly address availability incidents and support technical engineers with customer-related issues.
Develop observability to ensure comprehensive monitoring and effective alerting of infrastructure and applications.
Troubleshoot and resolve production issues across various services and stack levels.
Contribute to a collaborative and supportive team environment, fostering individual, professional, and team growth.
Engage in continuous learning and knowledge sharing through code reviews, pair programming, and team collaborations to refine best practices.

🪄 Skills: Backend Development, SQL, Bash, Cloud Computing, GCP, Kubernetes, MySQL, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, SaaS

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote employment solutions

🏢 Company: Remote - Referral Board

🔧 Requirements

Significant and demonstrated experience as a Senior Site Reliability Engineer.
Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
Knowledge of CI/CD tools, with a preference for GitLab CI.
Experience with a back-end programming language such as Elixir, Clojure, Java, Node.js, or Python.
Experience in a programming language used for developing SRE tooling, like Go or Python.
Experience running and configuring Linux systems in non-cloud environments.
Security knowledge from both defensive and offensive perspectives.
Excellent communication and interpersonal skills.

💡 Responsibilities

Managing and improving existing infrastructure.
Helping build the next generation of the platform using tools like Kubernetes, Terraform, and Docker.
Streamlining and automating deployment processes.
Working closely with the Security team to address potential threats and patches.
Supporting engineers and product teams to enhance scalability, stability, and reliability.

🪄 Skills: AWS, Python, Kubernetes, Go, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote Employment and Compliance Solutions

🏢 Company: Remote · 👥 1001-5000 · 💰 $300,000,000 Series C almost 3 years ago · 🫂 Last layoff over 2 years ago · Human Resources Services

🔧 Requirements

Significant and demonstrated experience as a Senior Site Reliability Engineer, which includes architecting, implementing, and maintaining a Platform for other teams.
Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
Knowledge of CI/CD tools (GitLab CI is preferred).
Experience with a back-end programming language (Elixir, Clojure, Java, Node.js, Python, etc.).
Experience with a programming language for SRE tooling (Go, Python).
Experience running and configuring Linux systems in a non-cloud environment.
Security knowledge from both defensive and offensive perspectives.
Excellent communication and interpersonal skills.

💡 Responsibilities

Managing and improving our existing infrastructure.
Helping build the next generation of our platform using tools like Kubernetes, Terraform, and Docker.
Streamlining and automating deployment processes.
Working closely with the Security team to address potential threats and patches.
Supporting engineers and product teams to improve overall scalability, stability, and reliability.

🪄 Skills: AWS, Python, Kubernetes, Go, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 Argentina, Brazil

🧭 Full-Time

💸 65,000 - 90,000 USD per year

🔍 Cybersecurity

🏢 Company: SecurityScorecard · 👥 251-500 · 💰 $180,000,000 Series E almost 4 years ago · Security Risk Management Cyber Security Software

🔧 Requirements

Proven experience as an SRE, DevOps Engineer, or similar role
Strong background in CI/CD tools (Jenkins, GitHub Actions, etc.)
Experience with cloud platforms (AWS, GCP, Azure) and container orchestration (Docker, Kubernetes)
Proficiency with infrastructure as code tools (Terraform, Ansible)
Experience with automated testing frameworks (Selenium, JUnit)
Knowledge of scripting languages (Python, Bash)
Familiarity with monitoring and observability tools (Prometheus, Grafana)

💡 Responsibilities

Design, implement, and maintain CI/CD pipelines
Enhance infrastructure as code practices
Optimize deployment rollbacks and improve incident response
Develop automated testing strategies
Collaborate with developers for application reliability
Build monitoring and alerting solutions
Drive improvements in observability and metrics collection
Participate in on-call rotations

🪄 Skills: AWS, Docker, Python, Bash, JUnit, Kubernetes, Grafana, Prometheus, Selenium, CI/CD, Terraform