Senior Site Reliability Engineer

Posted about 1 month ago

💎 Seniority level: Senior, 5+ years

📍 Location: Europe

🔍 Industry: Software Development

🏢 Company: Sanity | 👥 51-200 | 💰 Corporate, almost 3 years ago | Software Development

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: PostgreSQL, Python, Cloud Computing, Elasticsearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

Requirements:

Proven experience with SRE/DevOps tools, processes, and culture.
Proficient in programming languages like Python, Go, and TypeScript.
5+ years of experience participating in an SRE on-call rotation.
Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
Strong database management skills, particularly with PostgreSQL.
Experience with infrastructure as code, using tools like Terraform.
Familiarity with observability tools like Prometheus and similar stacks.

Responsibilities:

Plan and implement a global platform for delivering our software as a service.
Diagnose and troubleshoot complex distributed systems.
Ensure observability and analyze the behavior of our stack.
Handle orchestration, deployment, monitoring, and automation.
Participate in our on-call rotation.

Related Jobs

🔥 Senior Site Reliability Engineer, EU, UK or Americas

Posted 10 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros | 👥 11-50 | 💰 $17,000,000, about 2 years ago | Cryptocurrency

🔧 Requirements

An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
Knowledge and experience in managing configuration at scale.
Experience with CI/CD pipelines and version control best practices.
Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
Strong knowledge of cloud security and IAM policies is required.
SIEM and threat management experience.
Must know how to secure Mac and Linux endpoints.
Python and Bash experience is a must.

💡 Responsibilities

Participate in on-call roster to support our trading operations.
Maintain and improve our global infrastructure with high performance and reliability requirements.
Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
Development of internal tools and automation to accomplish the team’s goals.
Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
Active participation in various trading and infrastructure projects.
Work closely with developers, traders and other staff to accomplish our firm’s goals.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted 21 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune | 👥 101-250

🔧 Requirements

Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
Experience with infrastructure-as-code and orchestration tools.
Strong understanding of system performance, debugging, and optimization across diverse environments.
Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
Solid foundation in computer science fundamentals and system design.
Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
Experience with distributed systems and managing large-scale, high-availability environments.
Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
Experience with bare-metal infrastructure.
Proficiency in scripting or programming languages such as Python, Go, or Bash.
Experience with monitoring and observability tools for infrastructure performance.
Familiarity with cloud cost management and performance improvement strategies.
Strong analytical and troubleshooting skills.
Experience working across multiple time zones.

💡 Responsibilities

Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

🪄 Skills: Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

🔥 Senior Site Reliability Engineer - (Remote - Europe)

Posted 22 days ago

📍 Germany, Spain, Portugal

🏢 Company: Jobgether | 👥 11-50 | 💰 $1,493,585 Seed, about 2 years ago | Internet

🔧 Requirements

5+ years of experience in a Site Reliability Engineer or similar role.
3+ years of experience with AWS services and container orchestration tools.
2+ years of Kubernetes experience.
Strong knowledge of observability tools and principles (monitoring, logging, tracing).
Hands-on experience with Terraform for infrastructure as code.
Proficiency in at least one programming language (e.g., Python, Go, Java).
Experience in incident management, postmortem analysis, and risk mitigation.
Familiarity with messaging systems like SNS, SQS, and experience with CI/CD tools.

💡 Responsibilities

Develop and maintain systems that are reliable, scalable, and efficient.
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure optimal system performance.
Conduct blameless post-incident reviews, identify root causes, and implement preventive actions.
Automate operational tasks, incident responses, and contribute to system performance optimizations.
Work with engineering teams to ensure systems are designed for reliability, scalability, and maintainability.
Continuously evaluate and improve system performance, capacity, and cost efficiency.
Participate in the on-call rotation, providing troubleshooting and resolution support for critical issues.

🪄 Skills: AWS, Python, Java, Kubernetes, Go, CI/CD, RESTful APIs, Linux, Terraform, Scripting

🔥 Senior Site Reliability Engineer (SRE) for Release Engineering (remote-only)

Posted 23 days ago

📍 Cyprus, Montenegro, Georgia, Serbia, Poland

🔍 Software Development

🏢 Company: CloudLinux

🔧 Requirements

Strong background in development: the ideal candidate started their career as a developer, then moved on to large-scale infrastructure projects.
Proven experience as a leading SRE or in a similar role, with a strong focus on Linux environments.
Proficiency in modern agile SDLC practices and principles, orchestration, and CI/CD tooling, e.g. Python, Java, Terraform, Ansible, CloudFormation, Puppet, Chef, or similar.
Knowledge of the Grafana ecosystem or similar, building dashboards, alert rules, PromQL, as well as frontend observability.
Excellent technical knowledge of IT Infrastructure, including network and application load balancers, switches, routers, and IP addressing.
Strong analytical and problem-solving skills with a focus on root cause analysis and mitigation.
Excellent communication and teamwork skills with the ability to collaborate effectively across engineering teams.

💡 Responsibilities

As a first assignment, design, implement, and manage scalable, resilient, and secure company-wide repository infrastructure for CloudLinux products.
Automate software operations for re-usability and consistency across private and public clouds, taking into consideration the complexities of distributed systems.
Monitor system performance and troubleshoot issues proactively to ensure optimal uptime and reliability.
Automate deployment processes using Infrastructure as Code (IaC) principles.
Share your experience, know-how, and best practices with other team members in design sessions, system architecture discussions, mentorship, and "doing work together".

🪄 Skills: Python, Bash, Cloud Computing, Kubernetes, Nginx, Grafana, Prometheus, Release Management, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer

Posted 24 days ago

📍 United Kingdom

🧭 Full-Time

🔍 Software Development

🏢 Company: StarRez | 👥 251-500 | 💰 Private, about 3 years ago | Consulting SaaS Property Management Software

🔧 Requirements

1+ years of experience working on a SaaS platform.
Proven experience (2+ years) in a Platform Engineering, Site Reliability Engineering, or Software Engineering role.
Proficiency in at least one object-oriented programming language (C# preferred).
Production experience operating containerization technologies (Kubernetes).
Proficiency with one or more public cloud providers such as Azure, AWS, or GCP.
Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
Experience with monitoring, observability and logging tools such as DataDog, Prometheus, Grafana, or similar.
Proven track record of maintaining highly-available and performant production environments.
Ability to identify and implement effective mitigation strategies and operational playbooks.

💡 Responsibilities

Provide technical leadership and mentoring within the team through knowledge sharing sessions, pair programming, code reviews, and solution design.
Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
Implement and maintain monitoring/alerting/logging systems to identify and respond to incidents.
Conduct/participate in Root Cause Analyses (RCAs) and blameless post-mortems.
Participate in on-call rotations to ensure system reliability and rapid incident response.
Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth.
Conduct performance tests to identify and remediate bottlenecks.
Develop and maintain platform solutions, automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
Monitor, review, and tune databases to ensure high availability and performance.
Collaborate with product engineering teams to design/build fit-for-purpose and observable software.
Contribute and collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) as required.

🪄 Skills: AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

🔥 Senior Site Reliability Engineer - Linux

Posted about 1 month ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy | 👥 5001-10000 | 💰 $800,000,000 Post-IPO Equity, over 3 years ago | 🫂 Last layoff over 1 year ago | Web Hosting Domain Registrar Web Development Online Portals

🔧 Requirements

A track record of delivering capabilities that build customer value and business impact.
Knowledge of principles for building performant and quality REST APIs.
Experience with testing code; care and feeding of both on-premises and cloud compute systems; Docker and other container-related technologies; Python or similar languages; and HashiCorp Vault or similar tooling.

💡 Responsibilities

Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

🪄 Skills: Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

🔥 Senior Site Reliability Engineer [United Kingdom]

Posted 2 months ago

📍 United Kingdom

🧭 Contract

🔍 SaaS platform accelerating digital transformation in the restaurant industry

🔧 Requirements

5+ years of professional experience building scalable, efficient, and resilient systems.
Experience with monitoring tools like Datadog, Sumo Logic, Raygun, New Relic, Grafana, CloudWatch, and Splunk SignalFx.
Fluency in Incident Management using tools such as FireHydrant, OpsGenie, PagerDuty, VictorOps, or similar.
Experience with build and deploy tools (e.g., Jenkins, TeamCity, Octopus, or CircleCI).
Prior hands-on software development experience.

💡 Responsibilities

Guide the full reliability lifecycle, from observability and SLIs/SLOs through incident response to postmortems and follow-up actions.
Implement and tailor our incident response tools to minimize outage durations.
Build collaborative monitoring solutions with members across multiple product teams.
Contribute insights across teams to help us improve or re-architect existing systems to support scale, performance and extensibility.
Rethink our observability tooling to improve architecture, knowledge models, user experience, performance and stability.
Analyze and mature our processes around Incident Response, Observability, Postmortems and Predictive Monitoring.
Influence an engineering culture of reliability, observability, and availability.
Participate in an Incident Commander on-call rotation, driving remediation efforts during incidents across our platform to improve the user experience.
Mentor engineering teams through game days, SRE boot camps and other training and feedback channels.

🪄 Skills: AWS, Docker, Python, SQL, CI/CD, DevOps, Microservices

🔥 Senior Site Reliability Engineer

Posted 2 months ago

📍 Poland, Germany, United Kingdom

🔍 Artificial Intelligence and Data Science

🏢 Company: Mozn

🔧 Requirements

BSc/BA in Computer Engineering, Computer Science, or related discipline.
5 years of experience in a similar position (SRE, DevOps, or infrastructure engineering).
Professional certifications are appreciated.
Solid experience with container runtimes and orchestrators: Docker and Kubernetes.
Experience with at least one major cloud provider: AWS, Azure, GCP, or Oracle.
Preferred programming languages for infrastructure as code: Python and Golang.
Experience with Linux servers and competency in bash scripting.
Experience with Infrastructure as Code.
Experience with automating deployment pipelines.
Solid foundation in networking.
Knowledge of big data platforms like Kafka, Hadoop, and Spark is a plus.
Knowledge of SQL and SQL database management is a plus.
Knowledge of Terraform or Ansible is a plus.

💡 Responsibilities

Mixture of software engineering, system architecture design, and operation.
Attend morning meetings and sprint planning as an SRE team member.
Help design, build, support, and scale cloud and on-premise infrastructure.
Implement monitoring, alerting, and debugging for infrastructure.
Design and implement CI/CD workflows with best practices.
Maintain data stores including load monitoring and backup plans.
Collaborate with other departments to address their use cases.
Explore new technologies to improve the current stack.
Install and configure servers and network equipment using Infrastructure as Code.
Practice sustainable incident response and blameless postmortems.

🪄 Skills: AWS, Docker, Python, SQL, Bash, Hadoop, Kafka, Kubernetes, Spark, CI/CD, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted 4 months ago

📍 Spain

🧭 Full-Time

🔍 Mobility services

🏢 Company: Cabify | 👥 1001-5000 | 💰 $16,473,668 Debt Financing, about 1 year ago | Internet Logistics Ride Sharing Transportation Mobile

🔧 Requirements

Strong knowledge of Unix, the networking stack, the OSI model, containers, and monitoring.
Programming skills in at least one language; capability to learn others.
Natural tendency to automate tasks.
Effective and asynchronous communication skills.
Care for the company, team, and self.
Embrace diversity and humility.
Action-oriented and iterative problem solving.
Preference for simplicity over complexity.
Ability to identify and address bottlenecks.
Proficiency in English communication.

💡 Responsibilities

Evolving our infrastructure platform by building self-service components.
Working closely with Product and Infrastructure teams to develop infrastructure components.
Designing and implementing tooling for service availability, scalability, observability, and latency improvements.
Increasing reliability awareness with teams and reviewing implementations.
Defining SLIs, SLOs and SLAs as part of services' lifecycle.
Sharing an on-call schedule for owned platform services.
Solving problems in a highly available platform and building automations to prevent incidents.
Participating in the recruiting process to grow the engineering team.

🪄 Skills: AWS, AWS EKS, Kubernetes, Microservices, Networking

🔥 Senior Site Reliability Engineer (SRE)

Posted 5 months ago

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

🔧 Requirements

Proficiency in programming languages such as Python, Go, and JavaScript.
5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
Strong understanding of Linux/Unix systems and networking.
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
Willingness to collaborate and share knowledge with colleagues.
Ability to take responsibility for work and demonstrate accountability.

💡 Responsibilities

Develop and maintain monitoring and alerting solutions.
Respond to incidents, troubleshoot issues, and perform root cause analysis.
Automate repetitive tasks and improve deployment processes.
Develop and maintain tools to support infrastructure and applications.
Analyze system performance and implement optimizations to improve efficiency and reduce latency.
Ensure systems are secure and compliant with relevant standards and regulations.
Maintain comprehensive documentation of systems and processes.
Share knowledge and best practices with team members.
Ensure the reliability, performance, and scalability of databases.
Perform database optimization, maintenance, and troubleshooting.

🪄 Skills: AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD
