Site Reliability Engineer

Posted 3 days ago

📍 Location: France

🏢 Company: Sinch · 👥 1001-5000 · 💰 $48,845,918 Post-IPO Debt (7 months ago) · Messaging, SaaS, Telecommunications, Mobile, Software

🗣️ Languages: English

🪄 Skills: Docker, PostgreSQL, Python, Bash, ElasticSearch, GCP, Kubernetes, Cassandra, Grafana, Prometheus, Linux, Terraform, Ansible

Requirements:

Background in infrastructure, operations, or software engineering.
Experience with cloud providers such as GCP.
Proficiency in configuration management tools such as Terraform and Ansible.
Hands-on proficiency with modern monitoring tools like Prometheus and Grafana.
Experience with distributed data stores such as Cassandra, PostgreSQL, and ElasticSearch.
Experience with Python and Bash is beneficial.
Strong technical skills across various infrastructure technologies.
Strong communication skills.
Experience operating and maintaining production systems in a Linux and public cloud environment.

Responsibilities:

Partner with product engineering teams to identify system requirements.
Build and support our cloud-based infrastructure.
Automate routine processes and remediation tasks.
Develop, monitor and track Service Level Objectives (SLOs) for the systems under management.
Proactively troubleshoot, resolve, and plan for issues that typically come from support staff, other engineering teams, and our automated monitoring system.
Ensure our datastores are healthy and operate at optimal performance levels.
Contribute to the growth and culture of our engineering team.

Related Jobs

🔥 Site Reliability Engineer

Posted 2 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Platform.sh · 👥 251-500 · 💰 $140,000,000 Series D (almost 3 years ago) · Internet, Open Source, PaaS, Cloud Management, Software

🔧 Requirements

DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Site Reliability Engineer

Posted 3 days ago

📍 France, Germany, Spain, the United Kingdom, West Coast in the United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Remote Woman

🔧 Requirements

A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability
Automate Deployments and Workflows
Optimize CI/CD Pipelines
Cloud Infrastructure Management
Incident Response and Post-Mortem
Collaborate with Cross-Functional Teams
Drive Technical Innovation

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer, EU, UK or Americas

Posted 4 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros · 👥 11-50 · 💰 $17,000,000 (about 2 years ago) · Cryptocurrency

🔧 Requirements

An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
Knowledge and experience in managing configuration at scale.
Experience with CI/CD pipelines and version control best practices.
Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
Strong knowledge of cloud security and IAM policies is required.
SIEM and threat management experience.
Must know how to secure Mac and Linux endpoints.
Python and Bash experience is a must.

💡 Responsibilities

Participate in on-call roster to support our trading operations.
Maintain and improve our global infrastructure with high performance and reliability requirements.
Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
Develop internal tools and automation to accomplish the team’s goals.
Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
Active participation in various trading and infrastructure projects.
Work closely with developers, traders and other staff to accomplish our firm’s goals.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

🔥 Staff Site Reliability Engineer

Posted 9 days ago

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 129,347 - 200,824 USD per year

🔍 Software Development

🏢 Company: Wikimedia Foundation · 👥 251-500 · 💰 $2,100,000 Grant (about 5 years ago)

🔧 Requirements

7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).

💡 Responsibilities

Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

🪄 Skills: AWS, Docker, Python, Cloud Computing, ElasticSearch, Kubernetes, Machine Learning, MLFlow, Numpy, PyTorch, Grafana, Prometheus, Tensorflow, Linux, DevOps, Terraform, Ansible

🔥 Site Reliability Engineer: Postgres

Posted 10 days ago

📍 Worldwide

🧭 Full-Time

🔍 Software Development

🏢 Company: Supabase · 👥 101-250 · 💰 $80,000,000 Series C (6 months ago) · Database, Developer Tools, Artificial Intelligence (AI), Information Services, Information Technology, Software

🔧 Requirements

Experience in designing multi-tenant database solutions, designing for failover, fault-tolerance, and disaster recovery
Experience with orchestrating stateful workloads at scale or having used a Postgres operator like the ones from Zalando or Crunchy is a plus
Experience with tools in the Postgres ecosystem like pgbackrest, barman, Patroni, Stolon, etc.
5+ years experience in SRE/DevOps/Cloud Infrastructure
3+ years of experience in building with Golang
Experience in managing large deployments on AWS
Knowledge of networking
Experience with Infrastructure as Code tools

💡 Responsibilities

Help build the Supabase Postgres offering.
Focus on improving the reliability of database backups and recovery
Implement high availability with minimal downtime failover
Help operationalize database management for our users by implementing maintenance windows, blue-green deployments as part of database upgrades, etc.
Help users self-serve when debugging their databases by improving database observability.
Improve the performance of provisioned Postgres databases and expose knobs for our users to further tune their database performance
Improve our system architecture to reduce costs while balancing security and performance.
Design CI/CD systems to speed up deployments with proper change and release management processes.
Handle escalated storage support tickets and share the on-call responsibility for the storage service.

🪄 Skills: AWS, Backend Development, Docker, PostgreSQL, SQL, Kubernetes, CI/CD, Linux, DevOps, Terraform, Networking

🔥 Senior Site Reliability Engineer

Posted 15 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune · 👥 101-250

🔧 Requirements

Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
Experience with infrastructure-as-code and orchestration tools.
Strong understanding of system performance, debugging, and optimization across diverse environments.
Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
Solid foundation in computer science fundamentals and system design.
Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
Experience with distributed systems and managing large-scale, high-availability environments.
Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
Experience with bare-metal infrastructure.
Proficiency in scripting or programming languages such as Python, Go, or Bash.
Experience with monitoring and observability tools for infrastructure performance.
Familiarity with cloud cost management and performance improvement strategies.
Strong analytical and troubleshooting skills.
Experience working across multiple time zones.

💡 Responsibilities

Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

🪄 Skills: Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

🔥 Site Reliability Engineer

Posted 16 days ago

📍 AMER, EMEA, APAC

🧭 Full-Time

🔍 Software Development

🏢 Company: asymmetric.re

🔧 Requirements

5+ years of experience
Excellent experience managing Linux and network infrastructure.
Experience with load balancers and other high-availability technologies (e.g., HAProxy, ALB/ELB).
Prior experience with configuration management tooling (e.g., Ansible, Chef, Puppet, SaltStack).
Excellent troubleshooting fundamentals on both hardware and software.
Development experience in Golang, Python, or Rust.
Experience with continuous integration pipelines and automated deployments
Experience with OSS monitoring tools (e.g., Grafana, Loki, Prometheus, Alertmanager).

💡 Responsibilities

Manage a globally distributed fleet of blockchain infrastructure services
Deploy infrastructure-as-code changes to dev, staging, and production environments.
Work in a globally distributed high performing team to deliver mission-critical services to the financial sector.
Design, architect, deploy, and manage blockchain infrastructure services.
Adhere to the highest standards of integrity, trust, and professionalism.

🪄 Skills: AWS, Docker, Python, Blockchain, Cloud Computing, Kubernetes, Rust, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Networking, Troubleshooting, Ansible, Scripting

🔥 Sr. Site Reliability Engineer

Posted 27 days ago

📍 U.S., EU

🧭 Full-Time

🔍 Software Development

🏢 Company: AuthZed · 👥 11-50 · 💰 $12,000,000 Series A (9 months ago) · Information Technology, Cyber Security, Software

🔧 Requirements

Proven experience as a Site Reliability Engineer or in a similar role.
Strong understanding of networking, operating systems, and cloud infrastructure.
Experience with Site Reliability Engineering, System Design, and Distributed Computing.
Experience in various programming languages — we currently have SDKs for NodeJS, Java, Python, Ruby, and Go.
Experience with containerization technologies such as Docker and Kubernetes.
Knowledge of infrastructure-as-code tools like Terraform and Pulumi.
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Experience with lower-level implementation details of relational databases (bonus if you have experience with distributed SQL databases like Google Cloud Spanner or CockroachDB).
Experience working with Git and GitHub.
Experience with continuous integration and deployment systems.
Strong problem-solving and troubleshooting skills.
Excellent communication and collaboration abilities.

💡 Responsibilities

Design, implement, and maintain highly available and scalable infrastructure solutions for our projects, products, and customers.
Monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
Automate infrastructure deployment and configuration management processes.
Continuously improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
Troubleshoot and resolve complex infrastructure and application issues in production and test environments.
Collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
Participate in on-call rotation and respond to production incidents in a timely manner.
Document system configurations, troubleshooting procedures, and operational guidelines.

🪄 Skills: Docker, Python, SQL, Cloud Computing, Git, Java, Kubernetes, Go, Grafana, Prometheus, CI/CD, Problem Solving, Linux, Terraform, Networking, Troubleshooting, NodeJS, Scripting

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity · 👥 51-200 · 💰 Corporate (almost 3 years ago) · Software Development

🔧 Requirements

Proven experience with SRE/DevOps tools, processes, and culture.
Proficient in programming languages like Python, Go, and TypeScript.
5+ years of experience participating in an SRE on-call rotation.
Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
Strong database management skills, particularly with PostgreSQL.
Experience with infrastructure as code, using tools like Terraform.
Familiarity with observability tools like Prometheus and similar stacks.

💡 Responsibilities

Plan and implement a global platform for delivering our software as a service.
Diagnose and troubleshoot complex distributed systems.
Ensure observability and analyze the behavior of our stack.
Orchestration, deployment, monitoring, automation.
Participate in our on-call rotation.

🪄 Skills: PostgreSQL, Python, Cloud Computing, ElasticSearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

🔥 Principal Site Reliability Engineer - EMEA

Posted about 1 month ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Ashby · 👥 51-100 · 💰 $30,000,000 Series C (10 months ago) · Management Information Systems, Human Resources, Recruiting, Software

🔧 Requirements

You’ve built infrastructure at a slightly later stage than Ashby is at - you know how to deal with millions of data points, have seen great (or not great) infrastructure make or break customer experience, and have automated everything from provisioning to monitoring and release process.

💡 Responsibilities

Optimize our homegrown ultra-dynamic recruiting DSL-to-SQL compiler
Create automated guardrails for the security and privacy of our customer data
Help our developers ship features fast through canary deploys, gradual rollouts and feature flags, while keeping complexity manageable and reducing downtime
Work with the business and the engineering team to define SLOs and implement the corresponding SLIs.
Ensure all communication with external services supports retries and circuit-breakers.
Implement the infrastructure to support an event-driven architecture and data warehouse.

🪄 Skills: AWS, Backend Development, GraphQL, Node.js, PostgreSQL, Software Development, SQL, Cloud Computing, Git, Kubernetes, React.js, TypeScript, Algorithms, Data Structures, REST API, Redis, CI/CD, Linux, DevOps, Microservices, Scripting, Software Engineering, Debugging