Site Reliability Engineer

Posted 3 days ago

📍 Location: France

🏢 Company: Sinch

👥 1001-5000

💰 $48,845,918 Post-IPO Debt 7 months ago

Messaging, SaaS, Telecommunications, Mobile, Software

🗣️ Languages: English

🪄 Skills: Docker, PostgreSQL, Python, Bash, ElasticSearch, GCP, Kubernetes, Cassandra, Grafana, Prometheus, Linux, Terraform, Ansible

Requirements:
  • Background in infrastructure, operations, or software engineering.
  • Experience with cloud providers such as GCP.
  • Proficiency in configuration management tools such as Terraform and Ansible.
  • Hands-on proficiency with modern monitoring tools like Prometheus and Grafana.
  • Experience with distributed data stores such as Cassandra, PostgreSQL, and ElasticSearch.
  • Experience with Python and Bash is beneficial.
  • Strong technical skills across various infrastructure technologies.
  • Strong communication skills.
  • Experience operating and maintaining production systems in a Linux and public cloud environment.
Responsibilities:
  • Partner with product engineering teams to identify system requirements.
  • Build and support our cloud-based infrastructure.
  • Automate routine processes and remediation tasks.
  • Develop, monitor and track Service Level Objectives (SLOs) for the systems under management.
  • Proactively troubleshoot, resolve, and plan for issues that typically come from support staff, other engineering teams, and our automated monitoring system.
  • Ensure our datastores are healthy and operate at optimal performance levels.
  • Contribute to the growth and culture of our engineering team.

Related Jobs

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Platform.sh

👥 251-500

💰 $140,000,000 Series D almost 3 years ago

Internet, Open Source, PaaS, Cloud Management, Software

  • DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
  • Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
  • Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
  • Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
  • Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
  • Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
  • Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.
  • Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
  • Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
  • Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
  • Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
  • Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
  • Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
  • Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted 2 days ago

📍 France, Germany, Spain, the United Kingdom, West Coast in the United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Remote Woman

  • A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
  • Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
  • Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
  • Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
  • Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
  • Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
  • Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.
  • Refine Monitoring and Observability
  • Automate Deployments and Workflows
  • Optimize CI/CD Pipelines
  • Cloud Infrastructure Management
  • Incident Response and Post-Mortem
  • Collaborate with Cross-Functional Teams
  • Drive Technical Innovation

AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, Linux, DevOps, Terraform, Ansible, Scripting

Posted 3 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros

👥 11-50

💰 $17,000,000 about 2 years ago

Cryptocurrency

  • An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
  • Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
  • A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
  • Knowledge and experience in managing configuration at scale.
  • Experience with CI/CD pipelines and version control best practices.
  • Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
  • Strong knowledge of cloud security and IAM policies is required.
  • SIEM and threat management experience.
  • Must know how to secure Mac and Linux endpoints.
  • Python and Bash experience is a must.
  • Participate in on-call roster to support our trading operations.
  • Maintain and improve our global infrastructure with high performance and reliability requirements.
  • Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
  • Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
  • Development of internal tools and automation to accomplish the team's goals.
  • Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
  • Active participation in various trading and infrastructure projects.
  • Work closely with developers, traders and other staff to accomplish our firm's goals.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

Posted 4 days ago

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 129,347 - 200,824 USD per year

🔍 Software Development

🏢 Company: Wikimedia Foundation

👥 251-500

💰 $2,100,000 Grant about 5 years ago

  • 7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
  • Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
  • Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
  • Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
  • Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).
  • Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
  • Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
  • Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
  • Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
  • Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
  • Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

AWS, Docker, Python, Cloud Computing, ElasticSearch, Kubernetes, Machine Learning, MLFlow, Numpy, PyTorch, Grafana, Prometheus, Tensorflow, Linux, DevOps, Terraform, Ansible

Posted 9 days ago

📍 Worldwide

🧭 Full-Time

🔍 Software Development

🏢 Company: Supabase

👥 101-250

💰 $80,000,000 Series C 6 months ago

Database, Developer Tools, Artificial Intelligence (AI), Information Services, Information Technology, Software

  • Experience in designing multi-tenant database solutions, designing for failover, fault-tolerance, and disaster recovery
  • Experience with orchestrating stateful workloads at scale or having used a Postgres operator like the ones from Zalando or Crunchy is a plus
  • Experience with tools in the Postgres ecosystem like pgbackrest, barman, Patroni, Stolon, etc.
  • 5+ years experience in SRE/DevOps/Cloud Infrastructure
  • 3+ years of experience in building with Golang
  • Experience in managing large deployments on AWS
  • Knowledge of networking
  • Experience with Infrastructure as Code tools
  • Help build the Supabase Postgres offering.
  • Focus on improving the reliability of database backups and recovery
  • Implement high availability with minimal downtime failover
  • Help operationalize database management for our users by implementing maintenance windows, blue-green deployments as part of database upgrades, etc.
  • Help users self-serve when debugging their databases by improving database observability
  • Improve the performance of provisioned Postgres databases and expose knobs for our users to further tune their database performance
  • Improve our system architecture to reduce costs while balancing security and performance.
  • Design CI/CD systems to speed up deployments with proper change and release management processes.
  • Handle escalated storage support tickets and share on-call responsibility for the storage service.

AWS, Backend Development, Docker, PostgreSQL, SQL, Kubernetes, CI/CD, Linux, DevOps, Terraform, Networking

Posted 10 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune

👥 101-250

  • Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
  • Experience with infrastructure-as-code and orchestration tools.
  • Strong understanding of system performance, debugging, and optimization across diverse environments.
  • Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
  • Solid foundation in computer science fundamentals and system design.
  • Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
  • 5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
  • Experience with distributed systems and managing large-scale, high-availability environments.
  • Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
  • Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
  • Experience with bare-metal infrastructure.
  • Proficiency in scripting or programming languages such as Python, Go, or Bash.
  • Experience with monitoring and observability tools for infrastructure performance.
  • Familiarity with cloud cost management and performance improvement strategies.
  • Strong analytical and troubleshooting skills.
  • Experience working across multiple time zones.
  • Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
  • Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
  • Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
  • Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
  • Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

Posted 15 days ago

📍 AMER, EMEA, APAC

🧭 Full-Time

🔍 Software Development

🏢 Company: asymmetric.re

  • 5+ years of experience
  • Excellent experience managing Linux and network infrastructure.
  • Experience with load balancers and other high-availability technologies (e.g., HAProxy, ALB/ELB, etc.)
  • Prior experience with configuration management tooling (e.g., Ansible, Chef, Puppet, SaltStack, etc.)
  • Excellent troubleshooting fundamentals on both hardware and software.
  • Development experience in Golang, Python, or Rust.
  • Experience with continuous integration pipelines and automated deployments
  • Experience with OSS monitoring tools (e.g., Grafana, Loki, Prometheus, Alertmanager)
  • Manage a globally distributed fleet of blockchain infrastructure services
  • Deploy infrastructure as code to dev, staging, and production environments
  • Work in a globally distributed, high-performing team to deliver mission-critical services to the financial sector.
  • Design, architect, deploy, and manage blockchain infrastructure services.
  • Adhere to the highest standards of integrity, trust, and professionalism.

AWS, Docker, Python, Blockchain, Cloud Computing, Kubernetes, Rust, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Networking, Troubleshooting, Ansible, Scripting

Posted 16 days ago

📍 U.S., EU

🧭 Full-Time

🔍 Software Development

🏢 Company: AuthZed

👥 11-50

💰 $12,000,000 Series A 9 months ago

Information Technology, Cyber Security, Software

  • Proven experience as a Site Reliability Engineer or in a similar role.
  • Strong understanding of networking, operating systems, and cloud infrastructure.
  • Experience with Site Reliability Engineering, System Design, and Distributed Computing.
  • Experience in various programming languages (we currently have SDKs for NodeJS, Java, Python, Ruby, and Go).
  • Experience with containerization technologies such as Docker and Kubernetes.
  • Knowledge of infrastructure-as-code tools like Terraform and Pulumi.
  • Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Experience with lower-level implementation details of relational databases (bonus if you have experience with distributed SQL databases like Google Cloud Spanner or CockroachDB).
  • Experience working with Git and GitHub.
  • Experience with continuous integration and deployment systems.
  • Strong problem-solving and troubleshooting skills.
  • Excellent communication and collaboration abilities.
  • Design, implement, and maintain highly available and scalable infrastructure solutions for our projects, products, and customers.
  • Monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
  • Automate infrastructure deployment and configuration management processes.
  • Continuously improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
  • Troubleshoot and resolve complex infrastructure and application issues in production and test environments.
  • Collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
  • Participate in on-call rotation and respond to production incidents in a timely manner.
  • Document system configurations, troubleshooting procedures, and operational guidelines.

Docker, Python, SQL, Cloud Computing, Git, Java, Kubernetes, Go, Grafana, Prometheus, CI/CD, Problem Solving, Linux, Terraform, Networking, Troubleshooting, NodeJS, Scripting

Posted 27 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity

👥 51-200

💰 Corporate almost 3 years ago

Software Development

  • Proven experience with SRE/DevOps tools, processes, and culture.
  • Proficient in programming languages like Python, Go, and TypeScript.
  • 5+ years of experience participating in an SRE on-call rotation.
  • Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
  • Strong database management skills, particularly with PostgreSQL.
  • Experience with infrastructure as code, using tools like Terraform.
  • Familiarity with observability tools like Prometheus and similar stacks.
  • Plan and implement a global platform for delivering our software as a service.
  • Diagnose and troubleshoot complex distributed systems.
  • Ensure observability and analyze the behavior of our stack.
  • Orchestration, deployment, monitoring, automation.
  • Participate in our on-call rotation.

PostgreSQL, Python, Cloud Computing, ElasticSearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

Posted about 1 month ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Ashby

👥 51-100

💰 $30,000,000 Series C 10 months ago

Management Information Systems, Human Resources, Recruiting, Software

You've built infrastructure at a slightly later stage than Ashby is at - you know how to deal with millions of data points, have seen great (or not great) infrastructure make or break customer experience, and have automated everything from provisioning to monitoring and release process.
  • Optimize our homegrown ultra-dynamic recruiting DSL-to-SQL compiler
  • Create automated guardrails for the security and privacy of our customer data
  • Help our developers ship features fast through canary deploys, gradual rollouts and feature flags, while keeping complexity manageable and reducing downtime
  • Work with the business and the engineering team to define SLOs and implement the corresponding SLIs.
  • Ensure all communication with external services supports retries and circuit-breakers.
  • Implement the infrastructure to support an event-driven architecture and data warehouse.

AWS, Backend Development, GraphQL, Node.js, PostgreSQL, Software Development, SQL, Cloud Computing, Git, Kubernetes, React.js, TypeScript, Algorithms, Data Structures, REST API, Redis, CI/CD, Linux, DevOps, Microservices, Scripting, Software Engineering, Debugging

Posted about 1 month ago