Senior Site Reliability Engineer

Posted about 1 month ago

💎 Seniority level: Senior, 5+ years

📍 Location: Europe

🔍 Industry: Software Development

🏢 Company: Sanity · 👥 51-200 · 💰 Corporate almost 3 years ago · Software Development

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: PostgreSQL, Python, Cloud Computing, ElasticSearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

Requirements:
  • Proven experience with SRE/DevOps tools, processes, and culture.
  • Proficient in programming languages like Python, Go, and TypeScript.
  • 5+ years of experience participating in an SRE on-call rotation.
  • Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
  • Strong database management skills, particularly with PostgreSQL.
  • Experience with infrastructure as code, using tools like Terraform.
  • Familiarity with observability tools like Prometheus and similar stacks.
Responsibilities:
  • Plan and implement a global platform for delivering our software as a service.
  • Diagnose and troubleshoot complex distributed systems.
  • Ensure observability and analyze the behavior of our stack.
  • Orchestration, deployment, monitoring, automation.
  • Participate in our on-call rotation.

Related Jobs

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros · 👥 11-50 · 💰 $17,000,000 about 2 years ago · Cryptocurrency

  • An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
  • Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
  • A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
  • Knowledge and experience in managing configuration at scale.
  • Experience with CI/CD pipelines and version control best practices.
  • Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
  • Strong knowledge of cloud security and IAM policies is required.
  • SIEM and threat management experience.
  • Must know how to secure Mac and Linux endpoints.
  • Python and Bash experience is a must.
  • Participate in the on-call roster to support our trading operations.
  • Maintain and improve our global infrastructure with high performance and reliability requirements.
  • Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
  • Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
  • Development of internal tools and automation to accomplish the team's goals.
  • Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
  • Active participation in various trading and infrastructure projects.
  • Work closely with developers, traders and other staff to accomplish our firm's goals.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

Posted 10 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune · 👥 101-250

  • Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
  • Experience with infrastructure-as-code and orchestration tools.
  • Strong understanding of system performance, debugging, and optimization across diverse environments.
  • Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
  • Solid foundation in computer science fundamentals and system design.
  • Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
  • 5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
  • Experience with distributed systems and managing large-scale, high-availability environments.
  • Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
  • Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
  • Experience with bare-metal infrastructure.
  • Proficiency in scripting or programming languages such as Python, Go, or Bash.
  • Experience with monitoring and observability tools for infrastructure performance.
  • Familiarity with cloud cost management and performance improvement strategies.
  • Strong analytical and troubleshooting skills.
  • Experience working across multiple time zones.
  • Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
  • Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
  • Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
  • Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
  • Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

Posted 21 days ago

📍 Germany, Spain, Portugal

🏢 Company: Jobgether · 👥 11-50 · 💰 $1,493,585 Seed about 2 years ago · Internet

  • 5+ years of experience in a Site Reliability Engineer or similar role.
  • 3+ years of experience with AWS services and container orchestration tools.
  • 2+ years of Kubernetes experience.
  • Strong knowledge of observability tools and principles (monitoring, logging, tracing).
  • Hands-on experience with Terraform for infrastructure as code.
  • Proficiency in at least one programming language (e.g., Python, Go, Java).
  • Experience in incident management, postmortem analysis, and risk mitigation.
  • Familiarity with messaging systems like SNS, SQS, and experience with CI/CD tools.
  • Develop and maintain systems that are reliable, scalable, and efficient.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure optimal system performance.
  • Conduct blameless post-incident reviews, identify root causes, and implement preventive actions.
  • Automate operational tasks, incident responses, and contribute to system performance optimizations.
  • Work with engineering teams to ensure systems are designed for reliability, scalability, and maintainability.
  • Continuously evaluate and improve system performance, capacity, and cost efficiency.
  • Participate in the on-call rotation, providing troubleshooting and resolution support for critical issues.

AWS, Python, Java, Kubernetes, Go, CI/CD, RESTful APIs, Linux, Terraform, Scripting

Posted 22 days ago

📍 Cyprus, Montenegro, Georgia, Serbia, Poland

🔍 Software Development

🏢 Company: CloudLinux

  • Strong background in development: the ideal candidate started their career as a developer, then moved to large-scale infrastructure projects.
  • Proven experience as a leading SRE or in a similar role, with a strong focus on Linux environments.
  • Proficiency in modern agile SDLC practices and principles, orchestration, and CI/CD tooling, e.g., Python, Java, Terraform, Ansible, CloudFormation, Puppet, Chef, or similar.
  • Knowledge of the Grafana ecosystem or similar tools, including building dashboards, alert rules, and PromQL queries, as well as frontend observability.
  • Excellent technical knowledge of IT Infrastructure, including network and application load balancers, switches, routers, and IP addressing.
  • Strong analytical and problem-solving skills with a focus on root cause analysis and mitigation.
  • Excellent communication and teamwork skills with the ability to collaborate effectively across engineering teams.
  • As a first assignment, design, implement, and manage scalable, resilient, and secure company-wide repository infrastructure for CloudLinux products.
  • Automate software operations for re-usability and consistency across private and public clouds, taking into consideration the complexities of distributed systems.
  • Monitor system performance and troubleshoot issues proactively to ensure optimal uptime and reliability.
  • Automate deployment processes using Infrastructure as Code (IaC) principles.
  • Share your experience, know-how, and best practices with other team members in design sessions, system architecture discussions, mentorship, and "doing work together".

Python, Bash, Cloud Computing, Kubernetes, Nginx, Grafana, Prometheus, Release Management, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted 23 days ago

📍 United Kingdom

🧭 Full-Time

🔍 Software Development

🏢 Company: StarRez · 👥 251-500 · 💰 Private about 3 years ago · Consulting, SaaS, Property Management, Software

  • 1+ years of experience working on a SaaS platform.
  • Proven experience (2+ years) in a Platform Engineering, Site Reliability Engineering or Software Engineering role.
  • Proficiency in at least one object-oriented programming language (C# preferred).
  • Production experience operating containerization technologies (Kubernetes).
  • Proficiency with one or more public cloud providers such as Azure, AWS or GCP
  • Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
  • Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
  • Experience with monitoring, observability and logging tools such as DataDog, Prometheus, Grafana, or similar.
  • Proven track record of maintaining highly-available and performant production environments.
  • Ability to identify and implement effective mitigation strategies and operational playbooks.
  • Provide technical leadership and mentoring within the team through knowledge sharing sessions, pair programming, code reviews and solution design
  • Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
  • Implement and maintain monitoring/alerting/logging systems to identify and respond to incidents
  • Conduct/participate in Root Cause Analyses (RCAs) and blameless post-mortems
  • Participate in on-call rotations to ensure system reliability and rapid incident response.
  • Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth
  • Conduct performance tests to identify and remediate bottlenecks
  • Develop and maintain platform solutions, automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
  • Monitor, review and tune databases to ensure high availability and performance
  • Collaborate with product engineering teams to design/build fit-for-purpose and observable software
  • Contribute and collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs) as required

AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

Posted 24 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy · 👥 5001-10000 · 💰 $800,000,000 Post-IPO Equity over 3 years ago · 🫂 Last layoff over 1 year ago · Web Hosting, Domain Registrar, Web Development, Online Portals

  • A track record of delivering capabilities that build customer value and business impact.
  • Knowledge of principles for building performant and quality REST APIs.
  • Experience with testing code; care and feeding of both on-premises and cloud compute systems; Docker and other container-related technologies; Python or similar languages; and HashiCorp Vault or similar tooling.
  • Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
  • Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
  • Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
  • Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

Posted about 1 month ago

📍 United Kingdom

🧭 Contract

🔍 SaaS platform accelerating digital transformation in the restaurant industry

  • 5+ years of professional experience building scalable, efficient, and resilient systems.
  • Experience with monitoring tools like Datadog, Sumo Logic, Raygun, New Relic, Grafana, CloudWatch, and Splunk SignalFx.
  • Fluency in Incident Management using tools such as FireHydrant, OpsGenie, PagerDuty, VictorOps, or similar.
  • Experience with build and deploy tools (e.g., Jenkins, TeamCity, Octopus, or CircleCI).
  • Prior hands-on software development experience.
  • Guide the full reliability lifecycle, from observability and SLIs/SLOs through incident response to postmortems and follow-up actions.
  • Implement and tailor our incident response tools to minimize outage durations.
  • Build collaborative monitoring solutions with members across multiple product teams.
  • Contribute insights across teams to help us improve or re-architect existing systems to support scale, performance and extensibility.
  • Rethink our observability tooling to improve architecture, knowledge models, user experience, performance and stability.
  • Analyze and mature our processes around Incident Response, Observability, Postmortems and Predictive Monitoring.
  • Influence an engineering culture of reliability, observability, and availability.
  • Participate in an Incident Commander on-call rotation, driving remediation efforts during incidents across our platform to improve the user experience.
  • Mentor engineering teams through game days, SRE boot camps and other training and feedback channels.

AWS, Docker, Python, SQL, CI/CD, DevOps, Microservices

Posted 2 months ago

📍 Poland, Germany, United Kingdom

🔍 Artificial Intelligence and Data Science

🏢 Company: Mozn

  • BSc/BA in Computer Engineering, Computer Science, or related discipline.
  • 5 years of experience in a similar position (SRE, DevOps, or infrastructure engineering).
  • Professional certifications are appreciated.
  • Solid experience with container runtimes and orchestrators: Docker and Kubernetes.
  • Experience with at least one major cloud provider: AWS, Azure, GCP, or Oracle.
  • Preferred programming languages for infrastructure as code: Python and Golang.
  • Experience with Linux servers and competency in Bash scripting.
  • Experience with Infrastructure as Code.
  • Experience with automating deployment pipelines.
  • Solid foundation in networking.
  • Knowledge of big data platforms like Kafka, Hadoop, and Spark is a plus.
  • Knowledge of SQL and SQL database management is a plus.
  • Knowledge of Terraform or Ansible is a plus.
  • Mixture of software engineering, system architecture design, and operation.
  • Attend morning meetings and sprint planning as an SRE team member.
  • Help design, build, support, and scale cloud and on-premise infrastructure.
  • Implement monitoring, alerting, and debugging for infrastructure.
  • Design and implement CI/CD workflows with best practices.
  • Maintain data stores including load monitoring and backup plans.
  • Collaborate with other departments to address their use cases.
  • Explore new technologies to improve the current stack.
  • Install and configure servers and network equipment using Infrastructure as Code.
  • Practice sustainable incident response and blameless postmortems.

AWS, Docker, Python, SQL, Bash, Hadoop, Kafka, Kubernetes, Spark, CI/CD, Terraform, Ansible

Posted 2 months ago

📍 Spain

🧭 Full-Time

🔍 Mobility services

🏢 Company: Cabify · 👥 1001-5000 · 💰 $16,473,668 Debt Financing about 1 year ago · Internet, Logistics, Ride Sharing, Transportation, Mobile

  • Strong knowledge of Unix, networking stack, OSI model, containers, and monitoring.
  • Programming skills in at least one language; capability to learn others.
  • Natural tendency to automate tasks.
  • Effective and asynchronous communication skills.
  • Care for the company, team, and self.
  • Embrace diversity and humility.
  • Action-oriented and iterative problem solving.
  • Preference for simplicity over complexity.
  • Ability to identify and address bottlenecks.
  • Proficiency in English communication.
  • Evolving our infrastructure platform by building self-service components.
  • Working closely with Product and Infrastructure teams to develop infrastructure components.
  • Designing and implementing tooling for service availability, scalability, observability, and latency improvements.
  • Increasing reliability awareness with teams and reviewing implementations.
  • Defining SLIs, SLOs and SLAs as part of services' lifecycle.
  • Sharing an on-call schedule for owned platform services.
  • Solving problems in a highly available platform and building automations to prevent incidents.
  • Participating in the recruiting process to grow the engineering team.

AWS, AWS EKS, Kubernetes, Microservices, Networking

Posted 4 months ago

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

  • Proficiency in programming languages such as Python, Go, and JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, ElasticSearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 5 months ago