Senior Site Reliability Engineer

Posted 2024-11-07

💎 Seniority level: Senior

📍 Location: Germany, Sweden, United Kingdom, Spain, Poland, Austria, CET, +/- 2 hours

🔍 Industry: Video Games

🪄 Skills: Leadership, Project Management, Project Coordination, Cross-functional Team Leadership, Operations Management

Requirements:
  • Experience in online operations support
  • Ability to work closely with production and architecture teams
  • Strong collaboration and communication skills
Responsibilities:
  • Serve as liaison between various development teams and the network operations team
  • Collaborate closely with the production team and system architect
  • Ensure that projects related to Hunt: Showdown are well planned, documented, and implemented
  • Handle operational and project duties

Related Jobs

📍 Poland

🔍 Software

Posted 2024-11-21

📍 LATAM

🔍 AI developer tools

NOT STATED

  • Report to the Enterprise Engineering Manager.
  • Responsible for setting up and maintaining infrastructure standards.
  • Play a pivotal role in tool development externally and internally.
  • Enable deployment of software to enterprise customers.
  • Establish robust technical excellence for a diversified customer base.
  • Manage variances in infrastructure types and implement suitable solutions.
  • Provide high-quality solutions to customers.

🪄 Skills: Leadership, Cloud Computing, Git, Kubernetes, Cross-functional Team Leadership, Communication Skills, Analytical Skills

Posted 2024-11-10

📍 Poland

🧭 Full-Time

🔍 Revenue intelligence

Requirements:
  • Experience in site reliability engineering, particularly with cloud infrastructure.
  • Ability to collaborate effectively with global engineering teams.
  • Strong problem-solving skills to address high-demand workloads.
  • Familiarity with automation tools and practices to enhance system reliability.

Responsibilities:
  • Define and design solutions for Clari to meet business needs regarding availability, performance, and reliability.
  • Influence system design and ensure the systems are resilient and fault-tolerant.
  • Optimize systems to improve efficiency and support a fast rate of innovation.
  • Set technical direction and lead solutions to large-scale problems.
  • Eliminate toil through automation and drive resiliency across engineering teams.

🪄 Skills: AWS, Leadership, Agile, Cloud Computing, Cross-functional Team Leadership, Amazon Web Services, Communication Skills, Collaboration

Posted 2024-11-07

📍 US, Europe

🧭 Full-Time

💸 175000 - 210000 USD per year

🔍 Cloud computing, AI

🏢 Company: CoreWeave

Requirements:
  • You have 5+ years of experience in the software or infrastructure engineering industry.
  • Experience with Python, Go or another scripting language.
  • Experience containerizing applications and/or using Kubernetes to manage deployments.
  • Experience with Git.
  • Experience with Linux shell scripting and/or navigating a *nix-based operating system.
  • Experience creating and maintaining GitHub Actions to automate workflows.
  • You have experience deploying services in production and are interested in learning reliability-at-scale engineering concepts.
  • You have experience refining SDLC, doing code reviews, and providing technical support.

Responsibilities:
  • Design and implement services and tools to reduce friction and toil in the lives of our engineering and operations teams.
  • Streamline repetitive tasks and eliminate bottlenecks to improve development velocity with automated workflows and processes.
  • Partner with developers to understand their pain points and develop tailored solutions that enhance their productivity.
  • Champion best practices and advocate for new tools and technologies to drive ongoing productivity gains.
  • Tackle complex issues related to build systems, testing frameworks, code analysis, and other developer tooling.
  • Enable and evangelize the practice of reliability engineering across CoreWeave's engineering teams.

🪄 Skills: Python, Software Development, Git, Kubernetes, *Nix, Go, Collaboration

Posted 2024-11-07

📍 United Kingdom

💸 65000 - 80000 GBP per year

🔍 Online marketplace

🏢 Company: OnBuy

Requirements:
  • Proven experience as a Senior Site Reliability Engineer or in a similar role.
  • Strong proficiency in programming languages such as Python, Go, or Java.
  • Experience with cloud service providers (AWS, Azure, Google Cloud) and container orchestration tools (Kubernetes, Docker).
  • Solid understanding of networking, distributed systems, and microservices architecture.
  • Familiarity with monitoring and logging tools (New Relic, Prometheus, Grafana, ELK stack, GCP logging).
  • Excellent problem-solving skills and ability to work effectively in a team.
  • Strong communication and interpersonal skills for collaboration with cross-functional teams.

Responsibilities:
  • Design and implement scalable systems to ensure high availability and performance.
  • Develop automated solutions for monitoring, scaling, and system health management.
  • Collaborate with software development teams to identify and resolve reliability issues.
  • Create and maintain documentation related to system architecture, processes, and configurations.
  • Perform incident response and postmortem analysis to improve site reliability and performance.
  • Monitor system performance and make necessary adjustments to ensure optimal functionality.
  • Implement and manage infrastructure as code using tools like Terraform or Ansible.

🪄 Skills: AWS, Docker, Python, Software Development, GCP, Java, Kubernetes, Azure, Go, Grafana, Prometheus, DevOps, Terraform, Documentation, Microservices

Posted 2024-11-07

📍 Spain

🧭 Full-Time

💸 $72,000 - $99,000 per year

🔍 Mobility

Requirements:
  • Think Unix: you know the networking stack, the OSI model, containers (and schedulers), and you know your way around monitoring, logging and the CAP theorem (bonus!).
  • Have strong programming skills in at least one language, and know your way around a few more or can learn them if the opportunity arises.
  • Automate yourself out of everything by nature, making machines do the toil.
  • Communicate effectively and asynchronously.
  • Care about the things that affect the company, your team, and yourself.
  • Embrace diversity and humbleness (and a bit of trolling).
  • Prefer taking iterative action over waiting for things to happen or to be perfect.
  • Strongly favor simplicity over complexity, i.e., adhere to the KISS principle.
  • Have a sense for identifying, exploiting and elevating bottlenecks.
  • Are not afraid of expressing yourself in English. We aren't expecting you to have the Queen's accent, but you'll be part of an international team and we communicate in English, so you should be comfortable with that.
  • Enjoy herding cats and shaving yaks, i.e., being a great influence on other product teams and teaching them best practices, as well as analyzing and simplifying our setup.

Responsibilities:
  • Evolving our infrastructure platform by building self-service components that will be used by the entire engineering team and by millions of users around the world.
  • Working closely with our Product and Infrastructure teams to architect and develop world-class infrastructure components.
  • Designing and implementing tooling to improve the availability, scalability, observability and latency of our services, which are used by internal customers to deploy and operate their services.
  • Increasing reliability awareness with other teams, helping with the adoption of reliability principles and reviewing observability implementations or software architectures.
  • Defining SLIs, SLOs and SLAs as part of the services' lifecycle.
  • Sharing an on-call schedule for the platform services you own.
  • Solving problems in our highly available platform together with other teams, then building automation to prevent incidents from happening again.
  • Participating in our recruiting process to help grow our engineering team.

🪄 Skills: AWS, AWS EKS, Kubernetes

Posted 2024-09-19

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, Technology

Requirements:
  • Proficiency in automation, programming, and scripting.
  • Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.) as well as modern observability infrastructure (Prometheus, Grafana, Logstash/Kibana, Icinga/Nagios, etc.).
  • Advanced knowledge of Linux and IO/data storage concepts, internals and troubleshooting.
  • Experience remotely managing both bare-metal servers and virtualized environments.
  • 5+ years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience with high-traffic and highly available website architectures and operations.
  • Strong English language skills.
  • Ability to work independently in a fast-paced environment, as an effective part of a globally distributed team, using ticket tracking systems and asynchronous communication tools.
  • B.Sc. or M.Sc. in Computer Science or equivalent work experience.

Responsibilities:
  • Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments.
  • Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution.
  • Improving observability (alerting, metrics, monitoring) of database infrastructure.
  • Multi-datacenter systems design, capacity and infrastructure planning.
  • Taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia's production infrastructure and participating in an on-call rotation.

🪄 Skills: SQL, Kibana, C (Programming language), Cassandra, Grafana, Prometheus, Redis

Posted 2024-08-28

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, knowledge sharing

Requirements:
  • Proficiency in automation, programming, and scripting
  • Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.)
  • Advanced knowledge of Linux and IO/data storage concepts
  • Experience remotely managing both bare-metal servers and virtualized environments
  • 5+ years of experience in an SRE/Operations/DevOps role
  • Experience with high-traffic and highly available website architectures
  • Strong English language skills
  • Ability to work independently in a fast-paced environment

Responsibilities:
  • Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments
  • Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution
  • Improving observability of database infrastructure
  • Designing multi-datacenter systems, capacity planning, and infrastructure planning
  • Participating in incident response and on-call rotation for system outages or alerts

🪄 Skills: SQL, Kibana, C (Programming language), Cassandra, Grafana, Prometheus, Redis, Linux

Posted 2024-08-28

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109047 - 169455 USD per year

🔍 Nonprofit / Technology

Requirements:
  • At least two years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience supporting high-availability distributed production systems.
  • Experience with database administration and support.
  • Comfortable with configuration management and orchestration tools (e.g., Puppet, Ansible, Chef, SaltStack).
  • Knowledge of modern observability infrastructure (monitoring, metrics, and logging).
  • Proficient in shell and scripting languages such as Python, Go, Bash, Ruby.
  • Good understanding of Linux/Unix fundamentals and debugging skills.
  • Excellent written and verbal communication skills.
  • BS or MS degree in Computer Science or equivalent work experience.

Responsibilities:
  • Deploy, configure, and maintain the distributed data systems that comprise our data and analytics platform.
  • Implement data quality monitoring that alerts the team to possible data issues.
  • Collaborate closely with the Fundraising team to integrate and use data from self-hosted and third-party sources.
  • Provide engineering support during high-traffic or critical campaigns.
  • Write and update internal documentation of systems and processes.
  • Ensure compliance with regulations like the Donor Privacy Policy, GDPR, and PCI DSS.
  • Create and manage users and permissions for data access control.
  • Advise on data input best practices and develop processes for data entry consistency.
  • Work closely with Fundraising Analytics to gather and prioritize data enhancement requests.

🪄 Skills: Python, Bash, Ruby, C (Programming language), Data engineering, Go, Communication Skills, Collaboration

Posted 2024-08-22