Senior Site Reliability Engineer

Posted 2024-09-20

💎 Seniority level: Senior, At least 4 years experience in maintaining Cloud infrastructure with modern technologies, at least 1 year experience in maintaining Web3 related infrastructure.

📍 Location: Switzerland

🔍 Industry: Web3 and blockchain infrastructure

🏢 Company: Gelato

🗣️ Languages: English

⏳ Experience: At least 4 years experience in maintaining Cloud infrastructure with modern technologies, at least 1 year experience in maintaining Web3 related infrastructure.

🪄 Skills: AWS, Docker, PHP, Python, Software Development, Ethereum, GCP, Git, Kubernetes, TypeScript, Azure, Go, Grafana, Prometheus, Rust, CI/CD, Microservices

Requirements:
  • At least 4 years experience in maintaining Cloud infrastructure with modern technologies
  • At least 1 year experience in maintaining Web3 related infrastructure
  • GitOps principles at heart
  • Ability to lead and positively influence peers in the decision-making process
  • Ability to maintain high performance and accuracy in rapidly changing and evolving work settings
  • Experience in operating infrastructure on at least one major Cloud provider (GCP, AWS, Azure)
  • Experience with Docker and containerized applications
  • Experience with Unix based systems
  • Experience in operating and optimizing Kubernetes clusters
  • Experience with Git, Helm, Terraform, Kubectl and similar
  • Experience in networking, CDN, Gateways and deployment strategies
  • Experience in operating highly available infrastructure
  • Understanding of microservice based architecture and operations
  • Experience in advanced debugging, logging, monitoring and alerting using tools such as Prometheus, Grafana, Splunk, Datadog
  • Experience in implementing and maintaining cost optimized solutions
  • Experience with at least one programming language (e.g. Go, Python, Rust, PHP, TypeScript) and demonstrated capability in software development
  • Understanding of Web3 technologies and related challenges, including Rollups-as-a-Service (RaaS)
  • Eager to learn and grow professionally
  • Fluent in English (spoken and written)
Responsibilities:
  • Maintain and operate Gelato infrastructure in a multi-cloud environment
  • Contribute to improving our incident management lifecycle for overall reliability
  • Contribute to improving our Postmortem philosophy
  • Contribute to improving our DevOps culture
  • Deploy and maintain Rollups-as-a-Service (RaaS) core components and related observability stacks
  • Evaluate and modernize our existing infrastructure and deployment strategies to align with the latest industry standards
  • Maintain and enhance our CI/CD pipeline and its governance
  • Join the on-call rotation to provide operational support and ensure service availability
  • Participate in and conduct regular team meetings
  • Provide insights and recommendations on system design and scalability, focusing on reliability, security and efficiency in a Web3 context
  • Be an active team member by always looking out for cost-effective, innovative solutions and by facilitating the adoption of industry standards

Related Jobs

📍 US, Europe

🧭 Full-Time

💸 175000 - 210000 USD per year

🔍 Cloud computing, AI

🏢 Company: CoreWeave

  • You have 5+ years of experience in the software or infrastructure engineering industry.
  • Experience with Python, Go or another scripting language.
  • Experience containerizing applications and/or using Kubernetes to manage deployments.
  • Experience with Git.
  • Experience with Linux shell scripting and/or navigating a *nix-based operating system.
  • Experience creating and maintaining GitHub Actions to automate workflows.
  • You have experience deploying services in production and are interested in learning reliability-at-scale engineering concepts.
  • You have experience refining SDLC, doing code reviews, and providing technical support.

  • Design and implement services and tools to reduce friction and toil in the lives of our engineering and operations teams.
  • Streamline repetitive tasks and eliminate bottlenecks to improve development velocity with automated workflows and processes.
  • Partner with developers to understand their pain points and develop tailored solutions that enhance their productivity.
  • Champion best practices and advocate for new tools and technologies to drive ongoing productivity gains.
  • Tackle complex issues related to build systems, testing frameworks, code analysis, and other developer tooling.
  • Enable and evangelize the practice of reliability engineering across CoreWeave's engineering teams.

Python, Software Development, Git, Kubernetes, *Nix, Go, Collaboration

Posted 2024-11-07

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, Technology

  • Proficient in automation, programming, and scripting.
  • Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.) as well as modern observability infrastructure (Prometheus, Grafana, Logstash/Kibana, Icinga/Nagios, etc.).
  • Advanced knowledge of Linux and IO/data storage concepts, internals and troubleshooting.
  • Experience remotely managing both bare-metal servers and virtualized environments.
  • 5+ years experience in an SRE/Operations/DevOps role as part of a team.
  • Experience with high traffic and highly available website architectures and operations.
  • Strong English language skills.
  • Ability to work independently in a fast-paced environment and as an effective part of a globally distributed team, using ticket tracking systems and asynchronous communication tools.
  • B.Sc. or M.Sc. in Computer Science or equivalent work experience.

  • Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments.
  • Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution.
  • Improving observability (alerting, metrics, monitoring) of database infrastructure.
  • Multi-datacenter systems design, capacity and infrastructure planning.
  • Taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia's production infrastructure and participating in an on call rotation.

SQL, Kibana, C (Programming language), Cassandra, Grafana, Prometheus, Redis

Posted 2024-08-28

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, knowledge sharing

  • Proficient in automation, programming, and scripting
  • Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.)
  • Advanced knowledge of Linux and IO/data storage concepts
  • Experience remotely managing both bare-metal servers and virtualized environments
  • 5+ years experience in an SRE/Operations/DevOps role
  • Experience with high traffic and highly available website architectures
  • Strong English language skills
  • Ability to work independently in a fast-paced environment

  • Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments
  • Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution
  • Improving observability of database infrastructure
  • Designing multi-datacenter systems, capacity planning, and infrastructure planning
  • Participating in incident response and on-call rotation for system outages or alerts

SQL, Kibana, C (Programming language), Cassandra, Grafana, Prometheus, Redis, Linux

Posted 2024-08-28

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109047 - 169455 USD per year

🔍 Nonprofit / Technology

  • At least two years experience in an SRE/Operations/DevOps role as part of a team.
  • Experience supporting high availability distributed production systems.
  • Experience with database administration and support.
  • Comfortable with configuration management and orchestration tools (e.g., Puppet, Ansible, Chef, SaltStack).
  • Knowledge of modern observability infrastructure (monitoring, metrics, and logging).
  • Proficient in shell and scripting languages such as Python, Go, Bash, Ruby.
  • Good understanding of Linux/Unix fundamentals and debugging skills.
  • Excellent written and verbal communication skills.
  • BS or MS degree in Computer Science or equivalent work experience.

  • Deploy, configure and maintain the distributed data systems that comprise our data and analytics platform.
  • Implement data quality monitoring that alerts the team of possible data issues.
  • Collaborate closely with the Fundraising team to integrate and use data from self-hosted and third-party sources.
  • Provide engineering support during high-traffic or critical campaigns.
  • Write and update internal documentation of systems and processes.
  • Ensure compliance with regulations like the Donor Privacy Policy, GDPR, and PCI DSS.
  • Create and manage users and permissions for data access control.
  • Advise on data input best practices and develop processes for data entry consistency.
  • Work closely with Fundraising Analytics to gather and prioritize data enhancement requests.

Python, Bash, Ruby, C (Programming language), Data engineering, Go, Communication Skills, Collaboration

Posted 2024-08-22