Staff Site Reliability Engineer

Posted 5 months ago · Inactive

💎 Seniority level: Staff, 6+ years

📍 Location: Poland

🏢 Company: neptune.ai · 👥 51-100 · 💰 $8,000,000 Series A almost 3 years ago · Internet, Artificial Intelligence (AI), Analytics, Information Technology, Software

🗣️ Languages: English

⏳ Experience: 6+ years

🪄 Skills: Python, ElasticSearch, GCP, JVM, Kafka, Kotlin, Kubernetes, Microsoft Azure, MySQL, Azure, Clickhouse, Redis, Rust, Communication Skills, Collaboration, CI/CD, Linux, DevOps, Terraform, Documentation, Compliance

Requirements:
  • 6+ years in SRE, DevOps, or related roles.
  • Strong experience managing and optimizing Kubernetes clusters.
  • Proven expertise in designing and implementing automation solutions, including Terraform and Helm.
  • Strong programming skills in Shell and Python.
  • Extensive experience with Linux system administration and network management.
  • Expertise in managing distributed computing systems.
  • Fluency in English with solid communication skills.
Responsibilities:
  • Own the site reliability process and systems through design, implementation, deployment, and maintenance.
  • Ensure scalability, resilience, and performance of solutions across SaaS and client-hosted environments.
  • Design and implement automation workflows to streamline operations.
  • Ensure security and compliance of infrastructure and processes.
  • Collaborate with cross-functional teams on requirements and solutions.
  • Document architecture and operational procedures.
  • Participate in on-call rotations for incident management.

Related Jobs

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 $129,347 - $200,824 USD per year

🔍 Software Development

🏢 Company: Wikimedia Foundation · 👥 251-500 · 💰 $2,100,000 Grant over 5 years ago

Requirements:
  • 7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
  • Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
  • Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
  • Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
  • Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).
Responsibilities:
  • Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
  • Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
  • Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
  • Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
  • Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
  • Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

AWS, Docker, Python, Cloud Computing, ElasticSearch, Kubernetes, Machine Learning, MLFlow, Numpy, PyTorch, Grafana, Prometheus, Tensorflow, Linux, DevOps, Terraform, Ansible

Posted 13 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: xLabs · 👥 1-10 · Education, Digital Marketing, E-Learning, Training

  • Experience with Infrastructure-as-Code and GitOps.
  • Experience with workload orchestration solutions, such as (but not limited to) HashiCorp Nomad or Kubernetes.
  • Can reason about software design and distributed systems.
  • Proficient in at least one programming language.
  • Nice to have: experience running blockchains, particularly validators and remote signers/multisig.
  • Manage compute infrastructure and the applications that depend on it.
  • Document what we learn, including what went wrong and how to prevent future issues.

Software Development, Blockchain, Kubernetes

Posted about 2 months ago