MLOps Lead

Germany | Full-Time | Lead
Salary not disclosed

Job Details

Experience
Minimum of 7 years of experience in MLOps or machine learning infrastructure engineering, including at least 3 years in a technical leadership role.
Required Skills
AWS, Python, GCP, Kubernetes, MLflow, PyTorch, TensorFlow, CI/CD, Terraform, MLOps

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
  • Minimum of 7 years of experience in MLOps or machine learning infrastructure engineering, including at least 3 years in a technical leadership role.
  • Strong software engineering expertise in Python, with working knowledge of Bash and/or Go.
  • Proven experience building, scaling, and leading MLOps infrastructure from the ground up.
  • Deep knowledge of machine learning platforms and frameworks such as MLflow, Weights & Biases (W&B), PyTorch, and TensorFlow.
  • Extensive experience with model serving technologies including Triton Inference Server, TorchServe, TensorFlow Serving, or KServe.
  • Hands-on expertise with Kubernetes, cloud platforms (AWS, GCP, or Azure), infrastructure-as-code and deployment tooling (Terraform, Helm, GitOps workflows), and production-grade data pipelines.
  • Strong experience with monitoring and observability solutions such as Prometheus, Grafana, Datadog, and OpenTelemetry.
  • Excellent communication skills with the ability to collaborate effectively across research and engineering teams.

Responsibilities

  • Lead, mentor, and develop a high-performing team of MLOps engineers while fostering a culture of collaboration, technical excellence, and continuous improvement.
  • Define and execute the MLOps roadmap, aligning infrastructure initiatives with research, engineering, and product objectives.
  • Design, implement, and maintain scalable machine learning infrastructure, including automated training pipelines, CI/CD workflows, orchestration frameworks, and deployment processes.
  • Drive architectural decisions for model serving platforms, ensuring low-latency, high-throughput inference using modern serving technologies.
  • Build and optimize feature stores, data pipelines, and storage solutions that support large-scale model training and production inference.
  • Collaborate closely with research teams to streamline the transition of machine learning models from experimentation to production environments.
  • Establish monitoring, logging, alerting, and observability strategies to ensure model performance, system reliability, and early detection of drift or operational issues.
  • Define engineering standards, operational best practices, and scalable infrastructure processes that support long-term platform growth.