MLOps Lead


Germany | Full-Time | Lead

Salary not disclosed


Job Details

Experience: Minimum of 7 years of experience in MLOps or machine learning infrastructure engineering, including at least 3 years in a technical leadership role.
Required Skills: AWS, Python, GCP, Kubernetes, MLflow, PyTorch, TensorFlow, CI/CD, Terraform, MLOps

Requirements

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
Minimum of 7 years of experience in MLOps or machine learning infrastructure engineering, including at least 3 years in a technical leadership role.
Strong software engineering expertise in Python, with working knowledge of Bash and/or Go.
Proven experience building, scaling, and leading MLOps infrastructure from the ground up.
Deep knowledge of machine learning platforms and frameworks such as MLflow, Weights & Biases (W&B), PyTorch, and TensorFlow.
Extensive experience with model serving technologies including Triton Inference Server, TorchServe, TensorFlow Serving, or KServe.
Hands-on expertise with Kubernetes, cloud platforms (AWS, GCP, or Azure), infrastructure-as-code and deployment tooling (Terraform, Helm, GitOps workflows), and production-grade data pipelines.
Strong experience with monitoring and observability solutions such as Prometheus, Grafana, Datadog, and OpenTelemetry.
Excellent communication skills with the ability to collaborate effectively across research and engineering teams.

Responsibilities

Lead, mentor, and develop a high-performing team of MLOps engineers while fostering a culture of collaboration, technical excellence, and continuous improvement.
Define and execute the MLOps roadmap, aligning infrastructure initiatives with research, engineering, and product objectives.
Design, implement, and maintain scalable machine learning infrastructure, including automated training pipelines, CI/CD workflows, orchestration frameworks, and deployment processes.
Drive architectural decisions for model serving platforms, ensuring low-latency, high-throughput inference using modern serving technologies.
Build and optimize feature stores, data pipelines, and storage solutions that support large-scale model training and production inference.
Collaborate closely with research teams to streamline the transition of machine learning models from experimentation to production environments.
Establish monitoring, logging, alerting, and observability strategies to ensure model performance, system reliability, and early detection of drift or operational issues.
Define engineering standards, operational best practices, and scalable infrastructure processes that support long-term platform growth.
