Job Details
- Experience
- Minimum of 7 years of experience in MLOps or machine learning infrastructure engineering, including at least 3 years in a technical leadership role.
- Required Skills
- AWS, Python, GCP, Kubernetes, MLflow, PyTorch, TensorFlow, CI/CD, Terraform, MLOps
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- Minimum of 7 years of experience in MLOps or machine learning infrastructure engineering, including at least 3 years in a technical leadership role.
- Strong software engineering expertise in Python, with working knowledge of Bash and/or Go.
- Proven experience building, scaling, and leading MLOps infrastructure from the ground up.
- Deep knowledge of machine learning platforms and frameworks such as MLflow, Weights & Biases (W&B), PyTorch, and TensorFlow.
- Extensive experience with model serving technologies including Triton Inference Server, TorchServe, TensorFlow Serving, or KServe.
- Hands-on expertise with Kubernetes, cloud platforms (AWS, GCP, or Azure), infrastructure-as-code and deployment tooling (Terraform, Helm, GitOps), and production-grade data pipelines.
- Strong experience with monitoring and observability solutions such as Prometheus, Grafana, Datadog, and OpenTelemetry.
- Excellent communication skills with the ability to collaborate effectively across research and engineering teams.
Responsibilities
- Lead, mentor, and develop a high-performing team of MLOps engineers while fostering a culture of collaboration, technical excellence, and continuous improvement.
- Define and execute the MLOps roadmap, aligning infrastructure initiatives with research, engineering, and product objectives.
- Design, implement, and maintain scalable machine learning infrastructure, including automated training pipelines, CI/CD workflows, orchestration frameworks, and deployment processes.
- Drive architectural decisions for model serving platforms, ensuring low-latency, high-throughput inference using modern serving technologies.
- Build and optimize feature stores, data pipelines, and storage solutions that support large-scale model training and production inference.
- Collaborate closely with research teams to streamline the transition of machine learning models from experimentation to production environments.
- Establish monitoring, logging, alerting, and observability strategies to ensure model performance, system reliability, and early detection of drift or operational issues.
- Define engineering standards, operational best practices, and scalable infrastructure processes that support long-term platform growth.