Machine Learning DevOps - Cloud & Compute Cluster - R&D Support
Pathway | Artificial Intelligence
EU, United States, Canada | Full-Time | Middle
Salary not disclosed
Job Details
- Required Skills
- AWS, Docker, Python, GCP, Jenkins, Kubeflow, Kubernetes, MLflow, PyTorch, Airflow, Azure, Grafana, Prometheus, TensorFlow, Linux, Terraform, GitHub Actions, CloudFormation
Requirements
- Strong familiarity with Linux, shell scripting, and cluster configuration scripts
- Proficiency in workload management, containerization and orchestration (Slurm, Docker, Kubernetes)
- Solid grasp of CI/CD tools and workflows (GitHub Actions, Jenkins, GitLab CI)
- Cloud infrastructure knowledge (AWS, GCP, Azure), especially ML services (e.g., SageMaker HyperPod, Vertex AI)
- Familiarity with monitoring/logging tools (Grafana, CloudWatch, Prometheus, Loki)
- Experience with infrastructure as code (Terraform, CloudFormation, cluster-toolkit)
- Experience with ML pipeline orchestration tools (e.g., MLflow, Kubeflow, Airflow, Metaflow)
- Programming skills in Python (with exposure to ML libraries like TensorFlow, PyTorch)
- Experience with cluster, system, and network administration
- BSc in Computer Science or Information Technology
Responsibilities
- Optimize infrastructure for ML training and inference (e.g., GPUs, distributed compute)
- Automate and maintain ML/LLM pipelines (data ingestion, training, validation, deployment)
- Manage model versioning, reproducibility, and traceability
- Work with terabyte-scale datasets
- Implement ML-centric CI/CD practices
- Monitor model performance and data drift in production
- Collaborate with machine learning engineers, software engineers, and platform teams
- Operationalize machine learning models, ensuring scalability, reliability, and automation across the ML lifecycle