Machine Learning DevOps - Cloud & Compute Cluster - R&D Support

Pathway | Artificial Intelligence

EU, United States, Canada | Full-Time | Middle

Salary not disclosed

Job Details

Required Skills: AWS, Docker, Python, GCP, Jenkins, Kubeflow, Kubernetes, MLflow, PyTorch, Airflow, Azure, Grafana, Prometheus, TensorFlow, Linux, Terraform, GitHub Actions, CloudFormation

Requirements:
Strong familiarity with Linux, shell scripts, and cluster configuration scripts
Proficiency in workload management, containerization and orchestration (Slurm, Docker, Kubernetes)
Solid grasp of CI/CD tools and workflows (GitHub Actions, Jenkins, GitLab CI)
Cloud infrastructure knowledge (AWS, GCP, Azure), especially ML services (e.g., SageMaker HyperPod, Vertex AI)
Familiarity with monitoring/logging tools (Grafana, CloudWatch, Prometheus, Loki)
Experience with infrastructure as code (Terraform, CloudFormation, cluster-toolkit)
Experience with ML pipeline orchestration tools (e.g., MLflow, Kubeflow, Airflow, Metaflow)
Programming skills in Python (with exposure to ML libraries like TensorFlow, PyTorch)
Experience with cluster, system, and network administration
BSc in Computer Science or Information Technology

Responsibilities:
Optimize infrastructure for ML training and inference (e.g., GPUs, distributed compute)
Automate and maintain ML/LLM pipelines (data ingestion, training, validation, deployment)
Manage model versioning, reproducibility, and traceability
Work with terabyte-scale datasets
Implement ML-centric CI/CD practices
Monitor model performance and data drift in production
Collaborate with machine learning engineers, software engineers, and platform teams
Operationalize machine learning models, ensuring scalability, reliability, and automation across the ML lifecycle