Machine Learning DevOps - Cloud & Compute Cluster - R&D Support

Pathway | Artificial Intelligence
EU, United States, Canada | Full-Time | Middle
Salary not disclosed

Job Details

Required Skills
AWS, Docker, Python, GCP, Jenkins, Kubeflow, Kubernetes, MLflow, PyTorch, Airflow, Azure, Grafana, Prometheus, TensorFlow, Linux, Terraform, GitHub Actions, CloudFormation

Requirements

  • Strong familiarity with Linux, shell scripts, and cluster configuration scripts
  • Proficiency in workload management, containerization, and orchestration (Slurm, Docker, Kubernetes)
  • Solid grasp of CI/CD tools and workflows (GitHub Actions, Jenkins, GitLab CI)
  • Cloud infrastructure knowledge (AWS, GCP, Azure), especially ML services (e.g., SageMaker HyperPod, Vertex AI)
  • Familiarity with monitoring/logging tools (Grafana, CloudWatch, Prometheus, Loki)
  • Experience with infrastructure as code (Terraform, CloudFormation, cluster-toolkit)
  • Experience with ML pipeline orchestration tools (e.g., MLflow, Kubeflow, Airflow, Metaflow)
  • Programming skills in Python (with exposure to ML libraries like TensorFlow, PyTorch)
  • Experience with cluster, system, and network administration
  • BSc in Computer Science or Information Technology

Responsibilities

  • Optimize infrastructure for ML training and inference (e.g., GPUs, distributed compute)
  • Automate and maintain ML/LLM pipelines (data ingestion, training, validation, deployment)
  • Manage model versioning, reproducibility, and traceability
  • Work with terabyte-scale datasets
  • Implement ML-centric CI/CD practices
  • Monitor model performance and data drift in production
  • Collaborate with machine learning engineers, software engineers, and platform teams
  • Operationalize machine learning models, ensuring scalability, reliability, and automation across the ML lifecycle