Senior ML Infrastructure / DevOps Engineer

EU, United States, and Canada · Full-Time · Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
AWS, Docker, Python, GCP, Jenkins, Kubeflow, Kubernetes, MLflow, PyTorch, Airflow, Azure, Grafana, Prometheus, TensorFlow, CI/CD, Linux, Terraform, GitHub Actions, CloudFormation

Requirements

  • Former or current Linux / systems / network administrator who is comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing).
  • 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high-performance or ML workloads.
  • Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services.
  • Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments.
  • Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch.
  • Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI).
  • Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations.
  • Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents).
  • Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management.
  • Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow).

Responsibilities

  • Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management).
  • Automate infrastructure provisioning and configuration using infrastructure as code (Terraform, CloudFormation, cluster tooling) and configuration management.
  • Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback.
  • Implement and evolve ML-centric CI/CD: testing, packaging, deployment of models and services.
  • Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch).
  • Work with terabyte-scale datasets and the associated storage, networking, and performance challenges.
  • Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems.
  • Participate in on-call rotation for critical ML infrastructure and lead incident response and post-mortems when things break.