Senior ML Infrastructure / DevOps Engineer
EU, United States, and Canada · Full-Time · Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: AWS, Docker, Python, GCP, Jenkins, Kubeflow, Kubernetes, MLflow, PyTorch, Airflow, Azure, Grafana, Prometheus, TensorFlow, CI/CD, Linux, Terraform, GitHub Actions, CloudFormation
Requirements
- Former or current Linux / systems / network administrator who is comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing).
- 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high-performance or ML workloads.
- Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services.
- Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments.
- Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch.
- Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI).
- Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations.
- Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents).
- Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management.
- Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow).
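As a concrete illustration of the Python-plus-infrastructure skill set the requirements above describe, here is a minimal, hypothetical sketch of a cluster utility: parsing `nvidia-smi` CSV output into per-GPU utilization and flagging idle GPUs. The sample output string and the 20% threshold are illustrative assumptions, not part of this role's actual tooling.

```python
import csv
import io


def parse_gpu_utilization(nvidia_smi_csv: str) -> dict[int, float]:
    """Parse output of `nvidia-smi --query-gpu=index,utilization.gpu
    --format=csv,noheader` into {gpu_index: utilization_percent}."""
    usage = {}
    for row in csv.reader(io.StringIO(nvidia_smi_csv)):
        if not row:
            continue
        index = int(row[0].strip())
        # values look like " 87 %"; strip spaces and the unit before converting
        usage[index] = float(row[1].strip().rstrip(" %"))
    return usage


def underutilized(usage: dict[int, float], threshold: float = 20.0) -> list[int]:
    """Return GPU indices running below the utilization threshold,
    e.g. as input to an autoscaler or a queue-rebalancing job."""
    return sorted(i for i, u in usage.items() if u < threshold)


# Illustrative sample output for a 3-GPU node
sample = "0, 87 %\n1, 3 %\n2, 95 %\n"
usage = parse_gpu_utilization(sample)
print(underutilized(usage))  # GPUs idling below the 20% threshold
```

In practice a script like this would be fed by a metrics exporter rather than ad hoc shell output, but it captures the expected fluency: comfortable in Python, comfortable with GPU tooling, and biased toward automation.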
Responsibilities
- Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management).
- Automate infrastructure provisioning and configuration using infrastructure-as-code (Terraform, CloudFormation, cluster tooling) and configuration management.
- Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback.
- Implement and evolve ML-centric CI/CD: testing, packaging, deployment of models and services.
- Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch).
- Work with terabyte-scale datasets and the associated storage, networking, and performance challenges.
- Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems.
- Participate in on-call rotation for critical ML infrastructure and lead incident response and post-mortems when things break.
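The monitoring and alerting responsibilities above center on latency and throughput SLOs for serving. A minimal, stdlib-only sketch of the underlying check, assuming an illustrative p99 budget of 250 ms and simulated latencies (a real deployment would evaluate this as a Prometheus alert rule over exported histograms):

```python
import statistics


def latency_slo_report(latencies_ms: list[float],
                       p99_budget_ms: float = 250.0) -> dict:
    """Summarize request latencies against a p99 budget, the way a
    serving-side alert rule might. Budget value is an assumption."""
    # quantiles(n=100) yields the 1st..99th percentile cut points;
    # index 98 is the 99th percentile
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": p99,
        "breach": p99 > p99_budget_ms,
    }


# Simulated traffic: mostly fast requests with a slow tail
latencies = [20.0] * 97 + [400.0, 450.0, 500.0]
report = latency_slo_report(latencies)
print(report["breach"])
```

Tail percentiles, not averages, are what page the on-call engineer here: the median stays at 20 ms while the p99 blows the budget, which is exactly the failure mode that averages hide.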