Senior Machine Learning Systems Engineer, Ads ML Experience Platform
Reddit | Machine Learning
Remote - United States | Full-Time | Senior
Salary: $216,700–$303,400 USD
Job Details
- Experience: 5+ years in infrastructure/platform engineering or large-scale distributed systems; 2+ years of hands-on experience building and operating production ML infrastructure
- Required Skills: Kubeflow, Machine Learning, Airflow, Spark, Distributed Systems
Requirements
- 5+ years in infrastructure/platform engineering or large-scale distributed systems.
- 2+ years of hands-on experience building and operating production ML infrastructure, developer SDKs, platform APIs, or self-service AI tooling.
- Experience building workflow orchestration systems, developer platforms, or large-scale automation frameworks.
- Experience with distributed data processing systems such as Spark, Flink, Ray, or equivalent technologies.
- Experience with modern orchestration and workflow technologies such as Kubeflow, Argo, Airflow, or similar frameworks.
- Experience building offline ML experimentation platforms, model registries, experiment tracking systems, or training orchestration frameworks.
- Experience building and operating agentic AI systems, including multi-agent orchestration, autonomous workflows, and agent communication/runtime frameworks, is a strong plus.
- Experience running end-to-end model development and iteration cycles at scale is a plus.
Responsibilities
- Design and build large-scale offline ML experimentation platforms that enable reproducible research, model development, evaluation, and promotion workflows.
- Develop production-grade training orchestration frameworks supporting distributed training, hyperparameter optimization, model evaluation, and automated retraining.
- Build infrastructure for experiment tracking, metadata management, lineage, artifact versioning, model registries, and reproducibility.
- Partner with ML engineers and researchers to improve experimentation velocity and operational efficiency.
- Build automated workflows for model promotion, rollback, compliance validation, and continuous evaluation.
- Design and build an agentic AI execution platform supporting autonomous and human-in-the-loop workflows, including multi-agent orchestration, memory/context systems, and scalable workflow infrastructure.