Machine Learning Engineer, Reliability

fal
Generative AI
This role will need to be based in India, Australia, or New Zealand
Full-Time
Middle
Salary not disclosed

Job Details

Experience
3+ years of professional experience, including 1+ year operating production ML or high-scale API systems
Required Skills
Python, Kubernetes, Machine Learning, PyTorch, Distributed Systems

Requirements

  • 3+ years of professional experience.
  • 1+ year of experience operating production ML or high-scale API systems.
  • Experience with on-call ownership.
  • Strong systems fundamentals: distributed systems, networking, observability, and incident management.
  • Working knowledge of modern generative models (diffusion, transformers) and their production failure modes.
  • Bias toward automation, measurement, and blameless postmortems.
  • Familiarity with security and safety practices for ML systems (abuse prevention, content safety, or trust & safety).
  • Proficiency with Python, torch, and diffusers.
  • Experience with Kubernetes.

Responsibilities

  • Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs serving production traffic at scale.
  • Build the monitoring, alerting, and observability needed to catch ML-specific failures, output quality degradation, and model regressions.
  • Harden model deployment workflows with canary releases, shadow testing, automated rollbacks, and validation gates.
  • Drive the security posture of the model fleet, including abuse detection, rate limiting, and protection against adversarial usage.
  • Operationalize safety systems for generative media, content moderation pipelines, and guardrails.
  • Lead incident response for model API outages, conduct postmortems, and drive engineering improvements to prevent recurrence.
  • Improve capacity planning, autoscaling, and GPU fleet efficiency for inference workloads.
  • Partner with model and infrastructure teams to integrate reliability and safety requirements into model onboarding.