Machine Learning Engineer, Reliability
fal
Generative AI
This role will need to be based in India, Australia, or New Zealand
Full-Time
Middle
Salary not disclosed
Job Details
- Experience: 3+ years of professional experience, including 1+ year operating production ML or high-scale API systems
- Required Skills: Python, Kubernetes, Machine Learning, PyTorch, Distributed Systems
Requirements
- 3+ years of professional experience.
- 1+ year of experience operating production ML or high-scale API systems.
- Experience with on-call ownership.
- Strong systems fundamentals: distributed systems, networking, observability, and incident management.
- Working knowledge of modern generative models (diffusion, transformers) and their production failure modes.
- Bias toward automation, measurement, and blameless postmortems.
- Familiarity with security and safety practices for ML systems (abuse prevention, content safety, or trust & safety).
- Proficiency with Python, torch, and diffusers.
- Experience with Kubernetes.
Responsibilities
- Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs serving production traffic at scale.
- Build the monitoring, alerting, and observability needed to catch ML-specific failures such as output quality degradation and model regressions.
- Harden model deployment workflows with canary releases, shadow testing, automated rollbacks, and validation gates.
- Drive the security posture of the model fleet, including abuse detection, rate limiting, and protection against adversarial usage.
- Operationalize safety systems for generative media, including content moderation pipelines and guardrails.
- Lead incident response for model API outages, conduct postmortems, and drive engineering improvements to prevent recurrence.
- Improve capacity planning, autoscaling, and GPU fleet efficiency for inference workloads.
- Partner with model and infrastructure teams to integrate reliability and safety requirements into model onboarding.