fal

Private Company

Open Positions: 3

This role will need to be based in India, Australia, or New Zealand.
Full-Time | Generative AI

Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs serving production traffic at scale.
Build the monitoring, alerting, and observability needed to catch ML-specific failures, output quality degradation, and model regressions.
Harden model deployment workflows with canary releases, shadow testing, automated rollbacks, and validation gates.
Drive the security posture of the model fleet, including abuse detection, rate limiting, and protection against adversarial usage.
Operationalize safety systems for generative media, including content moderation pipelines and guardrails.
Lead incident response for model API outages, conduct postmortems, and drive engineering improvements to prevent recurrence.
Improve capacity planning, autoscaling, and GPU fleet efficiency for inference workloads.
Partner with model and infrastructure teams to integrate reliability and safety requirements into model onboarding.

Python, Kubernetes, Machine Learning (+2 more)
