- Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs serving production traffic at scale.
- Build the monitoring, alerting, and observability needed to catch ML-specific failures, output quality degradation, and model regressions.
- Harden model deployment workflows with canary releases, shadow testing, automated rollbacks, and validation gates.
- Drive the security posture of the model fleet, including abuse detection, rate limiting, and protection against adversarial usage.
- Operationalize safety systems for generative media, including content moderation pipelines and guardrails.
- Lead incident response for model API outages, conduct postmortems, and drive engineering improvements to prevent recurrence.
- Improve capacity planning, autoscaling, and GPU fleet efficiency for inference workloads.
- Partner with model and infrastructure teams to integrate reliability and safety requirements into model onboarding.
Skills: Python, Kubernetes, Machine Learning