- Own platform reliability, infrastructure, observability, and incident response.
- Manage and scale sandbox infrastructure for AI agents.
- Design and configure monitoring, alerting, and observability pipelines.
- Lead structured incident responses and drive permanent root-cause fixes.
- Automate infrastructure using IaC and container orchestration.
- Inherit and maintain existing infrastructure (Terraform, OpenTelemetry, Prometheus/Grafana).
- Collaborate with AI Platform Engineers to secure and isolate tenant workloads.
DockerGCPKubernetes+4 more