- Design and evolve the core platform (Kubernetes, Mesos, GPU scheduling/autoscaling, distributed compute)
- Own the model platform: registry, experiment tracking, training orchestration, evaluation, serving, and monitoring
- Build the Golden Path: reference repos, a scaffold CLI, opinionated CI/CD pipelines, runtime contracts, high-performance clients, circuit breakers, and other production‑ready defaults
- Operate a secure, multi‑tenant model registry and training platform with standardized experiment/evaluation harnesses
- Provide turnkey serving patterns (online + batch), drift/quality monitoring, and rollback playbooks
- Integrate public/open‑source AI capabilities as managed platform services with cost and data‑governance guardrails
- Run the squad: roadmap and prioritization, delivery, and high engineering standards
- Partner with product engineering (Zyte API, Scrapy Cloud), Prod Ops, and Security on adoption and rollout plans
- Mentor the team and foster a platform-thinking mindset
Python, Java, Kafka (+6 more)