- Own the reliability of the event-driven messaging layer, including backpressure management, idempotency, dead-letter handling, and retry strategies.
- Build and operate the infrastructure that runs LLM orchestration workloads at scale.
- Own the operational data layer for the CI runtime, including state management, session persistence, and real-time data access patterns.
- Own observability for the CI platform, including structured logging, distributed tracing (OpenTelemetry), and error tracking (Sentry).
- Maintain and harden the interfaces between CI and downstream platforms, including contract testing, versioning, and failure handling.
- Conduct code reviews and mentor team members on Python engineering practices and production readiness.
- Own production support for CI infrastructure, including on-call responsibilities and incident response.
Python, Azure, NoSQL