- Own reliability outcomes across critical backend services and their integration with mobile clients (React Native).
- Lead technical problem-solving during incidents: coordinate response, diagnose root causes, communicate status, and drive to resolution.
- Build and evolve monitoring/observability (dashboards, alerts, tracing, logging) that enables fast detection and diagnosis.
- Drive post-incident reviews (blameless) and ensure learnings become durable fixes (tech changes, runbooks, automation, process updates).
- Improve engineering safety and quality: guardrails, safer migrations, feature-flag practices, rollout strategies, and resilience patterns.
- Partner with product, design, QA, and engineering early to align delivery plans with operational risk and reliability needs.
- Strengthen documentation and onboarding: architecture notes, runbooks, service ownership docs, and 'how we work' guides.
- Mentor engineers through pairing, reviews, incident shadowing, and pragmatic coaching on production ownership.