- Design, develop, and deploy scalable, high-performance, and production-grade backend services and distributed systems to support large-scale model inference.
- Help shape the technical roadmap and design of our inference platform, with a focus on low-latency, high-throughput services.
- Ensure the reliability, scalability, and efficiency of our production systems using monitoring and observability tools such as Prometheus and Grafana.
- Partner cross-functionally with data science, product, and engineering teams to align platform capabilities with strategic business goals.
- Manage and optimize our GCP cloud infrastructure, and orchestrate workloads with Kubernetes.
- Promote and implement DevOps and SRE best practices for backend service development, testing, deployment, and monitoring.