- Design and implement scalable, secure, and cost-efficient MLOps solutions leveraging AWS and Databricks
- Automate ML deployment pipelines, reducing manual intervention and operational overhead
- Collaborate closely with data scientists to ensure solutions align with established MLOps architecture, best practices, and platform standards
- Integrate security controls and compliance requirements throughout the entire machine learning lifecycle
- Own and manage incidents end-to-end, from root cause analysis to prevention of future occurrences
- Contribute to software system architecture and the design of platform-level components
- Build and optimize ML training, retraining, and inference pipelines, ensuring reliability and scalability
- Enhance observability with metrics, logging, tracing, and dashboards to ensure system visibility and performance
- Drive best practices in infrastructure automation, CI/CD, and cloud resource management across ML teams
AWS, Docker, Kubeflow, +6 more