- Design and maintain scalable cloud environments (GCP/AWS) using Terraform
- Manage GPU/TPU resource allocation for training and fine-tuning
- Build internal services and CLI tools to streamline developer experience
- Design CI/CD and training pipelines using GitHub Actions, MLflow, and Vertex AI Pipelines
- Develop reusable patterns for model serving and Kubernetes deployments
- Manage and optimize vector databases and embedding pipelines for RAG-based systems
- Implement model drift monitoring, resource utilization tracking, and LLM agent tracing
- Perform inference optimization (quantization, distillation) and cost management