Member of Technical Staff - ML Infra

Posted 2024-11-07

📍 Location: Germany, USA

🔍 Industry: Generative image and video models

🏢 Company: Black Forest Labs

🪄 Skills: AWS, GCP, Kubernetes, Azure, Grafana, Prometheus, CI/CD, Terraform

Requirements:
  • Strong proficiency in cloud platforms (AWS, Azure, or GCP) with focus on ML/AI services.
  • Extensive experience with Kubernetes and Slurm cluster management.
  • Expertise in Infrastructure as Code tools (e.g., Terraform, Ansible).
  • Proven track record in managing and optimizing network-based cloud file systems and object storage.
  • Experience with CI/CD tools and practices (e.g., CircleCI, GitHub Actions, ArgoCD).
  • Strong understanding of security principles and best practices in cloud environments.
  • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Loki).
  • Familiarity with ML workflows and GPU infrastructure management.
  • Demonstrated ability to handle complex migrations and breaking changes in production environments.
Responsibilities:
  • Design, deploy, and maintain cloud-based ML training (Slurm) and inference (Kubernetes) clusters.
  • Implement and manage network-based cloud file systems and blob/S3 storage solutions.
  • Develop and maintain Infrastructure as Code (IaC) for resource provisioning.
  • Implement and optimize CI/CD pipelines for ML workflows.
  • Design and implement custom autoscaling solutions for ML workloads.
  • Ensure security best practices across the ML infrastructure.
  • Provide developer-friendly tools and practices for efficient ML operations.