Staff Machine Learning Systems Engineer

Based in the United States, Full-Time, Staff
Salary not disclosed

Job Details

Experience
8+ years
Required Skills
Python, Kubernetes, CI/CD, Terraform, Datadog

Requirements

  • 8+ years of experience in platform engineering, DevOps, SRE, or infrastructure roles.
  • Hands-on ML/AI systems experience.
  • Strong expertise with Kubernetes (preferably EKS).
  • Proficiency in infrastructure-as-code tools such as Terraform.
  • Solid programming skills in Python.
  • Experience operating LLM or ML inference systems in production.
  • Hands-on experience with observability stacks (Datadog, OpenTelemetry).
  • Strong understanding of CI/CD systems and GitOps workflows.
  • Experience designing IAM, OIDC, and secrets management systems in cloud environments.
  • Systems-thinking mindset with strong attention to reliability.

Responsibilities

  • Lead the design, evolution, and operation of the core ML infrastructure platform.
  • Own and optimize Kubernetes-based infrastructure, including autoscaling and workload orchestration.
  • Build and maintain GitOps-based CI/CD pipelines.
  • Design and implement model serving and inference infrastructure, including LLM routing and API gateways.
  • Develop observability, tracing, and monitoring systems for AI workloads.
  • Define and enforce SLOs, incident response processes, and reliability standards.
  • Own infrastructure-as-code and platform tooling to improve developer velocity.
  • Drive security, IAM, and secrets management architecture.
  • Collaborate with ML, product, and data teams to translate prototypes into production-ready systems.
  • Provide technical leadership and mentorship across ML systems engineering initiatives.