Staff Machine Learning Systems Engineer
Based in the United States | Full-Time | Staff
Salary not disclosed
Job Details
- Experience: 8+ years
- Required Skills: Python, Kubernetes, CI/CD, Terraform, Datadog
Requirements
- 8+ years of experience in platform engineering, DevOps, SRE, or infrastructure roles.
- Hands-on ML/AI systems experience.
- Strong expertise with Kubernetes (preferably EKS).
- Proficiency in infrastructure-as-code tools such as Terraform.
- Solid programming skills in Python.
- Experience operating LLM or ML inference systems in production.
- Hands-on experience with observability stacks (Datadog, OpenTelemetry).
- Strong understanding of CI/CD systems and GitOps workflows.
- Experience designing IAM, OIDC, and secrets management systems in cloud environments.
- Systems-thinking mindset with strong attention to reliability.
Responsibilities
- Lead the design, evolution, and operation of the core ML infrastructure platform.
- Own and optimize Kubernetes-based infrastructure, including autoscaling and workload orchestration.
- Build and maintain GitOps-based CI/CD pipelines.
- Design and implement model serving and inference infrastructure, including LLM routing and API gateways.
- Develop observability, tracing, and monitoring systems for AI workloads.
- Define and enforce SLOs, incident response processes, and reliability standards.
- Own infrastructure-as-code and platform tooling to improve developer velocity.
- Drive security, IAM, and secrets management architecture.
- Collaborate with ML, product, and data teams to translate prototypes into production-ready systems.
- Provide technical leadership and mentorship across ML systems engineering initiatives.