Platform Engineer – AI/ML Infrastructure

Posted 6 months ago

United States, Full-Time, Voice AI Platform

Company: Deepgram

Location: United States

Languages: English

Seniority level: Senior, 5+ years

Experience: 5+ years

Skills:

AWS, Python, Bash, Kubernetes, Machine Learning, CI/CD, Linux, DevOps, Terraform

Requirements:

- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
- Proven, hands-on experience building and managing production infrastructure with Terraform.
- Expert-level knowledge of Kubernetes architecture and operations at scale.
- Experience with high-performance computing (HPC) job schedulers, specifically Slurm.
- Experience managing bare metal infrastructure, including server provisioning and lifecycle management.
- Strong scripting and automation skills (e.g., Python, Go, Bash).

Responsibilities:

- Architect and maintain the core computing platform using Kubernetes on AWS and on-premises.
- Develop and manage infrastructure using Infrastructure-as-Code (IaC) with Terraform.
- Design, build, and optimize AI/ML job scheduling and orchestration systems with Slurm.
- Provision, manage, and maintain on-premises bare metal server infrastructure for GPU computing.
- Implement and manage platform networking and storage solutions.
- Develop an observability stack and automation for operational tasks.
- Collaborate with AI researchers and ML engineers on infrastructure needs.
- Automate the lifecycle of single-tenant, managed deployments.