ML Infrastructure Engineer
Later (Influencer marketing)
Los Angeles, California, United States · Full-Time · Mid-level
Salary: 145,000 - 165,000 USD per year
Job Details
- Experience: 4+ years
- Required Skills: AWS, Docker, Python, Flask, GCP, Kubernetes, MLflow, Grafana, Prometheus, CI/CD, Terraform, BigQuery, Datadog, CloudFormation
Requirements
- 4+ years of experience in ML Ops, ML infrastructure, backend engineering, or related roles supporting production ML systems.
- Experience working in cloud-native environments (AWS and/or GCP) with hands-on deployment of ML workloads.
- Proven track record designing and implementing CI/CD pipelines for ML systems.
- Strong experience with Amazon SageMaker, Docker, Flask-based APIs, and infrastructure automation tools.
- Hands-on experience with ML lifecycle tooling such as MLflow, SageMaker Studio, or Weights & Biases.
- Experience managing container orchestration platforms (Kubernetes, EKS, or GKE).
- Strong programming experience in Python (additional experience in Go, Java, or Scala is a plus).
- Experience working with infrastructure-as-code tools such as Terraform or CloudFormation.
- Familiarity with observability tools such as CloudWatch, Prometheus, Grafana, Datadog, or centralized logging platforms.
- Experience managing GPU-based workloads and scaling training/inference systems.
- Familiarity with data infrastructure tools such as BigQuery and cloud-native data pipelines.
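The lifecycle-tooling requirement above (MLflow, SageMaker Studio, Weights & Biases) centers on model registry and rollback workflows. As a purely illustrative sketch of the concept, not any specific tool's API, a minimal in-memory registry with one-step rollback might look like this (the `ModelRegistry` class and its method names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelRegistry:
    """Toy in-memory registry: tracks model versions and which one serves prod.

    Real tools (MLflow Model Registry, SageMaker Model Registry) add stages,
    metadata, and persistence; this only shows the promote/rollback shape.
    """
    versions: list = field(default_factory=list)
    production: Optional[str] = None
    _previous: Optional[str] = None  # last production version, for rollback

    def register(self, version: str) -> None:
        self.versions.append(version)

    def promote(self, version: str) -> None:
        """Move a registered version into production, remembering the old one."""
        if version not in self.versions:
            raise ValueError(f"unknown version: {version}")
        self._previous = self.production
        self.production = version

    def rollback(self) -> Optional[str]:
        """Swap production back to the previously promoted version (one step)."""
        self.production, self._previous = self._previous, self.production
        return self.production
```

In practice a registry entry would also carry the validation metrics that gate promotion, which is what ties registry management to the CI/CD pipelines mentioned below.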
Responsibilities
- Define and own the long-term ML infrastructure roadmap, ensuring it supports both current experimentation needs and future AI initiatives.
- Establish best practices for model lifecycle management, deployment standards, monitoring, and governance.
- Design, build, and maintain production-grade model deployment and inference systems using CI/CD pipelines, containerized services (Docker), and API frameworks (e.g., Flask).
- Automate end-to-end ML lifecycle workflows including training pipelines, model validation, registry management, deployment, and rollback strategies.
- Implement robust monitoring systems for model performance, latency, drift detection, and infrastructure health using tools such as CloudWatch, Prometheus, and Grafana.
- Operate across AWS and GCP environments to manage training and inference workloads, including GPU-based infrastructure and BigQuery datasets.
- Develop and maintain infrastructure-as-code (Terraform, CloudFormation) to ensure scalable, repeatable, and secure cloud environments.
- Implement and optimize CI/CD workflows (e.g., GitHub Actions, GitLab CI, Bitbucket Pipelines) for ML and infrastructure automation.
- Partner closely with Data Scientists, Analysts, Platform Engineers, and Product Engineers to support end-to-end ML workflows.
- Stay current on emerging ML Ops practices, tools, and frameworks to continuously improve system reliability and efficiency.
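The drift-detection responsibility above usually reduces to comparing a production feature's distribution against a training baseline. As a hedged illustration, here is one common signal, the population stability index (PSI), in plain Python; the function name, bin count, and the 0.1/0.25 thresholds are conventional choices, not requirements from this posting:

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Compute PSI between a baseline sample and a live sample.

    Common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 warrants a look,
    > 0.25 is usually treated as significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def frac(sample, i):
        left = lo + i * width
        if i == bins - 1:
            count = sum(1 for x in sample if left <= x <= hi)  # closed last bin
        else:
            count = sum(1 for x in sample if left <= x < left + width)
        return max(count / len(sample), 1e-6)  # floor to avoid log(0)

    return sum(
        (a - e) * math.log(a / e)
        for e, a in ((frac(expected, i), frac(actual, i)) for i in range(bins))
    )
```

In a monitoring stack like the one described, a number like this would typically be exported as a Prometheus gauge and alerted on in Grafana rather than checked ad hoc.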