ML Infrastructure Engineer

Later, Influencer marketing
Los Angeles, California, United States
Full-Time, Middle
Salary: 145,000 - 165,000 USD per year

Job Details

Experience
4+ years of experience
Required Skills
AWS, Docker, Python, Flask, GCP, Kubernetes, MLflow, Grafana, Prometheus, CI/CD, Terraform, BigQuery, Datadog, CloudFormation

Requirements

  • 4+ years of experience in ML Ops, ML infrastructure, backend engineering, or related roles supporting production ML systems.
  • Experience working in cloud-native environments (AWS and/or GCP) with hands-on deployment of ML workloads.
  • Proven track record designing and implementing CI/CD pipelines for ML systems.
  • Strong experience with Amazon SageMaker, Docker, Flask-based APIs, and infrastructure automation tools.
  • Hands-on experience with ML lifecycle tooling such as MLflow, SageMaker Studio, or Weights & Biases.
  • Experience managing container orchestration platforms (Kubernetes, EKS, or GKE).
  • Strong programming experience in Python (additional experience in Go, Java, or Scala is a plus).
  • Experience working with infrastructure-as-code tools such as Terraform or CloudFormation.
  • Familiarity with observability tools such as CloudWatch, Prometheus, Grafana, Datadog, or centralized logging platforms.
  • Experience managing GPU-based workloads and scaling training/inference systems.
  • Familiarity with data infrastructure tools such as BigQuery and cloud-native data pipelines.

Responsibilities

  • Define and own the long-term ML infrastructure roadmap, ensuring it supports both current experimentation needs and future AI initiatives.
  • Establish best practices for model lifecycle management, deployment standards, monitoring, and governance.
  • Design, build, and maintain production-grade model deployment and inference systems using CI/CD pipelines, containerized services (Docker), and API frameworks (e.g., Flask).
  • Automate end-to-end ML lifecycle workflows including training pipelines, model validation, registry management, deployment, and rollback strategies.
  • Implement robust monitoring systems for model performance, latency, drift detection, and infrastructure health using tools such as CloudWatch, Prometheus, and Grafana.
  • Operate across AWS and GCP environments to manage training and inference workloads, including GPU-based infrastructure and BigQuery datasets.
  • Develop and maintain infrastructure-as-code (Terraform, CloudFormation) to ensure scalable, repeatable, and secure cloud environments.
  • Implement and optimize CI/CD workflows (e.g., GitHub Actions, GitLab CI, Bitbucket Pipelines) for ML and infrastructure automation.
  • Partner closely with Data Scientists, Analysts, Platform Engineers, and Product Engineers to support end-to-end ML workflows.
  • Stay current on emerging ML Ops practices, tools, and frameworks to continuously improve system reliability and efficiency.