ML Infrastructure Engineer
Later (Influencer marketing)
Los Angeles, California, United States · Full-Time · Mid-level
Salary: 145,000 - 165,000 USD per year
Job Details
- Experience: 4+ years
- Required Skills: AWS, Docker, Python, Flask, GCP, Kubernetes, MLflow, Grafana, Prometheus, CI/CD, Terraform, BigQuery, Datadog, CloudFormation
Requirements
- 4+ years of experience in ML Ops, ML infrastructure, backend engineering, or related roles supporting production ML systems.
- Experience working in cloud-native environments (AWS and/or GCP) with hands-on deployment of ML workloads.
- Proven track record designing and implementing CI/CD pipelines for ML systems.
- Strong experience with Amazon SageMaker, Docker, Flask-based APIs, and infrastructure automation tools.
- Hands-on experience with ML lifecycle tooling such as MLflow, SageMaker Studio, or Weights & Biases.
- Experience managing container orchestration platforms (Kubernetes, EKS, or GKE).
- Strong programming experience in Python (additional experience in Go, Java, or Scala is a plus).
- Experience working with infrastructure-as-code tools such as Terraform or CloudFormation.
- Familiarity with observability tools such as CloudWatch, Prometheus, Grafana, Datadog, or centralized logging platforms.
- Experience managing GPU-based workloads and scaling training/inference systems.
- Familiarity with data infrastructure tools such as BigQuery and cloud-native data pipelines.
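The lifecycle-tooling requirement above (MLflow, SageMaker Studio, Weights & Biases) centers on model registry and rollback workflows. As a purely illustrative sketch of the concept, not any specific tool's API, a minimal in-memory registry with one-step rollback might look like this (the `ModelRegistry` class and its method names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelRegistry:
    """Toy in-memory registry: tracks model versions and which one serves prod.

    Real tools (MLflow Model Registry, SageMaker Model Registry) add stages,
    metadata, and persistence; this only shows the promote/rollback shape.
    """
    versions: list = field(default_factory=list)
    production: Optional[str] = None
    _previous: Optional[str] = None  # last production version, for rollback

    def register(self, version: str) -> None:
        self.versions.append(version)

    def promote(self, version: str) -> None:
        """Move a registered version into production, remembering the old one."""
        if version not in self.versions:
            raise ValueError(f"unknown version: {version}")
        self._previous = self.production
        self.production = version

    def rollback(self) -> Optional[str]:
        """Swap production back to the previously promoted version (one step)."""
        self.production, self._previous = self._previous, self.production
        return self.production
```

In practice a registry entry would also carry the validation metrics that gate promotion, which is what ties registry management to the CI/CD pipelines mentioned below.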
Responsibilities
- Define and own the long-term ML infrastructure roadmap, ensuring it supports both current experimentation needs and future AI initiatives.
- Establish best practices for model lifecycle management, deployment standards, monitoring, and governance.
- Design, build, and maintain production-grade model deployment and inference systems using CI/CD pipelines, containerized services (Docker), and API frameworks (e.g., Flask).
- Automate end-to-end ML lifecycle workflows including training pipelines, model validation, registry management, deployment, and rollback strategies.
- Implement robust monitoring systems for model performance, latency, drift detection, and infrastructure health using tools such as CloudWatch, Prometheus, and Grafana.
- Operate across AWS and GCP environments to manage training and inference workloads, including GPU-based infrastructure and BigQuery datasets.
- Develop and maintain infrastructure-as-code (Terraform, CloudFormation) to ensure scalable, repeatable, and secure cloud environments.
- Implement and optimize CI/CD workflows (e.g., GitHub Actions, GitLab CI, Bitbucket Pipelines) for ML and infrastructure automation.
- Partner closely with Data Scientists, Analysts, Platform Engineers, and Product Engineers to support end-to-end ML workflows.
- Stay current on emerging ML Ops practices, tools, and frameworks to continuously improve system reliability and efficiency.
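The drift-detection responsibility above usually reduces to comparing a production feature's distribution against a training baseline. As a hedged illustration, here is one common signal, the population stability index (PSI), in plain Python; the function name, bin count, and the 0.1/0.25 thresholds are conventional choices, not requirements from this posting:

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Compute PSI between a baseline sample and a live sample.

    Common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 warrants a look,
    > 0.25 is usually treated as significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def frac(sample, i):
        left = lo + i * width
        if i == bins - 1:
            count = sum(1 for x in sample if left <= x <= hi)  # closed last bin
        else:
            count = sum(1 for x in sample if left <= x < left + width)
        return max(count / len(sample), 1e-6)  # floor to avoid log(0)

    return sum(
        (a - e) * math.log(a / e)
        for e, a in ((frac(expected, i), frac(actual, i)) for i in range(bins))
    )
```

In a monitoring stack like the one described, a number like this would typically be exported as a Prometheus gauge and alerted on in Grafana rather than checked ad hoc.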