Principal ML Engineer, Machine Learning Platform and Systems Architecture

Remote: Flexible remote work options across the United States and Canada
Full-Time
Principal
Salary: 152,000 - 272,250 USD per year

Job Details

Experience
6–8+ years of experience in software engineering, ML infrastructure, platform engineering, or distributed systems
Required Skills
Python, Kubernetes, Distributed Systems

Requirements

  • 6–8+ years of experience in software engineering, ML infrastructure, platform engineering, or distributed systems
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent practical experience
  • Strong expertise in designing and operating large-scale distributed systems and data platforms
  • Advanced proficiency in Python and strong production software engineering practices
  • Experience leading complex, cross-functional technical initiatives across multiple engineering teams
  • Strong background in ML infrastructure including model deployment, inference systems, and observability frameworks
  • Experience with large-scale data pipelines, cloud-native architectures, and distributed processing frameworks
  • Ability to make architectural decisions balancing scalability, performance, reliability, and cost
  • Strong communication and stakeholder management skills across technical and leadership audiences
  • Preferred: experience with Kubernetes, ML orchestration tools, data lineage systems, and ML-ready data representations (graph, geometry, multimodal)

Responsibilities

  • Lead architecture and delivery of core ML platform capabilities including training, deployment, evaluation, and observability systems
  • Design scalable distributed systems for data processing, feature engineering, model lifecycle management, and production inference
  • Own end-to-end technical outcomes for platform initiatives, from architecture design through deployment and operational support
  • Develop and scale large data pipelines for structured and semi-structured datasets across distributed environments
  • Define and implement frameworks for model deployment, monitoring, observability, and system reliability
  • Establish data governance, lineage, and responsible data usage practices across ML infrastructure
  • Drive architecture for distributed processing systems using tools such as Ray, Spark, Airflow, or equivalent technologies
  • Lead incident response for critical platform issues and implement long-term system improvements
  • Mentor engineers, provide technical leadership, and establish best practices for ML system design and operations
  • Communicate technical strategies, tradeoffs, and architecture decisions to both technical and non-technical stakeholders