Principal ML Engineer, Machine Learning Platform and Systems Architecture

Location: Remote (flexible remote work options across the United States and Canada)
Employment Type: Full-Time
Level: Principal

Salary: 152,000 - 272,250 USD per year

Job Details

Experience: 6–8+ years of experience in software engineering, ML infrastructure, platform engineering, or distributed systems
Required Skills: Python, Kubernetes, Distributed Systems

Qualifications

6–8+ years of experience in software engineering, ML infrastructure, platform engineering, or distributed systems
Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent practical experience
Strong expertise in designing and operating large-scale distributed systems and data platforms
Advanced proficiency in Python and strong production software engineering practices
Experience leading complex, cross-functional technical initiatives across multiple engineering teams
Strong background in ML infrastructure including model deployment, inference systems, and observability frameworks
Experience with large-scale data pipelines, cloud-native architectures, and distributed processing frameworks
Ability to make architectural decisions balancing scalability, performance, reliability, and cost
Strong communication and stakeholder management skills across technical and leadership audiences
Preferred: experience with Kubernetes, ML orchestration tools, data lineage systems, and ML-ready data representations (graph, geometry, multimodal)

Responsibilities

Lead architecture and delivery of core ML platform capabilities including training, deployment, evaluation, and observability systems
Design scalable distributed systems for data processing, feature engineering, model lifecycle management, and production inference
Own end-to-end technical outcomes for platform initiatives, from architecture design through deployment and operational support
Develop and scale large data pipelines for structured and semi-structured datasets across distributed environments
Define and implement frameworks for model deployment, monitoring, observability, and system reliability
Establish data governance, lineage, and responsible data usage practices across ML infrastructure
Drive architecture for distributed processing systems using tools such as Ray, Spark, Airflow, or equivalent technologies
Lead incident response for critical platform issues and implement long-term system improvements
Mentor engineers, provide technical leadership, and establish best practices for ML system design and operations
Communicate technical strategies, tradeoffs, and architecture decisions to both technical and non-technical stakeholders
