Principal ML Engineer, Machine Learning Platform and Systems Architecture
Remote
Flexible remote work options across the United States and Canada
Full-Time
Principal
Salary: 152,000 - 272,250 USD per year
Job Details
- Experience: 6–8+ years of experience in software engineering, ML infrastructure, platform engineering, or distributed systems
- Required Skills: Python, Kubernetes, Distributed Systems
Requirements
- 6–8+ years of experience in software engineering, ML infrastructure, platform engineering, or distributed systems
- Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent practical experience
- Strong expertise in designing and operating large-scale distributed systems and data platforms
- Advanced proficiency in Python and strong production software engineering practices
- Experience leading complex, cross-functional technical initiatives across multiple engineering teams
- Strong background in ML infrastructure including model deployment, inference systems, and observability frameworks
- Experience with large-scale data pipelines, cloud-native architectures, and distributed processing frameworks
- Ability to make architectural decisions balancing scalability, performance, reliability, and cost
- Strong communication and stakeholder management skills across technical and leadership audiences
- Preferred: experience with Kubernetes, ML orchestration tools, data lineage systems, and ML-ready data representations (graph, geometry, multimodal)
Responsibilities
- Lead architecture and delivery of core ML platform capabilities including training, deployment, evaluation, and observability systems
- Design scalable distributed systems for data processing, feature engineering, model lifecycle management, and production inference
- Own end-to-end technical outcomes for platform initiatives, from architecture design through deployment and operational support
- Develop and scale large data pipelines for structured and semi-structured datasets across distributed environments
- Define and implement frameworks for model deployment, monitoring, observability, and system reliability
- Establish data governance, lineage, and responsible data usage practices across ML infrastructure
- Drive architecture for distributed processing systems using tools such as Ray, Spark, Airflow, or equivalent technologies
- Lead incident response for critical platform issues and implement long-term system improvements
- Mentor engineers, provide technical leadership, and establish best practices for ML system design and operations
- Communicate technical strategies, tradeoffs, and architecture decisions to both technical and non-technical stakeholders