Senior Software Engineer - Fleet Management
Nscale | AI Infrastructure
Join our thriving remote-first team. Geography is no barrier to impact or connection.
Full-Time | Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: Python, Distributed Systems
Requirements
- 5+ years of software engineering experience building and operating production systems
- Focus on infrastructure automation or workflow tooling
- Strong proficiency in Python
- Experience building distributed systems at scale
- Expertise in systems design tradeoffs
- Proficiency with AI tools such as Claude or Cursor
- Track record of taking automation systems from ambiguous requirements through to production operation
- Hands-on day 2 operations experience (monitoring, incident response, performance optimisation)
- Ability to work independently in a fast-paced environment
Responsibilities
- Build workflow automation systems for GPU node and network switch lifecycle management at scale
- Design foundational platform components, built on established software patterns, that other engineers build on
- Implement device provisioning, burn-in testing, network configuration, and hardware health validation workflows
- Integrate with datacenter infrastructure management systems, cloud orchestration platforms, and bare metal provisioning tools
- Build distributed workflow orchestration systems to coordinate complex automation tasks across the fleet
- Drive technical strategy for reliability, observability, incident response, and operational excellence
- Partner with Infrastructure, Platform, and SRE teams to automate hardware lifecycle operations
- Use AI tools to accelerate delivery while maintaining architectural coherence