Senior Manager, Platform, Lifecycle & Troubleshooting
Vultr - Cloud Infrastructure
Remote - United States, Full-Time, Manager
Salary: 120,000 - 140,000 USD per year
Job Details
- Experience: 8+ years
- Required Skills: Python, Bash, Linux, Ansible
Requirements
- 8+ years of experience in Linux systems administration, platform engineering, or SRE-style operations in cloud or large-scale infrastructure environments.
- Deep expertise in troubleshooting GPU, storage, RDMA, and high-performance networking issues.
- Proven track record leading technical teams, including on-call rotations and complex migrations.
- Strong scripting/automation skills (Python, Bash, Ansible, etc.) and experience with monitoring tools.
- Excellent problem-solving, documentation, and cross-team communication abilities.
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
Responsibilities
- Lead the Platform, Lifecycle & Troubleshooting team in resolving complex incidents and platform issues.
- Own server repurposing, migrations (e.g., OS/distribution upgrades), and deeper lifecycle management.
- Perform and guide advanced troubleshooting for RDMA links, GPU, storage, and server-side networking.
- Validate firmware choices and handle complex/ongoing firmware updates.
- Provide 24/7 on-call leadership and drive incident response improvements.
- Develop runbooks, automation, and self-healing processes to reduce toil and improve MTTR.
- Collaborate closely with Hardware and Onboarding teams on handoffs and mixed tickets.
- Partner with Engineering, Networking, and Solutions teams on technical escalations and improvements.
- Mentor senior engineers and build a high-performing team focused on root-cause analysis.
- Track key metrics (uptime, incident trends, migration success) and drive operational maturity.