Senior Manager, Platform, Lifecycle, & Troubleshooting

Vultr, Cloud Infrastructure

Remote - United States, Full-Time, Manager

Salary: 120,000 - 140,000 USD per year

Job Details

8+ years of experience in Linux systems administration, platform engineering, or SRE-style operations in cloud or large-scale infrastructure environments.
Deep expertise in troubleshooting GPU, storage, RDMA, and high-performance networking issues.
Proven track record leading technical teams, including on-call rotations and complex migrations.
Strong scripting/automation skills (Python, Bash, Ansible, etc.) and experience with monitoring tools.
Excellent problem-solving, documentation, and cross-team communication abilities.
Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

Lead the Platform, Lifecycle & Troubleshooting team in resolving complex incidents and platform issues.
Own server repurposing, migrations (e.g., OS/distribution upgrades), and deeper lifecycle management.
Perform and guide advanced troubleshooting for RDMA links, GPU, storage, and server-side networking.
Validate firmware choices and handle complex/ongoing firmware updates.
Provide 24/7 on-call leadership and drive incident response improvements.
Develop runbooks, automation, and self-healing processes to reduce toil and improve MTTR.
Collaborate closely with Hardware and Onboarding teams on handoffs and mixed tickets.
Partner with Engineering, Networking, and Solutions teams on technical escalations and improvements.
Mentor senior engineers and build a high-performing team focused on root-cause analysis.
Track key metrics (uptime, incident trends, migration success) and drive operational maturity.
