Senior Manager, Platform, Lifecycle, & Troubleshooting

Vultr, Cloud Infrastructure
Remote - United States, Full-Time, Manager
Salary: 120,000 - 140,000 USD per year

Job Details

Experience
8+ years
Required Skills
Python, Bash, Linux, Ansible

Requirements

  • 8+ years of experience in Linux systems administration, platform engineering, or SRE-style operations in cloud or large-scale infrastructure environments.
  • Deep expertise in troubleshooting GPU, storage, RDMA, and high-performance networking issues.
  • Proven track record leading technical teams, including on-call rotations and complex migrations.
  • Strong scripting/automation skills (Python, Bash, Ansible, etc.) and experience with monitoring tools.
  • Excellent problem-solving, documentation, and cross-team communication abilities.
  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

Responsibilities

  • Lead the Platform, Lifecycle & Troubleshooting team in resolving complex incidents and platform issues.
  • Own server repurposing, migrations (e.g., OS/distribution upgrades), and deeper lifecycle management.
  • Perform and guide advanced troubleshooting for RDMA links, GPU, storage, and server-side networking.
  • Validate firmware choices and handle complex/ongoing firmware updates.
  • Provide 24/7 on-call leadership and drive incident response improvements.
  • Develop runbooks, automation, and self-healing processes to reduce toil and improve MTTR.
  • Collaborate closely with Hardware and Onboarding teams on handoffs and mixed tickets.
  • Partner with Engineering, Networking, and Solutions teams on technical escalations and improvements.
  • Mentor senior engineers and build a high-performing team focused on root-cause analysis.
  • Track key metrics (uptime, incident trends, migration success) and drive operational maturity.