Operations Engineer, Fleet Reliability

Posted 5 days ago

💎 Seniority level: Junior, 2+ years

💸 Salary: 90,000 - 110,000 USD per year

🔍 Industry: Software Development

🏢 Company: CoreWeave · 💰 $642,000,000 Secondary Market (over 1 year ago) · Cloud Computing, Machine Learning, Information Technology, Cloud Infrastructure

⏳ Experience: 2+ years

Requirements:
  • Strong understanding of Linux system administration and internals
  • Ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently and reliably
  • Software development or scripting languages (Bash, Python, PowerShell, etc.)
  • 2+ years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network, or a mix)
  • Grafana, Prometheus, PromQL queries, or similar observability platforms
  • Data center environments including server racks, HVAC systems, fiber trays
  • Kubernetes administration
  • HPC - administering GPU-related workloads
  • Bachelor’s degree in a related field or equivalent experience
Responsibilities:
  • Configure and maintain large-scale high-performance supercomputing clusters running state-of-the-art GPUs
  • Troubleshoot hardware and software issues; escalate and coordinate as needed with data center, network, hardware and platform teams to drive resolution
  • Monitor and analyze system performance and take appropriate remediation actions for cloud health
  • Approach your work with flexibility and optimism, anticipating shifting business and technical priorities
  • Create and maintain documentation of team processes, knowledge and best practices for system management
  • Think critically about your day-to-day work and work collaboratively to improve team processes and efficiency
  • Participate in on-call rotations, which include after-hours and weekend work