AI/ML Infrastructure Engineer

Posted 2024-11-20

💸 Salary: 120,000 - 150,000 USD per year

🔍 Industry: Cloud Computing

🏢 Company: Vultr

Requirements:
  • Hands-on experience with high-performance NVIDIA GPUs.
  • In-depth experience automating bare metal internals including BIOS and firmware.
  • Experience with rail optimization across clusters and architectures.
  • Proficiency in Linux, package management, and device drivers.
  • Experience with commercial firmware.
  • Skills in Python, Bash, and PHP.
  • Experience with Machine Learning software.
Responsibilities:
  • Developing and maintaining infrastructure in bare metal and containerized environments.
  • Building scalable and supportable GPU clusters in collaboration with the networking team.
  • Ensuring consistent and reliable provisioning of GPU infrastructure for a positive customer experience.
  • Creating and maintaining test automation for fast, reliable provisioning of GPU products.
  • Benchmarking and performance testing of GPU systems to identify and resolve limitations.
  • Coordinating with vendors for drivers, support, and hardware issues.

Related Jobs

📍 United States

🔍 Mental health care technology

Requirements:
  • 5+ years of industry experience building production-level ML platforms and infrastructure.
  • Ability to write high-quality code in Python, Java, or Scala.
  • Experience building production-ready RESTful APIs and scaling platforms for large user bases.
  • Desire to own parts of an ML Platform with understanding of ML models and principles.
  • Experience with containers and deploying applications to Kubernetes.
  • Familiarity with LLMs and building infrastructure for LLM applications.
  • Experience with relational and low-latency databases.
  • Experience in transforming data in batch and streaming contexts.
  • Ability to manage large projects from scoping to delivery.
  • Strong communication and organizational skills, with the ability to simplify complex problems.

Responsibilities:
  • Be part of a team building scalable infrastructure for training, evaluating, deploying, performing inference, and monitoring ML models.
  • Build, deploy, and maintain generative AI services and applications.
  • Create data systems to collect, clean, label, and store data used for model features.
  • Deploy and manage applications in Kubernetes clusters.
  • Collaborate with Machine Learning engineers to support experimentation platforms and training frameworks.
  • Work with stakeholders to address requirements for ML infrastructure.

AWS, Python, Java, Kubeflow, Kubernetes, Machine Learning, MLFlow, PyTorch, Communication Skills, RESTful APIs, Organizational skills

Posted 2024-11-21
