Platform Support Engineer (APAC)

Lightning AI
AI Infrastructure
This role is remote and open to candidates based in either the Philippines or Singapore.
UTC+8; Thursday–Sunday, 7:00 AM to 5:00 PM local time
Full-Time
Middle
Salary not disclosed

Job Details

Required Skills
Kubernetes, Machine Learning, PyTorch, Grafana, Prometheus, Linux, Distributed Systems

Requirements

  • Strong software engineering and systems troubleshooting background
  • Experience with Kubernetes and containerized environments
  • Linux systems knowledge, including networking, storage, process management, and performance tuning
  • Experience with cloud infrastructure and distributed systems
  • Experience with observability and debugging tools such as Prometheus, Grafana, or OpenTelemetry
  • Hands-on experience operating machine learning workloads in production or research environments
  • Experience with distributed ML systems and tooling such as PyTorch, CUDA, or NCCL
  • Familiarity with GPU infrastructure and orchestration
  • Strong communication skills and ability to work directly with highly technical customers

Responsibilities

  • Partner directly with customer engineering teams running training and inference workloads in production
  • Help customers diagnose and resolve complex distributed systems and ML infrastructure issues
  • Act as a technical advisor during high-impact incidents and platform degradation events
  • Translate infrastructure-level issues into actionable guidance for ML engineers
  • Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems
  • Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving
  • Identify recurring patterns across customer issues and drive long-term reliability improvements
  • Build internal tooling, automation, documentation, and runbooks