Platform Support Engineer (APAC)
Lightning AI
AI Infrastructure
This role is remote and open to candidates based in either the Philippines or Singapore.
UTC+8; Thursday–Sunday, 7:00 AM to 5:00 PM local time
Full-Time
Middle
Salary not disclosed
Job Details
Required Skills
- Kubernetes
- Machine Learning
- PyTorch
- Grafana
- Prometheus
- Linux
- Distributed Systems
Requirements
- Strong software engineering and systems troubleshooting background
- Experience with Kubernetes and containerized environments
- Linux systems knowledge, including networking, storage, process management, and performance tuning
- Experience with cloud infrastructure and distributed systems
- Experience with observability and debugging tools such as Prometheus, Grafana, or OpenTelemetry
- Hands-on experience operating machine learning workloads in production or research environments
- Experience with distributed ML systems and tooling such as PyTorch, CUDA, or NCCL
- Familiarity with GPU infrastructure and orchestration
- Strong communication skills and ability to work directly with highly technical customers
Responsibilities
- Partner directly with customer engineering teams running training and inference workloads in production
- Help customers diagnose and resolve complex distributed systems and ML infrastructure issues
- Act as a technical advisor during high-impact incidents and platform degradation events
- Translate infrastructure-level issues into actionable guidance for ML engineers
- Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems
- Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving
- Identify recurring patterns across customer issues and drive long-term reliability improvements
- Build internal tooling, automation, documentation, and runbooks