Platform Support Engineer (APAC)

This role is remote and open to candidates based in either the Philippines or Singapore (UTC+8).
Schedule: Thursday–Sunday, 7:00 AM to 5:00 PM local time
Full-Time
Middle

Salary not disclosed

Job Details

Required Skills: Kubernetes, Machine Learning, PyTorch, Grafana, Prometheus, Linux, Distributed Systems

Requirements

Strong software engineering and systems troubleshooting background
Experience with Kubernetes and containerized environments
Linux systems knowledge, including networking, storage, process management, and performance tuning
Experience with cloud infrastructure and distributed systems
Experience with observability and debugging tools such as Prometheus, Grafana, or OpenTelemetry
Hands-on experience operating machine learning workloads in production or research environments
Experience with distributed ML systems and tooling such as PyTorch, CUDA, or NCCL
Familiarity with GPU infrastructure and orchestration
Strong communication skills and ability to work directly with highly technical customers

Responsibilities

Partner directly with customer engineering teams running training and inference workloads in production
Help customers diagnose and resolve complex distributed systems and ML infrastructure issues
Act as a technical advisor during high-impact incidents and platform degradation events
Translate infrastructure-level issues into actionable guidance for ML engineers
Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems
Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving
Identify recurring patterns across customer issues and drive long-term reliability improvements
Build internal tooling, automation, documentation, and runbooks
