Infrastructure Engineer
Remote in the US, Full-Time, Staff
Salary: 180,000 - 250,000 USD per year
Job Details
- Experience: 3+ years
- Required Skills: Python, Linux, Terraform, Ansible
Requirements
- 3+ years of experience managing bare-metal and cloud-based server fleets at scale (100+ nodes)
- Strong software engineering skills in Python; you write production tooling, not scripts
- Deep Linux systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling
- Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init
- Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning
- Familiarity with hardware diagnostics and failure modes (GPUs, NVMe, NICs, memory)
- Experience building internal tools or dashboards for infrastructure visibility
- Excellent communication and ability to drive technical decisions across teams
- Self-starter who executes quickly, takes ownership, and constantly seeks improvement
Responsibilities
- Build and maintain a Python fleet-tracking system that manages the full lifecycle of servers, including contracting and procurement, target use, pricing, availability, health, and RMAs
- Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery and alerting
- Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)
- Use AI extensively to build tools and automate alerting and recovery
- Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
- Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage
- Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)
- Develop a suite of automated error detection and recovery processes
- Work with partners to solve technical issues