Senior IaaS / Kubernetes Platform Engineer

Worldwide remote, work anywhere | Full-Time | Senior

Salary not disclosed

Job Details

Languages: Strong written and verbal English (B2+ minimum)
Experience: 5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters
Required Skills: Kubernetes, Linux, Terraform, Ansible

Requirements

5+ years in infrastructure/platform engineering roles
3+ years operating production Kubernetes clusters (building and managing the platform itself)
Production experience with at least 3 of: KubeVirt, Cluster API (CAPI), Cilium/Calico, Rook-Ceph (100+ OSDs), ArgoCD/Flux
Deep Linux systems knowledge: kernel tuning, networking stack, filesystem operations, performance troubleshooting
Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning
Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale
Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics
Networking fundamentals: BGP, VLAN, IPsec/WireGuard, DNS, load balancing
Strong written and verbal English (B2+ minimum)
Proactive mindset: identifying problems before incidents and driving improvements

Responsibilities

Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
Practice SRE discipline: define and maintain SLOs with error budgets, and implement proactive capacity management with 6-12 month forecasting.
Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
Write Ansible playbooks for bare-metal server configuration and fleet management.
Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.
