Senior IaaS / Kubernetes Platform Engineer
Worldwide remote, work anywhere | Full-Time | Senior
Salary not disclosed
Job Details
- Languages: Strong written and verbal English (B2+ minimum)
- Experience: 5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters
- Required Skills: Kubernetes, Linux, Terraform, Ansible
Requirements
- 5+ years in infrastructure/platform engineering roles
- 3+ years operating production Kubernetes clusters (building and managing the platform itself)
- Production experience with at least 3 of: KubeVirt, Cluster API (CAPI), Cilium/Calico, Rook-Ceph (100+ OSDs), ArgoCD/Flux
- Deep Linux systems knowledge: kernel tuning, networking stack, filesystem operations, performance troubleshooting
- Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning
- Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale
- Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics
- Networking fundamentals: BGP, VLANs, IPsec/WireGuard, DNS, load balancing
- Strong written and verbal English (B2+ minimum)
- Proactive mindset: identifying problems before incidents and driving improvements
Responsibilities
- Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
- Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
- Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
- Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
- Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
- Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
- Practice SRE discipline: define and maintain SLOs with error budgets, and implement proactive capacity management with 6-12 month forecasting.
- Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
- Write Ansible playbooks for bare-metal server configuration and fleet management.
- Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.