Senior IaaS / Kubernetes Platform Engineer

Worldwide remote, work anywhere | Full-Time | Senior
Salary not disclosed

Job Details

Languages
Strong written and verbal English (B2+ minimum)
Experience
5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters
Required Skills
Kubernetes, Linux, Terraform, Ansible

Requirements

  • 5+ years in infrastructure/platform engineering roles
  • 3+ years operating production Kubernetes clusters (building and managing the platform itself)
  • Production experience with at least 3 of: KubeVirt, Cluster API (CAPI), Cilium/Calico, Rook-Ceph (100+ OSDs), ArgoCD/Flux
  • Deep Linux systems knowledge: kernel tuning, networking stack, filesystem operations, performance troubleshooting
  • Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning
  • Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale
  • Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics
  • Networking fundamentals: BGP, VLAN, IPsec/WireGuard, DNS, load balancing
  • Strong written and verbal English (B2+ minimum)
  • Proactive mindset: identifying problems before incidents and driving improvements

Responsibilities

  • Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
  • Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
  • Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
  • Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
  • Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
  • Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
  • Practice SRE discipline: define and maintain SLOs with error budgets, and implement proactive capacity management with 6-12 month forecasting.
  • Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
  • Write Ansible playbooks for bare-metal server configuration and fleet management.
  • Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.