Senior DevOps Engineer - Compute Platforms

This role is fully remote for candidates who reside outside a 50-mile radius of our San Ramon office. For candidates who reside within 50 miles of our San Ramon location, this role is hybrid and requires 3 days a week (M, W, Th) in our San Ramon office.

Full-Time, Senior

Salary: 82,300 - 228,800 USD per year

Job Details

Experience: 6+ years
Required Skills: Python, Bash, Git, Kubernetes, CI/CD, Linux, Terraform, Ansible, Helm

Requirements

6+ years of experience as a DevOps Engineer, Site Reliability Engineer, or Infrastructure Operations Engineer with a strong focus on compute
Strong hands-on experience operating bare metal compute environments at scale
Experience with PXE boot, automated OS provisioning, and server imaging systems
Practical experience supporting Bare Metal as a Service (BMaaS) platforms leveraging Redfish APIs
Strong Linux administration skills, especially with Ubuntu
Operational experience with virtualization and private cloud platforms, including KVM on Ubuntu, OpenStack operations and troubleshooting, and Harvester HCI
Experience deploying and operating production Kubernetes environments
Expertise with enterprise compute hardware, including Cisco UCS, Dell PowerEdge, HPE, and Supermicro systems
Proficiency with Infrastructure as Code tools (e.g., Terraform, Ansible, or similar)
Experience building or supporting CI/CD pipelines for infrastructure and platform automation
Strong scripting skills in Python, Bash, or similar languages
Strong understanding of SRE functions such as toil reduction, error budgets, and meeting SLAs
Proven troubleshooting and root cause analysis skills in complex distributed systems
Excellent written and verbal communication skills
Bachelor’s degree in computer science or equivalent professional experience

Responsibilities

Operate and support enterprise compute platforms across hardware, OS, virtualization, and container orchestration layers
Deploy and maintain bare metal server infrastructure running Ubuntu, with Kubernetes and hypervisor platforms including OpenStack and Harvester
Implement and maintain PXE-based provisioning environments leveraging Redfish APIs for large-scale server deployments
Install, patch, and maintain operating systems including Ubuntu and Harvester
Operate and support virtualization and private cloud platforms, including KVM on Ubuntu, OpenStack environments and Harvester HCI
Develop Infrastructure-as-Code using Ansible, Terraform, Helm and Git, with Python/Bash automation
Implement CI/CD pipelines for infrastructure updates, patching, upgrades, testing, and rollback
Perform firmware updates, patch management, and hardware health validation
Monitor system performance, capacity, and availability; proactively address reliability risks
Troubleshoot complex cross-stack issues spanning hardware, OS, virtualization, OpenStack, and Kubernetes
Manage to SLAs, KPIs and error budgets
Participate in on-call escalation support for complex platform-related issues
Collaborate globally on change management, documentation, and operational best practices
Develop and maintain runbooks, operational procedures, and technical documentation