Senior Site Reliability Engineer (C#, .NET)

Climavision | Climate Technology

Fully Remote - United States, Eastern Timezone preferred | Full-Time | Senior

Salary: 135,000 - 170,000 USD per year

Job Details

Experience: Minimum of 7 years of experience in Site Reliability Engineering, DevOps, Production Engineering, Platform Engineering, or a related infrastructure-focused role, with at least 4 years in a role formally titled Site Reliability Engineer or carrying explicit SLO / error-budget accountability.
Required Skills: Kubernetes, C#, Azure, .NET, CI/CD, Terraform, Ansible, Distributed Systems

Requirements

Bachelor’s degree in computer science, software engineering, or a related field (equivalent experience considered).
Minimum of 7 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
At least 4 years in a role titled Site Reliability Engineer or with explicit SLO/error-budget accountability.
Minimum 3 years of hands-on experience supporting and modifying C#/.NET applications in production.
Demonstrated experience refactoring application code for horizontal scaling and multi-replica safety.
Experience operating production workloads in self-managed or highly customized Kubernetes environments.
Experience operating Kubernetes outside of managed cloud (e.g., bare-metal, colocation, edge, or hybrid).
Strong understanding of infrastructure automation using Terraform and Ansible.
Experience with CI/CD and production deployment pipelines (Octopus Deploy preferred).
Experience with observability stacks (e.g., DataDog, Prometheus, Grafana, Loki, OpenTelemetry).
Proven experience participating in structured production on-call rotations for business-critical systems.
Strong troubleshooting skills across application, platform, and infrastructure layers.

Responsibilities

Own production reliability for Climavision’s customer-facing platform and radar-derived weather data services across Azure, colocation, and edge Kubernetes environments.
Refactor C#/.NET codebases to ensure services are safe to run as multiple instances and across clusters by addressing state, idempotency, and concurrency issues.
Contribute to multi-cluster high-availability strategies, including failover behavior, traffic routing, and graceful degradation.
Support and coordinate production incident response efforts, including troubleshooting, mitigation, and postmortem analysis.
Operate and improve the self-managed Kubernetes platform, including lifecycle activities, upgrades, and platform maturity improvements.
Partner with software engineering teams to improve production readiness, resiliency patterns, and deployment safety.
Support and evolve the observability platform, including metrics, logging, distributed tracing, and alerting.
Participate in a 24/7 on-call rotation requiring active response to incidents.
Conduct performance engineering and capacity planning during peak weather events.
