Senior Site Reliability Engineer (C#, .NET)
Climavision, Climate Technology
Fully Remote - United States, Eastern Timezone preferred
Full-Time, Senior
Salary: 135,000 - 170,000 USD per year
Job Details
- Experience: Minimum of 7 years of experience in Site Reliability Engineering, DevOps, Production Engineering, Platform Engineering, or a related infrastructure-focused role, with at least 4 years in a role formally titled Site Reliability Engineer or carrying explicit SLO / error-budget accountability.
- Required Skills: Kubernetes, C#, Azure, .NET, CI/CD, Terraform, Ansible, Distributed Systems
Requirements
- Bachelor’s degree in computer science, software engineering, or related field (equivalent experience considered).
- Minimum of 7 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
- At least 4 years in a role titled Site Reliability Engineer or with explicit SLO/error-budget accountability.
- Minimum 3 years of hands-on experience supporting and modifying C#/.NET applications in production.
- Demonstrated experience refactoring application code for horizontal scaling and multi-replica safety.
- Experience operating production workloads in self-managed or highly customized Kubernetes environments.
- Experience operating Kubernetes outside of managed cloud (e.g., bare-metal, colocation, edge, or hybrid).
- Strong understanding of infrastructure automation using Terraform and Ansible.
- Experience with CI/CD and production deployment pipelines (Octopus Deploy preferred).
- Experience with observability stacks (e.g., Datadog, Prometheus, Grafana, Loki, OpenTelemetry).
- Proven experience participating in structured production on-call rotations for business-critical systems.
- Strong troubleshooting skills across application, platform, and infrastructure layers.
Responsibilities
- Own production reliability for Climavision’s customer-facing platform and radar-derived weather data services across Azure, colocation, and edge Kubernetes environments.
- Refactor C#/.NET codebases to ensure services are safe to run as multiple instances and across clusters by addressing state, idempotency, and concurrency issues.
- Contribute to multi-cluster high-availability strategies, including failover behavior, traffic routing, and graceful degradation.
- Support and coordinate production incident response efforts, including troubleshooting, mitigation, and postmortem analysis.
- Operate and improve the self-managed Kubernetes platform, including lifecycle activities, upgrades, and platform maturity improvements.
- Partner with software engineering teams to improve production readiness, resiliency patterns, and deployment safety.
- Support and evolve the observability platform, including metrics, logging, distributed tracing, and alerting.
- Participate in a 24/7 on-call rotation requiring active response to incidents.
- Conduct performance engineering and capacity planning during peak weather events.