Distributed Systems & Reliability Engineer

Posted 25 days ago

United States, Full-Time, Software Development

Company: Glydways

Location: United States

Languages: English

Seniority level: Senior

Skills:

Backend Development, Software Development, Kubernetes, Software Architecture, C++, Jira, Go, Grafana, Prometheus, CI/CD, RESTful APIs, Linux, DevOps, Microservices

Requirements:

Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections).
Proven track record designing and implementing high-availability and failover patterns.
Ability to design state replication and recovery mechanisms.
Expertise in idempotent, restart-safe operations and APIs.
Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition.
Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent).
Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
Safety-critical mindset and comfort working in a requirements-driven environment.
Strong ownership and collaboration skills.

Responsibilities:

Own the reliability, availability, and failover behavior of the centralized planning system in production.
Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off.
Define and build state continuity mechanisms so backup instances can take over from recent state.
Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions.
Extend and refine recovery behaviors, ensuring the system gets to a safe state first.
Expand and maintain observability: logs, metrics, traces, dashboards, and alerts.
Harden configuration, pipelines, and deployments for the system and related services.
Design and maintain automated test and robustness suites.
Apply safety-critical, requirements-driven reasoning to functional changes.
Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior.