Distributed Systems & Reliability Engineer

Posted 25 days ago
United States, Full-Time, Software Development
Company: Glydways
Location: United States
Languages: English
Seniority level: Senior
Skills:
Backend Development, Software Development, Kubernetes, Software Architecture, C++, Jira, Go, Grafana, Prometheus, CI/CD, RESTful APIs, Linux, DevOps, Microservices
Requirements:
Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections).
Proven track record designing and implementing high-availability and failover patterns.
Ability to design state replication and recovery mechanisms.
Expertise in idempotent, restart-safe operations and APIs.
Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition.
Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent).
Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
Safety-critical mindset and comfort working in a requirements-driven environment.
Strong ownership and collaboration skills.
Responsibilities:
Own the reliability, availability, and failover behavior of the centralized planning system in production.
Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off.
Define and build state continuity mechanisms so backup instances can take over from recent state.
Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions.
Extend and refine recovery behaviors, ensuring the system gets to a safe state first.
Expand and maintain observability: logs, metrics, traces, dashboards, and alerts.
Harden configuration, pipelines, and deployments for the system and related services.
Design and maintain automated test and robustness suites.
Apply safety-critical, requirements-driven reasoning to functional changes.
Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior.
About the Company
Glydways
51-100 employees, Transportation