Staff Software Engineer, Platform Reliability

Posted 16 days ago
136,000 - 170,000 USD per year
United States, Full-Time, Software Development
Company: Housecall Pro
Location: United States
Languages: English
Seniority level: Staff, 6-9+ years
Experience: 6-9+ years
Skills:
AWS, Docker, PostgreSQL, SQL, Cloud Computing, ETL, Kubernetes, MySQL, Data Engineering, Grafana, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices, Problem Solving, Mentoring, Compliance, Data Modeling, Software Engineering, Troubleshooting
Requirements:
6-9+ years of experience as a Software Engineer, with significant exposure to operating production systems.
Strong proficiency in reading, debugging, and improving large backend codebases.
Experience building and operating distributed systems or service-oriented architectures.
Solid understanding of performance engineering, failure modes, and reliability fundamentals at the code and system level.
Hands-on experience with observability tools (metrics, logging, tracing) and using them to diagnose code-level issues.
Experience working with relational databases (e.g., MySQL, PostgreSQL), including query optimization and schema design.
Strong knowledge of Kubernetes, container orchestration, and cloud-native runtime environments.
Experience participating in incident response and production on-call rotations.
Strong communication skills and the ability to work collaboratively with feature teams on shared codebases.
Responsibilities:
Dive into service codebases to understand how implementation details, data access patterns, and architectural choices affect production behavior.
Use metrics, logs, traces, and database telemetry to trace production issues back to specific code paths, queries, or design decisions.
Partner with feature teams to debug complex reliability and performance issues, proposing concrete code changes and architectural improvements.
Suggest and help implement improvements such as safer concurrency models, more efficient algorithms, better resource usage, and clearer service boundaries.
Help teams adopt resilient coding patterns, including retries with backoff, circuit breakers, bulkheads, idempotency, and graceful degradation.
Lead or contribute to post-incident reviews, translating operational failures into actionable engineering improvements.
Design and evolve observability tooling that makes it easier for engineers to reason about code-level behavior in production.
Review service and database interaction patterns to reduce latency, contention, and unnecessary load.
Collaborate on database-related improvements, including schema design, query optimization, migration strategies, and scaling approaches.
Contribute to reliability standards such as SLOs, service readiness expectations, and reliability scorecards.
Mentor engineers by modeling strong debugging practices, thoughtful system design, and ownership of production software.