Staff Software Engineer, Platform Reliability

Posted 16 days ago

136,000 - 170,000 USD per year

United States, Full-Time, Software Development

Company: Housecall Pro

Location: United States

Languages: English

Seniority level: Staff, 6-9+ years

Experience: 6-9+ years

Skills:

AWS, Docker, PostgreSQL, SQL, Cloud Computing, ETL, Kubernetes, MySQL, Data engineering, Grafana, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices, Problem Solving, Mentoring, Compliance, Data modeling, Software Engineering, Troubleshooting

Requirements:

6–9+ years of experience as a Software Engineer, with significant exposure to operating production systems.
Strong proficiency in reading, debugging, and improving large backend codebases.
Experience building and operating distributed systems or service-oriented architectures.
Solid understanding of performance engineering, failure modes, and reliability fundamentals at the code and system level.
Hands-on experience with observability tools (metrics, logging, tracing) and using them to diagnose code-level issues.
Experience working with relational databases (e.g., MySQL, PostgreSQL), including query optimization and schema design.
Strong knowledge of Kubernetes, container orchestration, and cloud-native runtime environments.
Experience participating in incident response and production on-call rotations.
Strong communication skills and the ability to work collaboratively with feature teams on shared codebases.

Responsibilities:

Dive into service codebases to understand how implementation details, data access patterns, and architectural choices affect production behavior.
Use metrics, logs, traces, and database telemetry to trace production issues back to specific code paths, queries, or design decisions.
Partner with feature teams to debug complex reliability and performance issues, proposing concrete code changes and architectural improvements.
Suggest and help implement improvements such as safer concurrency models, more efficient algorithms, better resource usage, and clearer service boundaries.
Help teams adopt resilient coding patterns, including retries with backoff, circuit breakers, bulkheads, idempotency, and graceful degradation.
Lead or contribute to post-incident reviews, translating operational failures into actionable engineering improvements.
Design and evolve observability tooling that makes it easier for engineers to reason about code-level behavior in production.
Review service and database interaction patterns to reduce latency, contention, and unnecessary load.
Collaborate on database-related improvements, including schema design, query optimization, migration strategies, and scaling approaches.
Contribute to reliability standards such as SLOs, service readiness expectations, and reliability scorecards.
Mentor engineers by modeling strong debugging practices, thoughtful system design, and ownership of production software.