Senior Site Reliability Engineer

Transcend Inc.
B2B SaaS, Security, Data, Privacy

United States, Full-Time, Senior

Salary: 170,000 - 185,000 USD per year

Job Details

Experience: 5+ years
Required Skills: AWS, Python, JavaScript, TypeScript, Terraform, Datadog, CloudFormation

Requirements

5+ years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or a closely related role
Hands-on ownership of production systems
Strong experience operating modern cloud infrastructure, ideally on AWS
Proficiency with at least one programming language used at Transcend (e.g., JavaScript, TypeScript, or Python)
Comfort reading and reviewing application code for reliability and performance concerns
Hands-on experience with infrastructure-as-code and CI/CD tooling (e.g., Terraform, CloudFormation, or similar; modern build/deploy pipelines)
Deep familiarity with observability and monitoring systems (e.g., Datadog or equivalent)
Proven track record running incident response and post-incident analysis
Excellent communication and collaboration skills
Experience working across multiple engineering teams to align on reliability goals, share context, and influence technical direction without formal authority
Comfort participating in an on-call rotation
Experience helping to design or improve on-call processes, runbooks, and escalation paths
Minimum level of education: Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related technical field, or equivalent practical experience
Demonstrated ability to thrive in a remote-first, high-autonomy environment

Responsibilities

Lead reliability-focused design and readiness reviews for new and existing services
Build, operate, and continuously improve our observability stack (e.g., logging, metrics, tracing) to provide meaningful dashboards, alerts, and runbooks
Own and evolve incident management practices, including on-call participation, incident response processes, and post-incident reviews
Plan and execute disaster recovery exercises and game days to validate our resilience posture
Perform capacity planning and cost optimization for our cloud infrastructure
Identify and drive down systemic reliability risks across application, infrastructure, and process layers
Collaborate closely with Developer Experience, Security, and product engineering to embed reliability best practices into shared tools and CI/CD pipelines
Participate in and help continuously improve the on-call rotation
