Senior Site Reliability Engineer
Transcend Inc.
B2B SaaS, Security, Data, Privacy
United States, Full-Time, Senior
Salary: 170,000 - 185,000 USD per year
Job Details
- Experience: 5+ years
- Required Skills: AWS, Python, JavaScript, TypeScript, Terraform, Datadog, CloudFormation
Requirements
- 5+ years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or a closely related role
- Hands-on ownership of production systems
- Strong experience operating modern cloud infrastructure, ideally on AWS
- Proficiency with at least one programming language used at Transcend (e.g., JavaScript, TypeScript, or Python)
- Comfort reading and reviewing application code for reliability and performance concerns
- Hands-on experience with infrastructure-as-code and CI/CD tooling (e.g., Terraform, CloudFormation, or similar; modern build/deploy pipelines)
- Deep familiarity with observability and monitoring systems (e.g., Datadog or equivalent)
- Proven track record running incident response and post-incident analysis
- Excellent communication and collaboration skills
- Experience working across multiple engineering teams to align on reliability goals, share context, and influence technical direction without formal authority
- Comfort participating in an on-call rotation
- Experience helping to design or improve on-call processes, runbooks, and escalation paths
- Minimum level of education: Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related technical field, or equivalent practical experience
- Demonstrated ability to thrive in a remote-first, high-autonomy environment
Responsibilities
- Lead reliability-focused design and readiness reviews for new and existing services
- Build, operate, and continuously improve our observability stack (e.g., logging, metrics, tracing) to provide meaningful dashboards, alerts, and runbooks
- Own and evolve incident management practices, including on-call participation, incident response processes, and post-incident reviews
- Plan and execute disaster recovery exercises and game days to validate our resilience posture
- Perform capacity planning and cost optimization for our cloud infrastructure
- Identify and drive down systemic reliability risks across application, infrastructure, and process layers
- Collaborate closely with Developer Experience, Security, and product engineering to embed reliability best practices into shared tools and CI/CD pipelines
- Participate in and help continuously improve the on-call rotation