Senior Site Reliability Engineer

Transcend Inc.
B2B SaaS, Security, Data, Privacy
United States, Full-Time, Senior
Salary: 170,000 - 185,000 USD per year

Job Details

Experience
5+ years
Required Skills
AWS, Python, JavaScript, TypeScript, Terraform, Datadog, CloudFormation

Requirements

  • 5+ years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or a closely related role
  • Hands-on ownership of production systems
  • Strong experience operating modern cloud infrastructure, ideally on AWS
  • Proficiency with at least one programming language used at Transcend (e.g., JavaScript, TypeScript, or Python)
  • Comfort reading and reviewing application code for reliability and performance concerns
  • Hands-on experience with infrastructure-as-code and CI/CD tooling (e.g., Terraform, CloudFormation, or similar; modern build/deploy pipelines)
  • Deep familiarity with observability and monitoring systems (e.g., Datadog or equivalent)
  • Proven track record running incident response and post-incident analysis
  • Excellent communication and collaboration skills
  • Experience working across multiple engineering teams to align on reliability goals, share context, and influence technical direction without formal authority
  • Comfort participating in an on-call rotation
  • Experience helping to design or improve on-call processes, runbooks, and escalation paths
  • Minimum level of education: Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related technical field, or equivalent practical experience
  • Demonstrated ability to thrive in a remote-first, high-autonomy environment

Responsibilities

  • Lead reliability-focused design and readiness reviews for new and existing services
  • Build, operate, and continuously improve our observability stack (e.g., logging, metrics, tracing) to provide meaningful dashboards, alerts, and runbooks
  • Own and evolve incident management practices, including on-call participation, incident response processes, and post-incident reviews
  • Plan and execute disaster recovery exercises and game days to validate our resilience posture
  • Perform capacity planning and cost optimization for our cloud infrastructure
  • Identify and drive down systemic reliability risks across application, infrastructure, and process layers
  • Collaborate closely with Developer Experience, Security, and product engineering to embed reliability best practices into shared tools and CI/CD pipelines
  • Participate in and help continuously improve the on-call rotation