Manager, Software Engineering (Resilience Engineering)
Affirm (FinTech)
Almost anywhere within the country of employment | Full-Time | Manager
Salary: 200,000 - 275,000 USD per year
Job Details
Required Skills
- AWS, Python, Java, Kotlin, Kubernetes
Requirements
- Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
- Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
- Experience using a chaos engineering vendor such as Gremlin, Harness, or a similar tool.
- Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages.
- Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails, auditability).
- Familiarity with cloud-native environments (AWS, Kubernetes) and observability tooling.
- Strong programming background (e.g., Python, Kotlin, Java, or similar).
- Excellent problem-solving skills and the ability to balance long-term resilience investments with immediate business needs.
- Strong communication and leadership skills, with a track record of influencing engineering practices across teams.
- Equivalent practical experience or a Bachelor’s degree in a related field.
Responsibilities
- Define and drive the vision for resilience engineering at Affirm, with a focus on production load testing and chaos engineering as first-class engineering practices.
- Lead and mentor a team of engineers building platforms and tooling for safe production experimentation.
- Partner with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle.
- Establish best practices for safely testing system limits and failure scenarios in production.
- Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection.
- Ensure strong safeguards are in place, including isolation boundaries, approval workflows, and automated rollback mechanisms to protect real users.
- Build systems that provide end-to-end observability, traceability, and auditability for all resilience experiments.
- Drive reliability improvements by systematically identifying weaknesses through load testing and chaos experiments.
- Establish monitoring, alerting, and incident response practices tailored to proactive resilience validation.
- Work closely with engineering teams to design and execute production load tests and chaos experiments safely.
- Partner with infrastructure teams to build guardrails around tests and experiments.
- Enable teams to adopt resilience practices by providing reusable tooling, frameworks, and standardized workflows.
- Identify systemic weaknesses and lead cross-functional efforts to improve reliability and fault tolerance.
- Evangelize a culture of “test failure before failure tests you” across the organization.