Principal Engineer, AI Inference Reliability

Posted 2 months ago

United States, Canada | Full-Time | AI Inference

Company: Cerebras Systems

Location: United States, Canada, EST, PST

Languages: English

Seniority level: Principal, 7+ years

Experience: 7+ years

Skills:

Backend Development, Leadership, Python, Software Development, Artificial Intelligence, Cloud Computing, Kubernetes, Machine Learning, C++, Cross-functional Team Leadership, Go, Rust, CI/CD, Linux, DevOps, Problem Solving, Mentoring

Requirements:

Bachelor's or master's degree in computer science or a related field.

7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems.

Strong programming skills in at least one widely used backend language such as Python, C++, Go, or Rust.

Deep, hard-earned experience with reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture.

Excellent communication and cross-functional leadership skills.

Bonus: prior experience building large-scale AI infrastructure systems.

Responsibilities:

Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.

Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers.

Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents.

Architect for reliability and observability: influence system design for redundancy, durability, and debuggability.

Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection.

Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service.

Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights.

Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems.