Principal Engineer, AI Inference Reliability

Posted 2 months ago
United States, Canada
Full-Time
AI Inference
Company: Cerebras Systems
Location: United States, Canada, EST, PST
Languages: English
Seniority level: Principal, 7+ years
Experience: 7+ years
Skills:
Backend Development, Leadership, Python, Software Development, Artificial Intelligence, Cloud Computing, Kubernetes, Machine Learning, C++, Cross-functional Team Leadership, Go, Rust, CI/CD, Linux, DevOps, Problem Solving, Mentoring
Requirements:
Bachelor's or master's degree in computer science or a related field.
7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems.
Strong programming skills in at least one widely used backend language such as Python, C++, Go, or Rust.
Deep, hard-earned experience with reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture.
Excellent communication and cross-functional leadership skills.
Bonus: prior experience building large-scale AI infrastructure systems.
Responsibilities:
Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.
Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers.
Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents.
Architect for reliability and observability: influence system design for redundancy, durability, and debuggability.
Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection.
Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service.
Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights.
Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems.
About the Company
Cerebras Systems
251-500 employees
Computer