Site Reliability Engineer, Inference Infrastructure

Cohere
Artificial Intelligence
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris
Full-Time
Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
AWS, GCP, Kubernetes, C++, Azure, Go, Linux, Distributed Systems

Requirements

  • 5+ years of engineering experience running production infrastructure at a large scale
  • Experience designing large, highly available distributed systems with Kubernetes and GPU workloads
  • Experience with Kubernetes development, production coding, and support
  • Experience with GCP, Azure, AWS, OCI, and multi-cloud, on-prem, or hybrid serving
  • Experience designing, deploying, supporting, and troubleshooting complex Linux-based computing environments
  • Experience in compute/storage/network resource and cost management
  • Familiarity with computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators)
  • Experience in Golang, C++ or other languages designed for high-performance scalable servers

Responsibilities

  • Build self-service systems that automate managing, deploying and operating services.
  • Develop custom Kubernetes operators that support language model deployments.
  • Automate environment observability and resilience.
  • Enable all developers to troubleshoot and resolve problems.
  • Take steps required to ensure we hit defined SLOs, including participation in an on-call rotation.
  • Build strong relationships with internal developers and influence the Infrastructure team’s roadmap.
  • Develop our team through knowledge sharing and an active review process.