Senior Site Reliability Engineer, Core AI Infrastructure

Posted 8 days ago

💎 Seniority level: Senior

📍 Location: USA

💸 Salary: 186,065 - 218,900 USD per year

🔍 Industry: Software Development

🏢 Company: Coinbase Careers Page 👥 1000-5000

🗣️ Languages: English

🪄 Skills: AWS, Backend Development, Docker, Python, Bash, Cloud Computing, GCP, Java, Kubernetes, Go, CI/CD, RESTful APIs, Terraform, Ansible, Scripting

Requirements:
  • Proven experience as a Site Reliability Engineer (SRE) or similar role.
  • Strong understanding of AI technologies and platforms.
  • Experience with deploying and managing applications in a cloud environment (AWS/GCP).
  • Solid backend development experience with programming languages such as Python, Java, or Go.
  • Strong proficiency in managing and configuring public cloud services (AWS/GCP) for scalability and reliability.
  • Experience with automation tools and scripting (e.g., Ansible, Terraform, Bash, Python).
  • Excellent troubleshooting and problem-solving skills.
  • Strong communication and collaboration skills.
  • Strong security and compliance understanding.
  • Experience working in a highly regulated environment.
  • Experience in a fast-paced, high-growth company.
Responsibilities:
  • Deploy, configure, and manage AI-powered employee productivity tools and AI solutions built in-house
  • Ensure high availability, reliability, and optimal performance of AI platforms and services. Implement monitoring, alerting, and incident response procedures
  • Design and implement scalable infrastructure to support the growing demands of AI tools and their user base. Optimize resource utilization and manage capacity planning
  • Develop and maintain automation scripts and tools to streamline deployment, monitoring, and maintenance tasks. Contribute to the experimental sandbox environments for testing new AI solutions
  • Collaborate with cross-functional teams (Machine Learning, HR, Security, Data Science, Developer Experience) to support the development and integration of AI solutions. Provide technical support and troubleshooting for AI-related issues
  • Adhere to security and privacy policies while deploying and managing AI tools. Ensure compliance with regulatory requirements
  • Implement comprehensive monitoring and metrics to track the performance and health of AI systems. Analyze data to identify areas for improvement and optimization
  • Participate in incident response and troubleshooting for AI-related outages or performance issues. Develop and maintain incident response plans
  • Contribute to backend development tasks to support the integration and functionality of AI tools
  • Deploy and manage AI solutions on public cloud platforms (AWS/GCP), leveraging cloud-native services and best practices
  • Communicate clearly and present technical information to non-technical audiences, including senior leadership