Senior SRE Engineer

Posted 2024-10-28

💎 Seniority level: Senior, 5+ years

📍 Location: USA

💸 Salary: 170000 - 210000 USD per year

🔍 Industry: Artificial Intelligence

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Communication Skills, Terraform

Requirements:
  • Must have proven customer-facing and customer support experience.
  • 5+ years of experience in Linux system internals, scripting, and configuration management tools.
  • 5+ years of experience running production systems in the cloud using containerization technologies.
  • 5+ years of experience with cloud-based infrastructure services and related tools.
  • 5+ years of experience with monitoring applications.
  • 5+ years of experience using Helm charts to package, configure, and deploy Kubernetes applications.
  • Excellent communication skills, both verbal and written.
  • Passionate about troubleshooting and investigating in unfamiliar environments.
Responsibilities:
  • Develop and maintain all deployment options for Comet, including multi-cloud, on-premises, and bare-metal deployments.
  • Utilize Helm charts to package, configure, and deploy Kubernetes applications efficiently.
  • Quickly identify and resolve infrastructure bugs, ensuring high system availability and reliability.
  • Work closely with customers to understand their deployment needs and provide effective support.
  • Collaborate with cross-functional teams to ensure seamless integration and deployment of new features and updates.

Related Jobs

🔥 Senior SRE Engineer
Posted 2024-11-07

📍 United States

💸 130000 - 170000 USD per year

🔍 Data-Powered Marketing Cloud

🏢 Company: Zeta Global

  • 7+ years of experience as an SRE.
  • 3+ years of software development experience, emphasizing automation.
  • Hands-on experience with Infrastructure as Code (IaC) tools.
  • Experience with distributed systems and microservices architecture.
  • Production experience with distributed tracing.
  • Proficiency in Python and Bash scripting.
  • Solid understanding of SLIs, SLOs, and error budgets.
  • Experience with CI/CD tooling and practices such as GitOps or Jenkins.
  • Expertise in incident management and root cause analysis.
  • Knowledge of modern deployment strategies like Canary and Blue-Green.
  • Familiarity with resiliency patterns such as circuit breakers and load balancing.
  • Experience with SQL and NoSQL databases in distributed systems.
  • Proficiency in statistical analysis related to metrics.
  • Experience with high-performance and low-latency systems.
  • Experience with cloud cost optimization strategies.
  • Familiarity with distributed messaging systems like Kafka.
  • Strong understanding of security and compliance standards in SRE.

  • Implement and manage service level objectives (SLOs), service level indicators (SLIs), and error budgets.
  • Lead and promote postmortems, driving robust root cause analysis for continuous system improvement.
  • Analyze historical data to identify areas for improvement.
  • Implement full observability using tools like OpenTelemetry, Honeycomb, New Relic, or Datadog.
  • Reduce toil through runbook automation and record key MTTx metrics.
  • Lead design sessions focusing on capacity planning and automation.
  • Collaborate with product teams to enhance reliability and engage in strategic initiatives.

Python, Software Development, SQL, Bash, Jenkins, Kafka, NoSQL, CI/CD, DevOps, Microservices, Compliance


📍 US, Canada

🧭 Full-Time

🔍 Cybersecurity

🏢 Company: Operant AI

  • 3+ years of active hands-on SRE experience in a fast-paced engineering organization working with SaaS/cloud-native products.
  • Hands-on experience with Kubernetes, Golang, and Python.
  • Knowledge of major cloud providers like AWS, GCP, Azure and their automation toolchains.
  • Excellent communication skills and ability to work independently.

  • Build and lead the DevOps/SRE functions for the company's product including monitoring for availability, security, reliability, and scale.
  • Document and codify best practices around operational behavior.
  • Build infrastructure for incident management, CI/CD, and maintain security best practices.
  • Track SOC2 compliance and create on-call schedules.

AWS, Python, Cybersecurity, GCP, Kubernetes, Azure, Golang, Communication Skills, CI/CD

Posted 2024-10-25