Site Reliability Engineer

Posted 3 days ago

💎 Seniority level: Middle, 4+ years

🔍 Industry: Software Development

🏢 Company: Neon Inc.

🗣️ Languages: English

⏳ Experience: 4+ years

Requirements:

4+ years experience working in Site Reliability Engineering

Experience with cloud infrastructure components in Azure and/or AWS

Experience in a complex Linux infrastructure environment

Experience focusing on building repeatable and cost-efficient infrastructure

Experience building solutions for problems with no answers on Google

Experience working with monitoring solutions in the Prometheus ecosystem: Grafana, Loki, Tempo, VictoriaMetrics

Experience managing multi-cluster, multi-cloud Kubernetes deployments

Responsibilities:

Contribute to the foundation all of Neon is built upon

Contribute to building a stable and cost-efficient infrastructure foundation

Play a key role in ensuring we are proactive instead of reactive on infrastructure and reliability

Coach your fellow engineers on cloud, infrastructure, and reliability topics

Be ready to join an on-call rotation

Related Jobs

🔥 Site Reliability Engineer, AI Platform

Posted about 6 hours ago

🧭 Full-Time

🔍 Software Development

🏢 Company: Algolia

👥 501-1000

💰 $150,000,000 Series D over 3 years ago

Semantic Search, Search Engine, Cloud Computing, Vertical Search

🔧 Requirements

Experience with Kubernetes in production environments
Experience with cloud providers (GCP, AWS, or Azure)
Experience with automation and infrastructure as code (e.g., Terraform)
Solid knowledge of CI/CD pipelines and deployment automation
Familiarity with monitoring and observability tools (e.g., Datadog)
Excellent spoken and written English skills

💡 Responsibilities

Implement, maintain, and improve infrastructure for AI Platform
Ensure reliability and performance of Kubernetes-based deployments across cloud providers
Develop and maintain infrastructure as code
Optimize CI/CD pipelines and deployment processes
Enhance monitoring and observability systems
Contribute to incident response and post-mortem analysis

🔥 Junior Site Reliability Engineer, AI Platform

Posted about 7 hours ago

📍 France

🧭 Full-Time

🔍 Software Development

🏢 Company: Algolia

👥 501-1000

💰 $150,000,000 Series D over 3 years ago

Semantic Search, Search Engine, Cloud Computing, Vertical Search

🔧 Requirements

Experience with Kubernetes and container orchestration
Passion for automation and infrastructure as code
Experience with CI/CD pipelines
Familiarity with monitoring and observability tools
Desire to learn and grow in Site Reliability Engineering

💡 Responsibilities

Implement and maintain infrastructure for the AI Platform
Support Kubernetes-based deployments across cloud providers
Contribute to infrastructure as code development
Monitor and improve system reliability and performance

AWS, Python, GCP, Kubernetes, Azure, Go, CI/CD, Terraform

🔥 Senior Site Reliability Engineer

Posted 1 day ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Fetch

🔧 Requirements

1+ year(s) of experience in a software development-oriented role (e.g. Software Engineer, DevOps Engineer, Site Reliability Engineer)
Experience with one or more high-level programming languages (e.g. Java, Python, Go, C/C++)
Experience with cloud infrastructure (AWS strongly preferred)
Experience with containerization technologies (Docker, Kubernetes preferred)
Experience building CI/CD pipelines
Experience with Unix/Linux operating system internals and networking
Experience with analyzing and troubleshooting systems
Experience monitoring and supporting microservice architectures
Bachelor's or higher degree in Computer Science, related technical field, or equivalent practical experience

💡 Responsibilities

Engage in and improve the whole lifecycle of services - from inception and design, through deployment, operation, and refinement
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and readiness reviews
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
Practice sustainable incident response and blameless postmortems by participating in the on-call rotation
Build and support AWS multi-account and multi-region infrastructure using a mix of managed services (e.g. S3, Lambda, RDS, etc.) and containerized infrastructure (e.g. EKS, ECS)
Grow the SRE team by mentoring engineers and participating in the hiring process

AWS, Docker, Python, Software Development, SQL, Amazon RDS, AWS EKS, Bash, Cloud Computing, ElasticSearch, Git, Java, Kubernetes, API testing, Go, Java Spring, CI/CD, RESTful APIs, Linux, Terraform, Microservices, Troubleshooting, Ansible, Scripting, Debugging

🔥 Site Reliability Engineer - (Remote - Canada)

Posted 2 days ago

📍 Canada

🧭 Full-Time

🔍 Site Reliability Engineering

🏢 Company: Jobgether

👥 11-50

💰 $1,493,585 Seed almost 2 years ago

Internet

🔧 Requirements

4+ years of experience in Site Reliability Engineering or similar role
Expertise in Infrastructure as Code with Terraform and Terragrunt
Deep knowledge of AWS cloud services
Experience with Confluent Cloud and Kafka for data streaming
Strong experience with Redis and RDS

💡 Responsibilities

Design, build, and maintain scalable cloud infrastructure using Terraform and Terragrunt
Manage AWS cloud environments for security and high availability
Oversee data streaming platforms with Confluent Cloud and Kafka
Maintain monitoring and alerting solutions using Prometheus and Grafana
Manage Kubernetes clusters with Helm, ArgoCD, and Istio

AWS, ElasticSearch, Kafka, Kubernetes, Grafana, Prometheus, Redis, CI/CD, Terraform

🔥 Senior Site Reliability Engineer

Posted 2 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Biotechnology

🏢 Company: Invert

👥 11-50

💰 $20,149,993 Seed 8 months ago

Data Management, SaaS, Application Performance Management

🔧 Requirements

Experience in cloud infrastructure
Strong incident management skills
Technical skills in software reliability

💡 Responsibilities

Design, build, and maintain scalable cloud infrastructure
Develop and enforce SLIs and SLOs
Create CI/CD pipelines
Lead Incident Management process

AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 2 days ago

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity

🔧 Requirements

Proven experience with SRE/DevOps tools, processes, and culture.
Proficient in programming languages like Python, Go, and TypeScript.
5+ years of experience participating in an SRE on-call rotation.
Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
Strong database management skills, particularly with PostgreSQL.
Experience with infrastructure as code, using tools like Terraform.
Proficient in building and maintaining CI/CD pipelines.
Familiarity with observability tools like Prometheus and similar stacks.

💡 Responsibilities

Plan and implement a global platform for delivering our software as a service.
Diagnose and troubleshoot complex distributed systems.
Ensure observability and analyze the behavior of our stack.
Handle orchestration, deployment, monitoring, and automation.
Participate in our on-call rotation.

🔥 Sr. Site Reliability Engineer (Remote, Mexico)

Posted 2 days ago

🔧 Requirements

Bachelor’s degree (or equivalent) in computer science or related discipline
Knowledge of IaC technologies such as Terraform, Ansible, Puppet, Chef.
Knowledge of cluster creation and management with Kubernetes
Knowledge of Microsoft Azure, AWS, and Google Cloud, including Azure services, Azure virtual machines, and virtual network configuration
Knowledge of design patterns such as IaaS, PaaS, and SaaS
Knowledge of CI/CD
Scripting knowledge with PowerShell
Knowledge of IP addressing and subnet masks
Ability to program (structured and OOP) using one or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript
Experience with distributed storage technologies such as NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, Yarn)

💡 Responsibilities

Responsible for designing, building, maintaining, and scaling production services and server farms across multiple data centers for complex and data-intensive cloud services.
Design and enhance software architecture to improve scalability, service reliability, capacity, and performance.
Write automation code for provisioning and operating infrastructure at massive scale. You are not an operator; you’re an experienced software engineer focused on operations.
Work with development teams to make sure applications fit well within the infrastructure and that scalability and reliability are designed and implemented from the ground up. You will work with QA on building pipelines and automation for delivering and deploying applications to production.
Roll up your sleeves to troubleshoot incidents, formulate and test hypotheses, and narrow down possibilities to find the root cause.
Write postmortem reviews and remediation recommendations.
Identify bad trends before they become problems; respond to automated system alerts, effectively troubleshoot system errors, and work incidents to return systems to normal operating conditions.
Author and update high-quality documentation of all relevant specifications, systems, and procedures.
Support and comply with the company’s Quality Management System policies and procedures.

🔥 Lead Site Reliability Engineer

Posted 3 days ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Neon Inc.

🔧 Requirements

2+ years in an Engineering Management role
5+ years of hands-on coding experience
Cloud experience with Azure and/or AWS
Strong knowledge of Kubernetes
Monitoring experience with Prometheus ecosystem
Excellent English communication skills

💡 Responsibilities

Manage a high-performing distributed team
Identify and eliminate obstacles
Coach and mentor engineers
Optimize processes and tech debt
Foster collaboration
Align projects with business goals
Maintain a scalable on-call process
Recruit and hire Software Engineers

AWS, Kubernetes, Azure, Go, Grafana, Postgres, Prometheus, Linux

🔥 Senior II Site Reliability Engineer, Infrastructure

Posted 3 days ago

📍 Canada

💸 147,500 - 173,500 CAD per year

🔍 Software Development

🏢 Company: Life360

👥 251-500

💰 $33,038,258 Post-IPO Equity over 2 years ago

🫂 Last layoff about 2 years ago

Android, Family, Apps, Mobile Apps, Mobile

🔧 Requirements

3+ years of experience programming in Java, Python, or other formal programming language
Expert level experience (3+ years) managing medium to large-scale deployments on AWS (~5000 instances, 50+ accounts)
Strong Kubernetes experience (2+ years) deploying and managing at scale (100s of Deployments, 10k+ containers, 20k+ cores).
Strong Linux administration experience, shell/bash scripting.
Expert level experience with Infrastructure as code tools: Terraform, CloudFormation; config management/provisioning tools: Ansible, Chef, etc.

💡 Responsibilities

Be opinionated on technical direction and strategy (and document those opinions so others can follow them)
Lead and mentor other engineers on the team
Help implement or diagnose the thorniest problems seen
Participate in shared on-call rotation (roughly one week every six weeks on call)
Estimate schedules, breaking work down into reasonable 1-3 day tasks
Optimize for cost efficiency

AWS, Docker, Python, SQL, Bash, Cloud Computing, Java, Jenkins, Kafka, Kubernetes, REST API, CI/CD, Linux, Terraform, Microservices, Networking, Troubleshooting, Ansible, Scripting

🔥 Site Reliability Engineer - SecDevOps

Posted 4 days ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: trivago

👥 1001-5000

💰 $52,541,981 Private about 14 years ago

🫂 Last layoff almost 5 years ago

Internet, Hospitality, Marketing, Information Technology, Hotel, Travel

🔧 Requirements

Over 5 years of expertise in SecDevOps or Cyber Security
Bachelor's degree in Computer Science or related field
Strong understanding of security frameworks and regulations
Proficiency in programming languages like Java, Kotlin, or Python
Good understanding of web application security principles

💡 Responsibilities

Develop and deploy hybrid cloud and on-premises solutions
Collaborate across security domains
Inspire engineers in secure design and operation
Raise security awareness company-wide
Identify cloud security needs and shape strategy

Docker, PostgreSQL, Python, Cybersecurity, GCP, Java, Kafka, Kotlin, Kubernetes, MySQL, Terraform, Ansible

Related Articles

Remote Job Certifications and Courses to Boost Your Career

Posted 6 months ago

Insights into the evolving landscape of remote work in 2024 reveal the importance of certifications and continuous learning. This article breaks down emerging trends and sought-after certifications, and provides practical solutions for enhancing your employability and expertise. What skills will be essential for remote job seekers, and how can you navigate this dynamic market to secure your dream role?

How to Balance Work and Life While Working Remotely

Posted 6 months ago

Explore the challenges and strategies of maintaining work-life balance while working remotely. Learn about unique aspects of remote work, associated challenges, historical context, and effective strategies to separate work and personal life.

Weekly Digest: Remote Jobs News and Trends (August 11 - August 18, 2024)

Posted 6 months ago

Google is gearing up to expand its remote job listings, promising more opportunities across various departments and regions. Find out how this move can benefit job seekers and impact the market.

How to Onboard Remote Employees Successfully

Posted 6 months ago

Learn about the importance of pre-onboarding preparation for remote employees, including checklist creation, documentation, tools and equipment setup, communication plans, and feedback strategies. Discover how proactive pre-onboarding can enhance job performance, increase retention rates, and foster a sense of belonging from day one.

Remote Work Statistics and Insights for 2024

Posted 6 months ago

The article explores the current statistics for remote work in 2024, covering the percentage of the global workforce working remotely, growth trends, popular industries and job roles, geographic distribution of remote workers, demographic trends, work models comparison, job satisfaction, and productivity insights.
