Site Reliability Engineer

Posted 3 days ago

💎 Seniority level: Middle, 4+ years

🔍 Industry: Software Development

🏢 Company: Neon Inc.

🗣️ Languages: English

⏳ Experience: 4+ years

Requirements:
  • 4+ years experience working in Site Reliability Engineering
  • Experience with cloud infrastructure components in Azure and/or AWS
  • Experience in a complex Linux infrastructure environment
  • Experience focusing on building repeatable and cost-efficient infrastructure
  • Experience building solutions for problems with no answers on Google
  • Experience working with monitoring solutions in the Prometheus ecosystem: Grafana, Loki, Tempo, VictoriaMetrics
  • Experience managing multi-cluster, multi-cloud Kubernetes deployments
Responsibilities:
  • Contribute to the foundation all of Neon is built upon
  • Contribute to building a stable and cost-efficient infrastructure foundation
  • Play a key role in ensuring we are proactive instead of reactive on infrastructure and reliability
  • Coach your fellow engineers on cloud, infrastructure, and reliability topics
  • Be ready to join an on-call rotation

Related Jobs

🧭 Full-Time

πŸ” Software Development

🏒 Company: AlgoliaπŸ‘₯ 501-1000πŸ’° $150,000,000 Series D over 3 years agoSemantic SearchSearch EngineCloud ComputingVertical Search

Requirements:
  • Experience with Kubernetes in production environments
  • Experience with cloud providers (GCP, AWS, or Azure)
  • Experience with automation and infrastructure as code (e.g., Terraform)
  • Solid knowledge of CI/CD pipelines and deployment automation
  • Familiarity with monitoring and observability tools (e.g., Datadog)
  • Excellent spoken and written English skills
Responsibilities:
  • Implement, maintain, and improve infrastructure for AI Platform
  • Ensure reliability and performance of Kubernetes-based deployments across cloud providers
  • Develop and maintain infrastructure as code
  • Optimize CI/CD pipelines and deployment processes
  • Enhance monitoring and observability systems
  • Contribute to incident response and post-mortem analysis
Posted about 6 hours ago

📍 France

🧭 Full-Time

🔍 Software Development

🏢 Company: Algolia

👥 501-1000

💰 $150,000,000 Series D over 3 years ago

Semantic Search, Search Engine, Cloud Computing, Vertical Search

Requirements:
  • Experience with Kubernetes and container orchestration
  • Passion for automation and infrastructure as code
  • Experience with CI/CD pipelines
  • Familiarity with monitoring and observability tools
  • Desire to learn and grow in Site Reliability Engineering
Responsibilities:
  • Implement and maintain infrastructure for the AI Platform
  • Support Kubernetes-based deployments across cloud providers
  • Contribute to infrastructure as code development
  • Monitor and improve system reliability and performance

AWS, Python, GCP, Kubernetes, Azure, Go, CI/CD, Terraform

Posted about 7 hours ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Fetch

Requirements:
  • 1+ year(s) of experience in a software development-oriented role (e.g. Software Engineer, DevOps Engineer, Site Reliability Engineer)
  • Experience with one or more high-level programming languages (e.g. Java, Python, Go, C/C++)
  • Experience with cloud infrastructure (AWS strongly preferred)
  • Experience with containerization technologies (Docker, Kubernetes preferred)
  • Experience building CI/CD pipelines
  • Experience with Unix/Linux operating system internals and networking
  • Experience with analyzing and troubleshooting systems
  • Experience monitoring and supporting microservice architectures
  • Bachelor's or higher degree in Computer Science, related technical field, or equivalent practical experience
Responsibilities:
  • Engage in and improve the whole lifecycle of services - from inception and design, through deployment, operation, and refinement
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and readiness reviews
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Practice sustainable incident response and blameless postmortems by participating in the on-call rotation
  • Build and support AWS multi-account and multi-region infrastructure using a mix of managed services (e.g. S3, Lambda, RDS, etc.) and containerized infrastructure (e.g. EKS, ECS)
  • Grow the SRE team by mentoring engineers and participating in the hiring process

AWS, Docker, Python, Software Development, SQL, Amazon RDS, AWS EKS, Bash, Cloud Computing, ElasticSearch, Git, Java, Kubernetes, API testing, Go, Java Spring, CI/CD, RESTful APIs, Linux, Terraform, Microservices, Troubleshooting, Ansible, Scripting, Debugging

Posted 1 day ago

📍 Canada

🧭 Full-Time

🔍 Site Reliability Engineering

🏢 Company: Jobgether

👥 11-50

💰 $1,493,585 Seed almost 2 years ago

Internet

Requirements:
  • 4+ years of experience in Site Reliability Engineering or similar role
  • Expertise in Infrastructure as Code with Terraform and Terragrunt
  • Deep knowledge of AWS cloud services
  • Experience with Confluent Cloud and Kafka for data streaming
  • Strong experience with Redis and RDS
Responsibilities:
  • Design, build, and maintain scalable cloud infrastructure using Terraform and Terragrunt
  • Manage AWS cloud environments for security and high availability
  • Oversee data streaming platforms with Confluent Cloud and Kafka
  • Maintain monitoring and alerting solutions using Prometheus and Grafana
  • Manage Kubernetes clusters with Helm, ArgoCD, and Istio

AWS, ElasticSearch, Kafka, Kubernetes, Grafana, Prometheus, Redis, CI/CD, Terraform

Posted 2 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Biotechnology

🏢 Company: Invert

👥 11-50

💰 $20,149,993 Seed 8 months ago

Data Management, SaaS, Application Performance Management

Requirements:
  • Experience in cloud infrastructure
  • Strong incident management skills
  • Technical skills in software reliability
Responsibilities:
  • Design, build, and maintain scalable cloud infrastructure
  • Develop and enforce SLIs and SLOs
  • Create CI/CD pipelines
  • Lead Incident Management process

AWS, Docker, CI/CD, Linux, Terraform

Posted 2 days ago

🧭 Fulltime

πŸ” Software Development

🏒 Company: Sanity

Requirements:
  • Proven experience with SRE/DevOps tools, processes, and culture.
  • Proficient in programming languages like Python, Go, and TypeScript.
  • 5+ years of experience participating in an SRE on-call rotation.
  • Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
  • Strong database management skills, particularly with PostgreSQL.
  • Experience with infrastructure as code, using tools like Terraform.
  • Proficient in building and maintaining CI/CD pipelines.
  • Familiarity with observability tools like Prometheus and similar stacks.
Responsibilities:
  • Plan and implement a global platform for delivering our software as a service.
  • Diagnose and troubleshoot complex distributed systems.
  • Ensure observability and analyze the behavior of our stack.
  • Handle orchestration, deployment, monitoring, and automation.
  • Participate in our on-call rotation.
Posted 2 days ago

Requirements:
  • Bachelor's degree (or equivalent) in computer science or related discipline
  • Knowledge of IaC technologies such as Terraform, Ansible, Puppet, and Chef
  • Knowledge of cluster creation and management with Kubernetes
  • Knowledge of Microsoft Azure, AWS, and Google Cloud, including Azure services, Azure virtual machines, and virtual network configuration
  • Knowledge of design patterns such as IaaS, PaaS, and SaaS
  • Knowledge of CI/CD
  • Scripting knowledge with PowerShell
  • Knowledge of IP addressing and subnet masks
  • Ability to program (structured and OOP) using one or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript
  • Experience with distributed storage technologies such as NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN)
Responsibilities:
  • Design, build, maintain, and scale production services and server farms across multiple data centers for complex and data-intensive cloud services.
  • Design and enhance software architecture to improve scalability, service reliability, capacity, and performance.
  • Write automation code for provisioning and operating infrastructure at massive scale. You are not an operator; you're an experienced software engineer focused on operations.
  • Work with development teams to make sure applications fit well within the infrastructure and that scalability and reliability are designed and implemented from the ground up. You will work with QA on building pipelines and automation for delivering and deploying applications to production.
  • Roll up your sleeves to troubleshoot incidents, formulate theories, test your hypotheses, and narrow down possibilities to find the root cause.
  • Write postmortem reviews and remediation recommendations.
  • Identify bad trends before they become problems; respond to automated system alerts, effectively troubleshoot system errors, and work incidents to return systems to normal operating conditions.
  • Author and update high-quality documentation of all relevant specifications, systems, and procedures.
  • Support and comply with the company's Quality Management System policies and procedures.
Posted 2 days ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Neon Inc.

Requirements:
  • 2+ years in an Engineering Management role
  • 5+ years of hands-on coding experience
  • Cloud experience with Azure and/or AWS
  • Strong knowledge of Kubernetes
  • Monitoring experience with Prometheus ecosystem
  • Excellent English communication skills
Responsibilities:
  • Manage a high-performing distributed team
  • Identify and eliminate obstacles
  • Coach and mentor engineers
  • Optimize processes and tech debt
  • Foster collaboration
  • Align projects with business goals
  • Maintain a scalable on-call process
  • Recruit and hire Software Engineers

AWS, Kubernetes, Azure, Go, Grafana, Postgres, Prometheus, Linux

Posted 3 days ago

📍 Canada

💸 147,500 - 173,500 CAD per year

🔍 Software Development

🏢 Company: Life360

👥 251-500

💰 $33,038,258 Post-IPO Equity over 2 years ago

🫂 Last layoff about 2 years ago

Android, Family, Apps, Mobile Apps, Mobile

Requirements:
  • 3+ years of experience programming in Java, Python, or another formal programming language
  • Expert-level experience (3+ years) managing medium to large-scale deployments on AWS (~5000 instances, 50+ accounts)
  • Strong Kubernetes experience (2+ years) deploying and managing at scale (100s of Deployments, 10k+ containers, 20k+ cores)
  • Strong Linux administration experience and shell/bash scripting
  • Expert-level experience with infrastructure-as-code tools (Terraform, CloudFormation) and config management/provisioning tools (Ansible, Chef, etc.)
Responsibilities:
  • Be opinionated on technical direction and strategy, and document those opinions so others can follow them
  • Lead and mentor other engineers on the team
  • Help implement or diagnose the thorniest of the problems seen
  • Participate in shared on-call rotation (roughly one week every six weeks on call)
  • Estimate schedules, breaking work down into reasonable 1-3 day tasks
  • Optimize for cost efficiency

AWS, Docker, Python, SQL, Bash, Cloud Computing, Java, Jenkins, Kafka, Kubernetes, REST API, CI/CD, Linux, Terraform, Microservices, Networking, Troubleshooting, Ansible, Scripting

Posted 3 days ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: trivago

👥 1001-5000

💰 $52,541,981 Private about 14 years ago

🫂 Last layoff almost 5 years ago

Internet, Hospitality, Marketing, Information Technology, Hotel, Travel

Requirements:
  • Over 5 years of expertise in SecDevOps or Cyber Security
  • Bachelor's degree in Computer Science or related field
  • Strong understanding of security frameworks and regulations
  • Proficiency in programming languages like Java, Kotlin, or Python
  • Good understanding of web application security principles
Responsibilities:
  • Develop and deploy hybrid cloud and on-premises solutions
  • Collaborate across security domains
  • Inspire engineers in secure design and operation
  • Raise security awareness company-wide
  • Identify cloud security needs and shape strategy

Docker, PostgreSQL, Python, Cybersecurity, GCP, Java, Kafka, Kotlin, Kubernetes, MySQL, Terraform, Ansible

Posted 4 days ago

Related Articles

Posted 6 months ago

Insights into the evolving landscape of remote work in 2024 reveal the importance of certifications and continuous learning. This article breaks down emerging trends, sought-after certifications, and provides practical solutions for enhancing your employability and expertise. What skills will be essential for remote job seekers, and how can you navigate this dynamic market to secure your dream role?

Posted 6 months ago

Explore the challenges and strategies of maintaining work-life balance while working remotely. Learn about unique aspects of remote work, associated challenges, historical context, and effective strategies to separate work and personal life.

Posted 6 months ago

Google is gearing up to expand its remote job listings, promising more opportunities across various departments and regions. Find out how this move can benefit job seekers and impact the market.

Posted 6 months ago

Learn about the importance of pre-onboarding preparation for remote employees, including checklist creation, documentation, tools and equipment setup, communication plans, and feedback strategies. Discover how proactive pre-onboarding can enhance job performance, increase retention rates, and foster a sense of belonging from day one.

Posted 6 months ago

The article explores the current statistics for remote work in 2024, covering the percentage of the global workforce working remotely, growth trends, popular industries and job roles, geographic distribution of remote workers, demographic trends, work models comparison, job satisfaction, and productivity insights.