Senior Site Reliability Engineer

Posted about 2 months ago

💎 Seniority level: Senior, 5+ years

📍 Location: US

💸 Salary: 176,000 - 205,300 USD per year

🔍 Industry: Logistics

🏢 Company: Flexe

⏳ Experience: 5+ years

🪄 Skills: AWS, Software Development, Cloud Computing, GCP, Kubernetes, Azure, CI/CD, DevOps, Terraform

Requirements:

5+ years of commercial Software Development / Site Reliability / DevOps experience.
Bachelor’s degree in Computer Science/Engineering or equivalent experience.
Experience with software engineering in both scripted and compiled languages.
Proven track record of empowering other engineers and maintaining infrastructure as code.
Experience with at least one cloud platform provider, such as Google Cloud Platform, Amazon Web Services, or Microsoft Azure.
Experience with monitoring, log management, and CI/CD tools for distributed systems.
Experience with or a desire to learn technologies such as Kubernetes, Istio, and Terraform.

Responsibilities:

Evolving the technology at the heart of our business.
Architecting and implementing systems that will impact our customers directly.
Working directly with customers and partners to design, implement, and deliver large scale technical projects.
Establishing goals for individual growth and mentoring other engineers.
Contributing to a healthy and productive culture by designing resilient systems.

Related Jobs

🔥 Senior Site Reliability Engineer

Posted 4 days ago

📍 United States

💸 99,300 - 124,100 USD per year

🔍 Software Development

🏢 Company: Natera

👥 1001-5000

💰 $250,000,000 Post-IPO Equity over 1 year ago

🫂 Last layoff almost 2 years ago

Women's Biotechnology Medical Genetics Health Diagnostics

🔧 Requirements

Strong all-around experience with Amazon Web Services; AWS certification preferred.
Experience with CloudFormation and Lambda / Serverless as part of infrastructure.
Solid experience with EKS; Kubernetes CKA certification preferred.
Strong experience with Terraform.
3+ years of experience with programming languages such as Python, Java, or similar for scripting, automation, and building tools.
Good understanding of Docker and Linux / Unix administration.
Practical experience building CI/CD pipelines using GitLab or similar tools.
Practical experience managing applications deployed with Docker in the cloud.
Experience with container orchestration tools.
Strong communication skills, including the ability to justify and stand behind the proper solution.

💡 Responsibilities

Develop automation and CI/CD processes to enable teams to build, test, deploy, manage, configure, secure, scale and monitor their applications using the latest technologies such as Docker, Kubernetes, Terraform and others.
Manage R&D AWS Infrastructure and accounts.
Work closely with teams inside R&D to investigate areas of improvement and eliminate bottlenecks.
Build and deploy cloud-based infrastructure to support R&D.
Participate in architectural decisions to help improve the quality of our infrastructure and applications.
Work closely with groups inside and outside R&D on the best overall systems design and operations.
Be a cloud expert for your team and R&D teams.

🪄 Skills: AWS, Docker, Python, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

🔥 Senior Site Reliability Engineer

Posted 17 days ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Fetch

🔧 Requirements

1+ year(s) of experience in a software development-oriented role (e.g. Software Engineer, DevOps Engineer, Site Reliability Engineer).
Experience with one or more high-level programming languages (e.g. Java, Python, Go, C/C++).
Experience with cloud infrastructure (AWS strongly preferred).
Experience with containerization technologies (Docker, Kubernetes preferred).
Experience building CI/CD pipelines.
Experience with Unix/Linux operating system internals and networking.
Experience with analyzing and troubleshooting systems.
Experience monitoring and supporting microservice architectures.
Bachelor's or higher degree in Computer Science, related technical field, or equivalent practical experience.

💡 Responsibilities

Engage in and improve the whole lifecycle of services - from inception and design, through deployment, operation, and refinement.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and readiness reviews.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems by participating in the on-call rotation.
Build and support AWS multi-account and multi-region infrastructure using a mix of managed services (e.g. S3, Lambda, RDS) and containerized infrastructure (e.g. EKS, ECS).
Grow the SRE team by mentoring engineers and participating in the hiring process.

🪄 Skills: AWS, Docker, Python, Software Development, SQL, Amazon RDS, AWS EKS, Bash, Cloud Computing, ElasticSearch, Git, Java, Kubernetes, API testing, Go, Java Spring, CI/CD, RESTful APIs, Linux, Terraform, Microservices, Troubleshooting, Ansible, Scripting, Debugging

🔥 Senior Site Reliability Engineer

Posted 18 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert

👥 11-50

💰 $20,149,993 Seed 8 months ago

Data Management SaaS Application Performance Management

🔧 Requirements

Experience in cloud infrastructure management
Knowledge of CI/CD processes
Experience with incident management

💡 Responsibilities

Design, build, and maintain scalable and secure cloud infrastructure as code
Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
Enable cost transparency and optimize infrastructure spending
Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
Build and maintain robust CI/CD pipelines that accelerate time from code to customer
Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
Lead and continuously improve our Incident Management process
Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
Develop and maintain incident response playbooks and post-mortem practices

🪄 Skills: AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 22 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured

Cloud Data Services B2B Cloud Security Cyber Security

🔧 Requirements

Experience in a start-up environment
Experience designing and maintaining highly available database solutions, ideally PostgreSQL
Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
Strong engineering background
Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)

💡 Responsibilities

Provision infrastructure and tooling
Create automated tooling to maintain the platform
Build methods for monitoring and scaling services
Implement security compliance strategies
Lead and mentor the engineering team

🪄 Skills: AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

🔥 Senior Site Reliability Engineer - Americas

Posted 29 days ago

📍 Americas

🧭 Full-Time

💸 160,000 - 180,000 USD per year

🔍 Software Development

🏢 Company: Customer.io

👥 251-500

💰 Series A about 3 years ago

Digital Media SaaS Product Search Software

🔧 Requirements

7+ years of professional experience as a Site Reliability Engineer, with proven experience leading large complex projects affecting production SaaS environments.
Professional experience with relational database systems, particularly MySQL, including managing servers and tuning performance.
Proven experience handling scale, reliability, and performance challenges while managing distributed applications on cloud infrastructure (Google Cloud Platform is advantageous), using both managed and self-hosted solutions.
Proven ability to build cloud infrastructure using Terraform and develop operational tooling in various languages including Golang and Bash.
Deep knowledge of UNIX environments and modern collaborative development practices.
Excellent communication skills, both verbal and written, with a collaborative mindset to make informed, empathetic decisions.
Ability to work autonomously in your timezone, advancing tasks and projects with minimal guidance.
Demonstrated ability to influence product direction and contribute technical insights that help drive business value.
A strong focus on proactively identifying and resolving issues in production environments.
A self-starter who thrives in both synchronous and asynchronous work environments.

💡 Responsibilities

Architect and maintain critical infrastructure to enable Customer.io to scale and handle real-time processing of billions of messages.
Strategically plan and implement infrastructure growth to meet evolving demands and ensure repeatability.
Streamline and automate processes for efficiency and reliability, removing manual toil.
Participate in on-call rotations to swiftly address availability incidents and support technical engineers with customer-related issues.
Develop observability to ensure comprehensive monitoring and effective alerting of infrastructure and applications.
Troubleshoot and resolve production issues across various services and stack levels.
Contribute to a collaborative and supportive team environment, fostering individual, professional, and team growth.
Engage in continuous learning and knowledge sharing through code reviews, pair programming, and team collaborations to refine best practices.

🪄 Skills: Backend Development, SQL, Bash, Cloud Computing, GCP, Kubernetes, MySQL, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, SaaS

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

🔧 Requirements

5+ years of AWS experience.
3+ years of Kubernetes experience.
Ruby on Rails experience.
Expert at profiling and benchmarking source code.
Effective at code review and at identifying potential performance problems before they reach production.
Experience with Datadog or other APM tools.
Excellent written and verbal communication skills.

💡 Responsibilities

Manage cloud infrastructure using Infrastructure as Code.
Manage and scale a Ruby on Rails stack.
Implement monitoring tools to improve observability.
Perform code review of new features to ensure they meet performance requirements.
Debug production issues across all levels of the stack.
Plan for the growth of Fleetio’s infrastructure, and optimize and automate it.

🪄 Skills: AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

🔥 Senior Site Reliability Engineer, Database Operations: ClickHouse

Posted about 1 month ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117,600 - 252,000 USD per year

🔍 Software Development

🏢 Company: GitLab

👥 1001-5000

💰 $268,000,000 Series E over 5 years ago

🫂 Last layoff about 2 years ago

Developer Tools DevOps Open Source SaaS Cloud Security

🔧 Requirements

Advanced database platform management experience, preferably using Postgres and ClickHouse at scale.
Advanced experience with cloud infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators, and Kubernetes.
Solid experience with at least one programming language: Go, Ruby, or Python.
Advanced experience with Linux.
Extensive on-call experience as an SRE supporting mission-critical systems.
Solid incident management experience across all phases.
Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.

💡 Responsibilities

Design, build, and maintain ClickHouse and PostgreSQL clusters.
Provision cloud infrastructure using configuration management and IaC tools.
Implement high-availability ClickHouse solutions.
Optimize PostgreSQL clusters for core applications.
Build monitoring and alerting tools to ensure resource optimization.
Respond to platform alerts and user emergencies.
Enhance infrastructure security and partner with compliance assessors.
Collaborate with engineering teams for service rollouts and architectural improvements.

🪄 Skills: PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted about 2 months ago

📍 Worldwide

🧭 Contract

🔍 Software Development

🏢 Company: Teravision Technologies

👥 251-500

💰 over 13 years ago

Android iOS Mobile Apps Information Technology Software

🔧 Requirements

Experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
Hands-on experience with AWS services such as S3, Route 53, and others.
Strong understanding of backend systems and infrastructure management.
Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
Prior experience in an on-call role.
Knowledge of monitoring and alerting tools to support on-call responsibilities.

💡 Responsibilities

Not stated

🪄 Skills: AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

🔥 Senior Site Reliability Engineer

Posted 2 months ago

📍 United States

🔍 Cybersecurity

🔧 Requirements

Must be a self-starter with a passion for cloud technology.
Strong problem-solving abilities are essential.
Experience in major public clouds and automation is required.

💡 Responsibilities

As a Senior Site Reliability Engineer within the Cloud Services group, you will be responsible for operating cutting-edge offerings from Cloud Service Providers.
You will directly support leading cloud software companies to enhance the reliability and scalability of their SaaS products.
This role entails problem-solving and ensuring seamless service to large enterprises and government agencies.

🪄 Skills: AWS, Docker, Python, Cloud Computing, Kubernetes, DevOps, Terraform

🔥 Senior Site Reliability Engineer (SRE) - Disaster Recovery Specialist (m/f/x)

Posted 4 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🔧 Requirements

Degree in Computer Science or related field
5+ years experience in site reliability engineering
Proficiency in AWS, Azure, or Google Cloud
Experience with IaC tools like Terraform or CloudFormation

💡 Responsibilities

Develop and document disaster recovery plans and procedures
Collaborate with teams to identify and mitigate risks
Monitor system performance and enhance reliability

🪄 Skills: AWS, Azure, Terraform
