Senior Site Reliability Engineer

Posted 3 months ago

💎 Seniority level: Senior

📍 Location: United States

🔍 Industry: Cybersecurity

🪄 Skills: AWS, Docker, Python, Cloud Computing, Kubernetes, DevOps, Terraform

Requirements:

Must be a self-starter with a passion for cloud technology.
Strong problem-solving abilities are essential.
Experience in major public clouds and automation is required.

Responsibilities:

As a Senior Site Reliability Engineer within the Cloud Services group, you will be responsible for operating cutting-edge offerings from Cloud Service Providers.
You will directly support leading cloud software companies to enhance the reliability and scalability of their SaaS products.
This role entails problem-solving and ensuring seamless service to large enterprises and government agencies.

Related Jobs

🔥 Senior Site Reliability Engineer, Performance

Posted 7 days ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

🔧 Requirements

5+ years of Ruby/Rails experience
3+ years of AWS experience
Kubernetes experience
Experience with profiling and benchmarking source code
Effective at code review and at identifying potential performance problems before they reach production
Experience with Datadog or other APM tools
Excellent written and verbal communication skills

💡 Responsibilities

Proactively identify, triage, and resolve performance issues
Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
Optimize performance through instance configuration and monitoring
Collaborate with other SREs to proactively identify and address performance bottlenecks
Lead database capacity planning and upgrade initiatives
Manage the database-specific components of disaster recovery planning and execution
Oversee backup systems and pre-production databases
Create and maintain infrastructure and operations documentation
Participate in the on-call rotation

🪄 Skills: AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

🔥 Senior Site Reliability Engineer - Identity Platform

Posted 9 days ago

📍 USA

🧭 Full-Time

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase 👥 1000-5000

🔧 Requirements

5+ years of experience building, iterating upon, and maintaining corporate IAM systems
5+ years of experience with operational procedures and application development
Deep domain knowledge of prominent cloud identity providers: Okta, Duo, Google Workspace, Azure AD, Ping, etc.
Demonstrated success developing and implementing tooling that solves problems related to identity lifecycle and provisioning, SSO, MFA, ABAC, RBAC, directory services, zero trust networking, PAM, PIM, and secrets management
Experience configuring and implementing modern open source tooling such as Terraform, Ansible, Kubernetes, and Docker
Fluency in a modern programming language (Golang, Python, Ruby, Java, C#, etc.)
Strong experience using and managing AWS, GCP, Azure, or another cloud environment with IaC
Strong understanding of CI/CD workflows, automation frameworks, and best practices
Clear communication: demonstrated ability to explain technical concepts simply
Self-starter: possesses a continuous learning mindset
Demonstrate critical thinking under pressure

💡 Responsibilities

Engage in a dynamic role that combines traditional operations responsibilities and active contributions to the development and deployment of cloud-native applications, fostering a DevOps culture that emphasizes collaboration and automation
Partner across Coinbase to design, implement, and maintain performant, reliable, and secure system architectures
Provide corporate IAM and DevOps tooling subject matter expertise to adjacent IT, Security, and Engineering teams
Implement automation tooling and scripts to eliminate manual, repetitive tasks and reduce inefficiencies in system operations
Create comprehensive documentation and runbooks that detail system configurations, operational procedures, and troubleshooting steps across the system lifecycle
Build and maintain CI/CD pipelines for integrating changes and deploying to production in progressively tested environments
Deliver configurations and maintain state using configuration management tools
Facilitate incident response and conduct root cause analysis and blameless retrospectives
Define metrics and bolster monitoring/observability across corporate IAM systems
Participate in a regular on-call rotation to ensure 24x7 uptime for critical systems

🪄 Skills: AWS, Docker, Python, Cloud Computing, Kubernetes, LDAP, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer, EU, UK or Americas

Posted 10 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros 👥 11-50 💰 $17,000,000 about 2 years ago (Cryptocurrency)

🔧 Requirements

An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
Knowledge and experience in managing configuration at scale.
Experience with CI/CD pipelines and version control best practices.
Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
Strong knowledge of cloud security and IAM policies is required.
SIEM and threat management experience.
Must know how to secure Mac and Linux endpoints.
Python and Bash experience is a must.

💡 Responsibilities

Participate in the on-call roster to support our trading operations.
Maintain and improve our global infrastructure with high performance and reliability requirements.
Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
Develop internal tools and automation to accomplish the team’s goals.
Tune and troubleshoot applications; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
Participate actively in various trading and infrastructure projects.
Work closely with developers, traders and other staff to accomplish our firm’s goals.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted 14 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 150,000 USD per year

🔍 Software Development

🏢 Company: Echo360 Inc

🔧 Requirements

5+ years of experience as a Site Reliability Engineer or similar role.
Strong understanding of AWS cloud services, including DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, EKS, ECS and EC2.
Experience with infrastructure automation tools like Ansible, Terraform, or CloudFormation.
Experience with monitoring and alerting tools like CloudWatch, DataDog, Prometheus, Grafana, Kibana, and PagerDuty.
Experience with GitHub Actions, CI/CD pipelines, and deployment strategies.
Strong problem-solving and analytical skills.
Excellent communication and collaboration skills.
Ability to work independently and take ownership of complex tasks.
Passion for technology and a desire to learn and grow.
Experience with Jenkins, PostgreSQL, and MongoDB.
Experience with cloud cost optimization, security best practices and tools.
Experience working in a fast-paced, agile environment.
Experience with Rancher, Cattleprod, and TeamCity is a plus.

💡 Responsibilities

Ensure service reliability and SLO/SLA adherence in production, preventing incidents by proactively conducting failure testing.
Implement automated monitoring and alerting systems for early detection of potential problems.
Collaborate with development teams to perform deployments and rollbacks with minimal disruption.
Optimize the performance and scalability of our AWS infrastructure, including RDS, DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, SES, EKS, ECS, and EC2.
Automate infrastructure provisioning and deployment processes using Terraform, CI/CD pipelines, and configuration management tools.
Proactively identify and address potential security vulnerabilities to maintain compliance, IAM best practices, and secrets management.
Participate in incident response and post-mortem analysis activities to identify root causes and prevent future occurrences.
Help onboard and mentor junior team members, sharing your knowledge and expertise.
Stay up to date on the latest cloud technologies and best practices for SRE.
Participate in a well-structured on-call rotation with other Site Reliability Engineers.
Explore new technologies and innovative solutions to improve service quality and speed to market.
Participate in technical discussions and deep dives with the other engineering and product teams.

🪄 Skills: AWS, PostgreSQL, DynamoDB, Jenkins, Kafka, Kibana, MongoDB, MySQL, Grafana, Prometheus, CI/CD, Agile methodologies, Linux, DevOps, Terraform, Microservices, Ansible

🔥 Senior Site Reliability Engineer

Posted 21 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune 👥 101-250

🔧 Requirements

Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
Experience with infrastructure-as-code and orchestration tools.
Strong understanding of system performance, debugging, and optimization across diverse environments.
Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
Solid foundation in computer science fundamentals and system design.
Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
Experience with distributed systems and managing large-scale, high-availability environments.
Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
Experience with bare-metal infrastructure.
Proficiency in scripting or programming languages such as Python, Go, or Bash.
Experience with monitoring and observability tools for infrastructure performance.
Familiarity with cloud cost management and performance improvement strategies.
Strong analytical and troubleshooting skills.
Experience working across multiple time zones.

💡 Responsibilities

Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

🪄 Skills: Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

🔥 Senior Site Reliability Engineer - Midnight

Posted 22 days ago

📍 United States

🧭 Full-Time

🔍 Blockchain

🏢 Company: IO Global

🔧 Requirements

7+ years of experience in SRE, DevOps, or a related role.
Understanding of SRE best practices, architectures, and methods.
Good knowledge of resiliency patterns and cloud security.
Strong programming proficiency in Python, Golang, or JavaScript.
Rust experience is advantageous.
Demonstrated experience with AWS and modern cloud architectures.
Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD.
Hands-on experience with Kubernetes/EKS and GitOps methodologies.
Proven track record with monitoring tools such as Prometheus and OpenTelemetry, as well as familiarity with the LGTM stack or comparable tools.
Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
Ability to engage in technical discussions and be part of the decision-making process.
Strong problem-solving skills and capability to work on complex systems.
Experience working within an Agile environment.
Experience working with a distributed team.
Strong communication and collaboration abilities to work seamlessly across different teams.
A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.

💡 Responsibilities

Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
Leverage GitOps principles to automate deployments and manage container orchestration.
Implement and manage CI/CD pipelines ensuring seamless, high-quality deployments, finding and removing bottlenecks, improving performance and working alongside teams to refine feedback loops and automate toil away.
Develop automation tools and scripts to improve operational efficiency.
Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
Collaborate with dev teams to define and implement SLOs/SLIs.
Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
Evaluate and adopt new technologies, with a special advantage for candidates with blockchain experience, to keep our systems at the cutting edge.
Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
Balance effective delivery of goals with a measurably high standard for them. Always apply a layer of polish and due diligence when delivering.

🪄 Skills: AWS, Docker, Python, Agile, Blockchain, Cloud Computing, JavaScript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 USA

🧭 Full-Time

💸 160,000 - 220,000 USD per year

🔍 Healthcare

🏢 Company: Clover Health 👥 501-1000 💰 $300,000,000 Post-IPO Equity over 3 years ago 🫂 Last layoff almost 2 years ago (Medical, Health Insurance, Hospital, Health Care)

🔧 Requirements

5+ years of programming experience and proficiency in at least one of the following languages: Python, Go, or shell scripting.
In-depth knowledge of containerization technologies and orchestration, such as Docker, containerd, and Kubernetes, along with experience with CNCF-based technologies like Helm, gRPC, and Prometheus.
Experience with public cloud platforms such as GCP, Azure, or AWS.
Knowledgeable in networking fundamentals, including TCP/IP, UDP, firewalls, routing, DNS, and load balancing.
Experience with Linux system administration and a solid understanding of Linux design principles.
Understanding of key SRE concepts, such as monitoring, performance tuning, and automation.
Ability to work autonomously with limited guidance, proactively identifying and solving problems.
Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams and adapt to new challenges and evolving technologies.

💡 Responsibilities

Build systems for declarative application and infrastructure lifecycle management, including continuous deployment, continuous integration, Kubernetes cluster management, and service/workload inventory.
Prioritize and troubleshoot infrastructure issues, minimizing downtime and responding to alerts efficiently.
Contribute to setting the direction of the Site Reliability Engineering (SRE) team, ensuring goals align with Counterpart Health’s company-wide objectives.
Foster a collaborative, high-performance culture that promotes motivation, innovation, and cross-disciplinary teamwork.
Streamline and automate infrastructure processes, including delivery pipelines and database changes.

🪄 Skills: AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Go, gRPC, Prometheus, CI/CD, Linux, Terraform, Networking, Ansible

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert 👥 11-50 💰 $20,149,993 Seed 9 months ago (Data Management, SaaS, Application Performance Management)

🔧 Requirements

Not stated

💡 Responsibilities

Design, build, and maintain scalable and secure cloud infrastructure as code
Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
Enable cost transparency and optimize infrastructure spending

🪄 Skills: AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted about 2 months ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured (Cloud Data Services, B2B, Cloud Security, Cyber Security)

🔧 Requirements

Experience in a start-up environment
Experience designing and maintaining highly available database solutions, ideally PostgreSQL
Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
Strong engineering background
Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)

💡 Responsibilities

Provision infrastructure and tooling
Create automated tooling to maintain the platform
Build methods for monitoring and scaling services
Implement security compliance strategies
Lead and mentor the engineering team

🪄 Skills: AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

🔥 Senior Site Reliability Engineer

Posted about 2 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Vantage 👥 1001-5000 (Cryptocurrency, Financial Services, FinTech, Trading Platform)

🔧 Requirements

5 years of experience as a Site Reliability Engineer or DevOps Engineer, working with software and infrastructure.
Experience in one or more of the following: Python, JavaScript, Ruby, Groovy, PHP, or Bash.
Experience in one of the cloud platforms: Azure, AWS, or GCP.

💡 Responsibilities

Collaborate with a diverse team of software engineers, engaging in iterative processes and effective task planning to drive our projects forward.
Take ownership of the end-to-end availability and performance of our services, proactively identifying potential issues, and implementing automation to prevent the recurrence of problems.
Participate in an on-call rotation, ensuring our systems remain stable and responsive even during off-hours.
Foster collaboration with other engineering teams, promoting the reuse of existing frameworks and gaining insights into their operation.
Lead the development, implementation, and achievement of service-level objectives that are instrumental in maintaining product reliability.
Collaborate with software engineering teams to design, implement, and maintain CI/CD pipelines, enabling rapid and reliable software releases.
Automate and optimize our infrastructure provisioning, configuration, and management processes using industry-standard tools and best practices.
Implement and manage containerization and orchestration technologies to enhance scalability and resource utilization.
Maintain and enhance version control systems and repositories for codebase management.
Steer and drive the SRE / DevOps roadmap, assuming full ownership while actively engaging in negotiation and strategic planning to ensure its successful execution.
Stay current with industry trends, emerging technologies, and best practices in SRE, DevOps, and automation.

🪄 Skills: AWS, Python, SQL, Bash, Cloud Computing, GCP, Kubernetes, Snowflake, Azure, CI/CD, RESTful APIs, DevOps, Terraform, Troubleshooting, Scripting, Debugging
