Site Reliability Engineer

Posted about 2 months ago

💎 Seniority level: Middle, 4+ years

📍 Location: United States, Canada

🔍 Industry: Web3 Infrastructure

🏢 Company: Anagram 👥 11-50 💰 $9,100,000 Series A about 5 years ago · Internet, Eyewear, B2B, InsurTech, Enterprise Software, Insurance, Health Care, Software

🗣️ Languages: English

⏳ Experience: 4+ years

🪄 Skills: AWS, Python, Bash, Blockchain, GCP, Kubernetes, Azure, Go, Rust, Terraform

Requirements:

4+ years of experience in DevOps, SRE, or Backend Engineering roles.
Strong coding skills in Python, Go, Rust, or Bash for automation and tooling.
Experience with cloud-native environments (AWS, GCP, Azure) and container orchestration.

Responsibilities:

Design and implement scalable infrastructure using Terraform, Kubernetes, and containerized environments.
Develop monitoring, logging, and alerting solutions to maintain system health and minimize downtime.
Build and optimize CI/CD pipelines, ensuring efficient deployment of backend services and smart contracts.
Identify and resolve bottlenecks in distributed systems to improve scalability and efficiency.
Implement and enforce security policies, including key management, access controls, and network security.

Related Jobs

🔥 Site Reliability Engineer (CST or EST Remote)

Posted 1 day ago

📍 United States

🧭 Full-Time

💸 72,700 - 145,400 USD per year

🔍 Software Development

🏢 Company: careers

🔧 Requirements

3 years of experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer
Experience with AWS, Azure, or GCP cloud infrastructure
Experience with PHP and JavaScript/TypeScript
Bachelor's degree in business / information technology / computer science, or equivalent qualifications or experience.

💡 Responsibilities

Be the escalation point for problems and incidents for our Customer Support teams.
Triage problems and incidents, then either resolve them or escalate them to the technical specialists who can.
Internally communicate the status of problems and incidents.
Generate Root Cause Analysis (RCA) statements for internal and external use.
Champion corrective and preventative actions internally to ensure similar problems and incidents don't happen again.
Design, implement, maintain, and continuously improve our monitoring and alerting mechanisms.
Proactively explore and drive improvements to the overall quality and reliability of our software platform.
Measure and report on the overall quality of service of the software platform, including incidents, actions, and SLA metrics.

🪄 Skills: AWS, PHP, Cloud Computing, GCP, JavaScript, TypeScript, Azure, CI/CD, DevOps

🔥 Senior Site Reliability Engineer, Performance

Posted 5 days ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

🔧 Requirements

5+ years of Ruby/Rails experience
3+ years of AWS experience
Kubernetes experience
Experience with profiling and benchmarking source code
Effective at code review and at identifying potential performance problems before they reach production
Experience with Datadog or other APM tools
Excellent written and verbal communication skills

💡 Responsibilities

Proactively identify, triage, and resolve performance issues
Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
Optimize performance through instance configuration and monitoring
Collaborate with other SREs to proactively identify and address performance bottlenecks
Lead database capacity planning and upgrade initiatives
Manage the database-specific components of disaster recovery planning and execution
Oversee backup systems and pre-production databases
Create and maintain infrastructure and operations documentation
Participate in the on-call rotation

🪄 Skills: AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

🔥 Sr. Site Reliability Engineer - GovCloud (Remote)

Posted 6 days ago

📍 United States

🧭 Full-Time

💸 95,000 - 160,000 USD per year

🔍 Cybersecurity

🏢 Company: crowdstrikecareers

🔧 Requirements

5-7+ years of experience in Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure roles.
Experience managing Virtual Desktop Infrastructure (VDI) solutions such as Citrix, VMware Horizon, or AWS WorkSpaces.
Hands-on experience with AWS GovCloud (Azure/GCP is a plus).
Strong expertise in Infrastructure as Code (Terraform, CloudFormation).
Experience with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK, Datadog, Splunk).
Expertise in IAM and PAM solutions such as Okta, CyberArk, or AWS IAM.
Strong scripting and automation skills (Python, Bash, PowerShell).
Experience with CI/CD pipelines and DevOps workflows.
Familiarity with FedRAMP, NIST 800-53, DoD IL 4/5 compliance standards.
Hands-on experience with VDI management, performance tuning, and security hardening.

💡 Responsibilities

Architect, deploy, and maintain highly available, scalable, and secure systems in AWS GovCloud (Azure and GCP experience is a plus).
Automate infrastructure provisioning, scaling, and failover using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
Implement SLOs, SLIs, and error budgets to drive reliability improvements.
Optimize cloud infrastructure for performance, cost-efficiency, and resilience while adhering to compliance requirements.
Manage and optimize Virtual Desktop Infrastructure (VDI) solutions to ensure seamless user experience, performance, and security.
Deploy and manage monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, Datadog, Splunk, ELK).
Implement automated self-healing mechanisms and proactive monitoring solutions.
Lead incident response, postmortems, and root cause analysis (RCA) to prevent future system disruptions.
Ensure 24/7 system uptime through on-call rotation and escalation handling.
Implement Identity and Access Management (IAM) best practices, including SSO, MFA, and RBAC across cloud environments.
Automate IAM governance and Privileged Access Management (PAM) to enforce the principle of least privilege.
Ensure audit readiness by maintaining accurate security configurations, logs, and compliance reports.
Work with security teams to align IAM and Zero Trust Architecture (ZTA) strategies with organizational policies.
Develop and maintain CI/CD pipelines for automated deployments and configuration management.
Use Python, Bash, or PowerShell to automate routine SRE workflows and security compliance checks.
Implement immutable infrastructure and support DevSecOps best practices.
Manage and optimize VDI environments, ensuring seamless DevOps integration for development and operational teams.
Contribute to chaos engineering and failure injection testing to enhance system resiliency.
Work closely with DevOps, IT Security, and Compliance teams to ensure system integrity and uptime.
Provide mentorship to junior engineers and contribute to knowledge-sharing initiatives.
Participate in architectural discussions and help drive improvements in cloud reliability and security posture.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Grafana, Prometheus, CI/CD, Linux, DevOps, Terraform, Compliance, Ansible, Scripting

🔥 Site Reliability Engineer

Posted 6 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Platform.sh 👥 251-500 💰 $140,000,000 Series D almost 3 years ago · Internet, Open Source, PaaS, Cloud Management, Software

🔧 Requirements

DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer

Posted 6 days ago

📍 United States

💸 140,000 - 160,000 USD per year

🔧 Requirements

Expertise with multi-region deployments in public cloud environments
Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.)
Strong background in Reliability Engineering, DevOps, or Software Engineering
Fluency with at least one programming language, such as C#, Python, or Go
Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi)
Proficiency using source control such as Git
Ability to maintain discretion, handle sensitive information, and improve security best practices

💡 Responsibilities

Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
Architectural designs and engineering operations at scale
Active participation in code reviews, learning and spreading technical knowledge
Contribute to and mature incident management/escalation processes
Collaborate with cross-functional teams to refine priorities and deliverables
Ongoing engagement with product owners to align SLIs/SLOs/SLAs
Evaluate and identify opportunities for new initiatives to support organizational needs
Evolve and influence Bitwarden's SDLC as we scale
Provide mentorship to teammates

🪄 Skills: AWS, Docker, Python, Cloud Computing, Git, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Software Engineering, SaaS

🔥 Site Reliability Engineer

Posted 7 days ago

📍 France, Germany, Spain, the United Kingdom, West Coast in the United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Remote Woman

🔧 Requirements

A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability
Automate Deployments and Workflows
Optimize CI/CD Pipelines
Cloud Infrastructure Management
Incident Response and Post-Mortem
Collaborate with Cross-Functional Teams
Drive Technical Innovation

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer - Identity Platform

Posted 7 days ago

📍 USA

🧭 Full-Time

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page 👥 1000-5000

🔧 Requirements

5+ years of experience building, iterating upon, and maintaining corporate IAM systems
5+ years of experience with operational procedures and application development
Deep domain knowledge of prominent cloud identity provider(s): Okta, Duo, Google Workspace, Azure AD, Ping, etc.
Demonstrated success developing and implementing tooling that solves problems related to identity lifecycle and provisioning, SSO, MFA, ABAC, RBAC, directory services, zero trust networking, PAM, PIM, and secrets management
Experience configuring and implementing modern open source tooling such as Terraform, Ansible, Kubernetes, and Docker
Fluency in a modern programming language (Golang, Python, Ruby, Java, C#, etc.)
Strong experience using and managing AWS, GCP, Azure, or another cloud environment with IaC
Strong understanding of CI/CD workflows, automation frameworks, and best practices
Clear communication: demonstrated ability to explain technical concepts simply
Self-starter: possesses a continuous learning mindset
Demonstrate critical thinking under pressure

💡 Responsibilities

Engage in a dynamic role that combines traditional operations responsibilities and active contributions to the development and deployment of cloud-native applications, fostering a DevOps culture that emphasizes collaboration and automation
Partner across Coinbase to design, implement, and maintain performant, reliable, and secure system architectures
Provide corporate IAM and DevOps tooling subject matter expertise to adjacent IT, Security, and Engineering teams
Implement automation tooling and scripts to eliminate manual, repetitive tasks and reduce inefficiencies in system operations
Create comprehensive documentation and runbooks that detail system configurations, operational procedures, and troubleshooting steps across the system lifecycle
Build and maintain CI/CD pipelines for integrating changes and deploying to production in progressively tested environments
Deliver configurations and maintain state using configuration management tools
Facilitate incident response and conduct root cause analysis and blameless retrospectives
Define metrics and bolster monitoring/observability across corporate IAM systems
Participate in regular on-call rotation to ensure 24x7 uptime for critical systems

🪄 Skills: AWS, Docker, Python, Cloud Computing, Kubernetes, LDAP, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer, EU, UK or Americas

Posted 8 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros 👥 11-50 💰 $17,000,000 about 2 years ago · Cryptocurrency

🔧 Requirements

An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
Knowledge and experience in managing configuration at scale.
Experience with CI/CD pipelines and version control best practices.
Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
Strong knowledge of cloud security and IAM policies is required.
SIEM and threat management experience.
Must know how to secure Mac and Linux endpoints.
Python and Bash experience is a must.

💡 Responsibilities

Participate in on-call roster to support our trading operations.
Maintain and improve our global infrastructure with high performance and reliability requirements.
Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
Develop internal tools and automation to accomplish the team’s goals.
Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
Active participation in various trading and infrastructure projects.
Work closely with developers, traders and other staff to accomplish our firm’s goals.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

🔥 Site Reliability Engineer

Posted 11 days ago

📍 United States

🧭 Full-Time

💸 175,000 - 220,000 USD per year

🔍 Software Development

🏢 Company: Orca 👥 11-50 💰 $18,000,000 Series A over 3 years ago · Cryptocurrency, Blockchain, Online Portals, Information Technology

🔧 Requirements

Extensive experience with AWS services (e.g., ECS, Copilot, CloudWatch) and the ability to troubleshoot and optimize cloud-based systems.
Hands-on experience with tools like GitHub Actions for reliable and efficient deployment workflows.
Familiarity with tools like Datadog to build actionable monitoring and alerting systems.
Proficiency in infrastructure-as-code tools like Terraform, and containerization tools like Docker.
Experience with orchestrators like Kubernetes or Airflow is a plus.

💡 Responsibilities

Design, manage, and optimize AWS infrastructure with a focus on scalability, reliability, and cost efficiency.
Build and refine CI/CD processes using modern tools, ensuring seamless, secure, and efficient deployments.
Develop robust monitoring, logging, and alerting systems using tools like Datadog or Grafana to improve visibility and system performance.
Architect systems that handle growth effortlessly, minimize downtime, and maintain high performance.
Implement effective alerting mechanisms to prioritize and address critical issues proactively.
Optimize and document infrastructure processes, leveraging tools like Terraform, Docker, and Airflow to create scalable and maintainable systems.
Partner with engineering teams to design and refine infrastructure that powers features like real-time monitoring, automated transaction execution, and analytics.

🪄 Skills: AWS, Docker, PostgreSQL, Kubernetes, Airflow, Grafana, Rust, CI/CD, Linux, DevOps, Terraform

🔥 Senior Site Reliability Engineer

Posted 12 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 150,000 USD per year

🔍 Software Development

🏢 Company: Echo360 Inc

🔧 Requirements

5+ years of experience as a Site Reliability Engineer or similar role.
Strong understanding of AWS cloud services, including DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, EKS, ECS and EC2.
Experience with infrastructure automation tools like Ansible, Terraform, or CloudFormation.
Experience with monitoring and alerting tools like CloudWatch, Datadog, Prometheus, Grafana, Kibana, and PagerDuty.
Experience with GitHub Actions, CI/CD pipelines, and deployment strategies.
Strong problem-solving and analytical skills.
Excellent communication and collaboration skills.
Ability to work independently and take ownership of complex tasks.
Passion for technology and a desire to learn and grow.
Experience with Jenkins, PostgreSQL, and MongoDB.
Experience with cloud cost optimization, security best practices and tools.
Experience working in a fast-paced, agile environment.
Experience with Rancher, Cattleprod, and TeamCity is a plus.

💡 Responsibilities

Ensure service reliability and SLO/SLA adherence in production, preventing incidents by proactively conducting failure testing.
Implement automated monitoring and alerting systems for early detection of potential problems.
Collaborate with development teams to perform deployments and rollbacks with minimal disruption.
Optimize the performance and scalability of our AWS infrastructure, including RDS, DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, SES, EKS, ECS, and EC2.
Automate infrastructure provisioning and deployment processes using Terraform, CI/CD pipelines, and configuration management tools.
Proactively identify and address potential security vulnerabilities to maintain compliance and uphold best practices for IAM and secrets management.
Participate in incident response and post-mortem analysis activities to identify root causes and prevent future occurrences.
Help onboard and mentor junior team members, sharing your knowledge and expertise.
Stay up to date on the latest cloud technologies and best practices for SRE.
Participate in a well-structured on-call rotation with other Site Reliability Engineers.
Explore new technologies and innovative solutions to improve service quality and speed to market.
Participate in technical discussions and deep dives with the other engineering and product teams.

🪄 Skills: AWS, PostgreSQL, DynamoDB, Jenkins, Kafka, Kibana, MongoDB, MySQL, Grafana, Prometheus, CI/CD, Agile methodologies, Linux, DevOps, Terraform, Microservices, Ansible
