Site Reliability Engineer

Posted 19 days ago

💎 Seniority level: Senior, 5+ years

📍 Location: AMER, EMEA, APAC

🔍 Industry: Blockchain

🏢 Company: asymmetric.re

⏳ Experience: 5+ years

🪄 Skills: AWS, Docker, Python, Blockchain, Cloud Computing, Kubernetes, Rust, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Networking, Troubleshooting, Ansible, Scripting

Requirements:

Excellent experience managing Linux and network infrastructure.
Experience with load balancers and other high-availability technologies (e.g., HAProxy, ALB/ELB).
Prior experience with configuration management tooling (e.g., Ansible, Chef, Puppet, SaltStack).
Excellent troubleshooting fundamentals on both hardware and software.
Development experience in Golang, Python, or Rust.
Experience with continuous integration pipelines and automated deployments
Experience with OSS monitoring tools (e.g., Grafana, Loki, Prometheus, Alertmanager).

Responsibilities:

Manage a globally distributed fleet of blockchain infrastructure services
Deploy infrastructure as code to dev, staging, and production environments.
Work in a globally distributed, high-performing team to deliver mission-critical services to the financial sector.
Design, architect, deploy, and manage blockchain infrastructure services.
Adhere to the highest standards of integrity, trust, and professionalism.

Related Jobs

🔥 Senior Site Reliability Engineer, Performance

Posted 3 days ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

🔧 Requirements

5+ years of Ruby/Rails experience
3+ years of AWS experience
Kubernetes experience
Experience with profiling and benchmarking source code
Effective at code review and at identifying potential performance problems before they reach production
Experience with Datadog or other APM tools
Excellent written and verbal communication skills

💡 Responsibilities

Proactively identify, triage, and resolve performance issues
Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
Optimize performance through instance configuration and monitoring
Collaborate with other SREs to proactively identify and address performance bottlenecks
Lead database capacity planning and upgrade initiatives
Manage the database-specific components of disaster recovery planning and execution
Oversee backup systems and pre-production databases
Create and maintain infrastructure and operations documentation
Participate in the on-call rotation

🪄 Skills: AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

🔥 Sr. Site Reliability Engineer - GovCloud (Remote)

Posted 4 days ago

📍 United States

🧭 Full-Time

💸 95,000 - 160,000 USD per year

🔍 Cybersecurity

🏢 Company: crowdstrikecareers

🔧 Requirements

5-7+ years of experience in Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure roles.
Experience managing Virtual Desktop Infrastructure (VDI) solutions such as Citrix, VMware Horizon, or AWS WorkSpaces.
Hands-on experience with AWS GovCloud (Azure/GCP is a plus).
Strong expertise in Infrastructure as Code (Terraform, CloudFormation).
Experience with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK, Datadog, Splunk).
Expertise in IAM and PAM solutions such as Okta, CyberArk, or AWS IAM.
Strong scripting and automation skills (Python, Bash, PowerShell).
Experience with CI/CD pipelines and DevOps workflows.
Familiarity with FedRAMP, NIST 800-53, DoD IL 4/5 compliance standards.
Hands-on experience with VDI management, performance tuning, and security hardening.

💡 Responsibilities

Architect, deploy, and maintain highly available, scalable, and secure systems in AWS GovCloud (Azure and GCP experience is a plus).
Automate infrastructure provisioning, scaling, and failover using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
Implement SLOs, SLIs, and error budgets to drive reliability improvements.
Optimize cloud infrastructure for performance, cost-efficiency, and resilience while adhering to compliance requirements.
Manage and optimize Virtual Desktop Infrastructure (VDI) solutions to ensure seamless user experience, performance, and security.
Deploy and manage monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, Datadog, Splunk, ELK).
Implement automated self-healing mechanisms and proactive monitoring solutions.
Lead incident response, postmortems, and root cause analysis (RCA) to prevent future system disruptions.
Ensure 24/7 system uptime through on-call rotation and escalation handling.
Implement Identity and Access Management (IAM) best practices, including SSO, MFA, and RBAC across cloud environments.
Automate IAM governance and Privileged Access Management (PAM) to enforce the principle of least privilege.
Ensure audit readiness by maintaining accurate security configurations, logs, and compliance reports.
Work with security teams to align IAM and Zero Trust Architecture (ZTA) strategies with organizational policies.
Develop and maintain CI/CD pipelines for automated deployments and configuration management.
Use Python, Bash, or PowerShell to automate routine SRE workflows and security compliance checks.
Implement immutable infrastructure and support DevSecOps best practices.
Manage and optimize VDI environments, ensuring seamless DevOps integration for development and operational teams.
Contribute to chaos engineering and failure injection testing to enhance system resiliency.
Work closely with DevOps, IT Security, and Compliance teams to ensure system integrity and uptime.
Provide mentorship to junior engineers and contribute to knowledge-sharing initiatives.
Participate in architectural discussions and help drive improvements in cloud reliability and security posture.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Grafana, Prometheus, CI/CD, Linux, DevOps, Terraform, Compliance, Ansible, Scripting

🔥 Site Reliability Engineer IOE: Cardano

Posted 5 days ago

📍 United Kingdom

🔍 Blockchain

🏢 Company: IO Global

🔧 Requirements

Proficiency in Python, Bash, Terraform, and Nix for DevOps services.
Extensive experience with AWS, specifically with services like EKS and RDS.
Familiarity with container orchestration (e.g., Kubernetes) is essential.
Hands-on experience with PostgreSQL and its deployment on RDS.
Knowledge of monitoring tools (e.g., Prometheus, Grafana, Loki).
Solid troubleshooting and performance tuning capabilities.
Exceptional communication skills and team collaboration ethic.
Experience with CI/CD (e.g., GitHub Actions, Hydra, Earthly).

💡 Responsibilities

Design, write, and deliver tools and software primarily using Python, Bash, Terraform or Nix to improve the availability, scalability, and efficiency of our services.
Engage in and refine the whole lifecycle of services, from inception and design, through deployment, operation, and continuous improvement.
Practice sustainable incident response and promote blameless postmortems.
Collaborate with the development teams to ensure that solutions are designed with customer experience, scalability, and performance in mind.
Analyze system performance and reliability, offering recommendations for enhancement.
Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services.
Participate in on-call rotations, responding to and mitigating service interruptions and technical challenges.

🪄 Skills: AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, Grafana, Prometheus, CI/CD, DevOps, Terraform

🔥 Site Reliability Engineer

Posted 5 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Platform.sh | 👥 251-500 | 💰 $140,000,000 Series D almost 3 years ago | Internet, Open Source, PaaS, Cloud Management, Software

🔧 Requirements

DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer

Posted 5 days ago

📍 United States

💸 140,000 - 160,000 USD per year

🔧 Requirements

Expertise with multi-region deployments in public cloud environments
Demonstrable production Kubernetes experience (managed Kubernetes, Helm, kubectl, kOps, etc.)
Strong background in Reliability Engineering, DevOps, Software Engineering
Fluency with at least one programming language, such as C#, Python, or Go
Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi)
Proficiency using source control such as Git
Ability to maintain discretion, handle sensitive information, and improve security best practices

💡 Responsibilities

Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
Architectural designs and engineering operations at scale
Active participation in code reviews, learning and spreading technical knowledge
Contribute to and mature incident management/escalation processes
Collaborate with cross-functional teams to refine priorities and deliverables
Ongoing engagement with product owners to align SLI/SLOs/SLAs
Evaluate and identify opportunities for new initiatives to support organizational needs
Evolve and influence Bitwarden's SDLC as we scale
Provide mentorship to teammates

🪄 Skills: AWS, Docker, Python, Cloud Computing, Git, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Software Engineering, SaaS

🔥 Site Reliability Engineer

Posted 5 days ago

📍 France, Germany, Spain, the United Kingdom, West Coast in the United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Remote Woman

🔧 Requirements

A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability
Automate Deployments and Workflows
Optimize CI/CD Pipelines
Cloud Infrastructure Management
Incident Response and Post-Mortem
Collaborate with Cross-Functional Teams
Drive Technical Innovation

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer - Identity Platform

Posted 6 days ago

📍 USA

🧭 Full-Time

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page | 👥 1000-5000

🔧 Requirements

5+ years of experience building, iterating upon, and maintaining corporate IAM systems
5+ years of experience with operational procedures and application development
Deep domain knowledge of prominent cloud identity provider(s): Okta, Duo, Google Workspace, Azure AD, Ping, etc.
Demonstrated success developing and implementing tooling that solves problems related to: identity lifecycle and provisioning, SSO, MFA, ABAC, RBAC, directory services, zero trust networking, PAM, PIM, and secrets management
Experience configuring and implementing modern open source tooling such as: Terraform, Ansible, Kubernetes, Docker
Fluency in a modern programming language (Golang, Python, Ruby, Java, C#, etc.)
Strong experience using and managing AWS, GCP, Azure, or other cloud environment with IaC
Strong understanding of CI/CD workflows, automation frameworks, and best practices
Clear communication: demonstrated ability to explain technical concepts simply
Self-starter: possesses a continuous learning mindset
Demonstrate critical thinking under pressure

💡 Responsibilities

Engage in a dynamic role that combines traditional operations responsibilities and active contributions to the development and deployment of cloud-native applications, fostering a DevOps culture that emphasizes collaboration and automation
Partner across Coinbase to design, implement, and maintain performant, reliable, and secure system architectures
Provide corporate IAM and DevOps tooling subject matter expertise to adjacent IT, Security, and Engineering teams
Implement automation tooling and scripts to eliminate manual, repetitive tasks and reduce inefficiencies in system operations
Create comprehensive documentation and runbooks that detail system configurations, operational procedures, and troubleshooting steps across system lifecycle
Build and maintain CI/CD pipelines for integrating changes and deploying to production in progressively tested environments
Deliver configurations and maintain state using configuration management tools
Facilitate incident response and conduct root cause analysis and blameless retrospectives
Define metrics and bolster monitoring/observability across corporate IAM systems
Participate in regular on-call rotation to ensure 24x7 uptime for critical systems

🪄 Skills: AWS, Docker, Python, Cloud Computing, Kubernetes, LDAP, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Site Reliability Engineer

Posted 6 days ago

📍 France

🏢 Company: Sinch | 👥 1001-5000 | 💰 $48,845,918 Post-IPO Debt 7 months ago | Messaging, SaaS, Telecommunications, Mobile, Software

🔧 Requirements

Background in infrastructure, operations, or software engineering.
Experience with cloud providers such as GCP.
Proficiency in configuration management tools such as Terraform and Ansible.
Hands-on proficiency with modern monitoring tools like Prometheus and Grafana.
Experience with distributed data stores such as Cassandra, PostgreSQL, and ElasticSearch.
Experience with Python and Bash is beneficial.
Strong technical skills across various infrastructure technologies.
Strong communication skills.
Experience operating and maintaining production systems in a Linux and public cloud environment.

💡 Responsibilities

Partner with product engineering teams to identify system requirements.
Build and support our cloud-based infrastructure.
Automate routine processes and remediation tasks.
Develop, monitor and track Service Level Objectives (SLOs) for the systems under management.
Proactively troubleshoot, resolve, and plan for issues that typically come from support staff, other engineering teams, and our automated monitoring system.
Ensure our datastores are healthy and operate at optimal performance levels.
Contribute to the growth and culture of our engineering team.

🪄 Skills: Docker, PostgreSQL, Python, Bash, ElasticSearch, GCP, Kubernetes, Cassandra, Grafana, Prometheus, Linux, Terraform, Ansible

🔥 Senior Site Reliability Engineer, EU, UK or Americas

Posted 6 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros | 👥 11-50 | 💰 $17,000,000 about 2 years ago | Cryptocurrency

🔧 Requirements

An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
Knowledge and experience in managing configuration at scale.
Experience with CI/CD pipelines and version control best practices.
Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
Strong knowledge of cloud security and IAM policies is required.
SIEM and threat management experience.
Must know how to secure Mac and Linux endpoints.
Python and Bash experience is a must.

💡 Responsibilities

Participate in on-call roster to support our trading operations.
Maintain and improve our global infrastructure with high performance and reliability requirements.
Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
Development of internal tools and automation to accomplish the team’s goals.
Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
Active participation in various trading and infrastructure projects.
Work closely with developers, traders and other staff to accomplish our firm’s goals.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

🔥 Site Reliability Engineer

Posted 10 days ago

📍 United States

🧭 Full-Time

💸 175,000 - 220,000 USDC per year

🔍 Software Development

🏢 Company: Orca | 👥 11-50 | 💰 $18,000,000 Series A over 3 years ago | Cryptocurrency, Blockchain, Online Portals, Information Technology

🔧 Requirements

A strong track record of working on high-performance, scalable systems with expertise in release engineering, infrastructure, and operations.
Extensive experience with AWS services (e.g., ECS, Copilot, CloudWatch) and the ability to troubleshoot and optimize cloud-based systems.
Hands-on experience with tools like GitHub Actions for reliable and efficient deployment workflows.
Familiarity with tools like Datadog to build actionable monitoring and alerting systems.
Proficiency in infrastructure-as-code tools like Terraform, and containerization tools like Docker. Experience with orchestrators like Kubernetes or Airflow is a plus.
Comfortable working independently in an async environment while collaborating effectively with a team. You understand trade-offs and advocate for pragmatic solutions.
Familiarity with Decentralized Finance (DeFi) concepts, AMMs, and the Solana ecosystem is a plus but not required.

💡 Responsibilities

Design, manage, and optimize AWS infrastructure with a focus on scalability, reliability, and cost efficiency.
Triage and resolve critical infrastructure issues proactively.
Build and refine CI/CD processes using modern tools, ensuring seamless, secure, and efficient deployments.
Develop robust monitoring, logging, and alerting systems using tools like Datadog or Grafana to improve visibility and system performance.
Architect systems that handle growth effortlessly, minimize downtime, and maintain high performance.
Implement effective alerting mechanisms to prioritize and address critical issues proactively.
Optimize and document infrastructure processes, leveraging tools like Terraform, Docker, and Airflow to create scalable and maintainable systems.
Partner with engineering teams to design and refine infrastructure that powers features like real-time monitoring, automated transaction execution, and analytics.

🪄 Skills: AWS, Docker, PostgreSQL, Kubernetes, Airflow, Grafana, Rust, CI/CD, Linux, DevOps, Terraform
