Site Reliability Engineer

Posted 5 days ago

📍 Location: France, Germany, Spain, the United Kingdom, West Coast in the United States, Canada

🔍 Industry: Software Development

🏢 Company: Remote Woman

🗣️ Languages: English

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, Linux, DevOps, Terraform, Ansible, Scripting

Requirements:
  • A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
  • Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
  • Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
  • Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
  • Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
  • Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
  • Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.
Responsibilities:
  • Refine Monitoring and Observability
  • Automate Deployments and Workflows
  • Optimize CI/CD Pipelines
  • Cloud Infrastructure Management
  • Incident Response and Post-Mortem
  • Collaborate with Cross-Functional Teams
  • Drive Technical Innovation

Related Jobs

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

  • 5+ years of Ruby/Rails experience
  • 3+ years of AWS experience
  • Kubernetes experience
  • Experience with profiling and benchmarking source code
  • Effective at code review and at identifying potential performance problems before they reach production
  • Experience with Datadog or other APM tools
  • Excellent written and verbal communication skills
  • Proactively identify, triage, and resolve performance issues
  • Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
  • Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
  • Optimize performance through instance configuration and monitoring
  • Collaborate with other SREs to proactively identify and address performance bottlenecks
  • Lead database capacity planning and upgrade initiatives
  • Manage the database-specific components of disaster recovery planning and execution
  • Oversee backup systems and pre-production databases
  • Create and maintain infrastructure and operations documentation
  • Participate in the on-call rotation

🪄 Skills: AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

Posted 3 days ago

📍 United States

🧭 Full-Time

💸 95,000 - 160,000 USD per year

🔍 Cybersecurity

🏢 Company: crowdstrikecareers

  • 5-7+ years of experience in Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure roles.
  • Experience managing Virtual Desktop Infrastructure (VDI) solutions such as Citrix, VMware Horizon, or AWS WorkSpaces.
  • Hands-on experience with AWS GovCloud (Azure/GCP is a plus).
  • Strong expertise in Infrastructure as Code (Terraform, CloudFormation).
  • Experience with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK, Datadog, Splunk).
  • Expertise in IAM and PAM solutions such as Okta, CyberArk, or AWS IAM.
  • Strong scripting and automation skills (Python, Bash, PowerShell).
  • Experience with CI/CD pipelines and DevOps workflows.
  • Familiarity with FedRAMP, NIST 800-53, DoD IL 4/5 compliance standards.
  • Hands-on experience with VDI management, performance tuning, and security hardening.
  • Architect, deploy, and maintain highly available, scalable, and secure systems in AWS GovCloud (Azure and GCP experience is a plus).
  • Automate infrastructure provisioning, scaling, and failover using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
  • Implement SLOs, SLIs, and error budgets to drive reliability improvements.
  • Optimize cloud infrastructure for performance, cost-efficiency, and resilience while adhering to compliance requirements.
  • Manage and optimize Virtual Desktop Infrastructure (VDI) solutions to ensure seamless user experience, performance, and security.
  • Deploy and manage monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, Datadog, Splunk, ELK).
  • Implement automated self-healing mechanisms and proactive monitoring solutions.
  • Lead incident response, postmortems, and root cause analysis (RCA) to prevent future system disruptions.
  • Ensure 24/7 system uptime through on-call rotation and escalation handling.
  • Implement Identity and Access Management (IAM) best practices, including SSO, MFA, and RBAC across cloud environments.
  • Automate IAM governance and Privileged Access Management (PAM) to enforce the principle of least privilege.
  • Ensure audit readiness by maintaining accurate security configurations, logs, and compliance reports.
  • Work with security teams to align IAM and Zero Trust Architecture (ZTA) strategies with organizational policies.
  • Develop and maintain CI/CD pipelines for automated deployments and configuration management.
  • Use Python, Bash, or PowerShell to automate routine SRE workflows and security compliance checks.
  • Implement immutable infrastructure and support DevSecOps best practices.
  • Manage and optimize VDI environments, ensuring seamless DevOps integration for development and operational teams.
  • Contribute to chaos engineering and failure injection testing to enhance system resiliency.
  • Work closely with DevOps, IT Security, and Compliance teams to ensure system integrity and uptime.
  • Provide mentorship to junior engineers and contribute to knowledge-sharing initiatives.
  • Participate in architectural discussions and help drive improvements in cloud reliability and security posture.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Grafana, Prometheus, CI/CD, Linux, DevOps, Terraform, Compliance, Ansible, Scripting

Posted 4 days ago

📍 United Kingdom

🔍 Blockchain

🏢 Company: IO Global

  • Proficiency in Python, Bash, Terraform, and Nix for DevOps services.
  • Extensive experience with AWS, specifically with services like EKS and RDS.
  • Familiarity with container orchestration (e.g., Kubernetes) is essential.
  • Hands-on experience with PostgreSQL and its deployment on RDS.
  • Knowledge of monitoring tools (e.g., Prometheus, Grafana, Loki).
  • Solid troubleshooting and performance tuning capabilities.
  • Exceptional communication skills and a collaborative team ethic.
  • Experience with CI/CD (e.g., GitHub Actions, Hydra, Earthly).
  • Design, write, and deliver tools and software primarily using Python, Bash, Terraform or Nix to improve the availability, scalability, and efficiency of our services.
  • Engage in and refine the whole lifecycle of services, from inception and design, through deployment, operation, and continuous improvement.
  • Practice sustainable incident response and promote blameless postmortems.
  • Collaborate with the development teams to ensure that solutions are designed with customer experience, scalability, and performance in mind.
  • Analyze system performance and reliability, offering recommendations for enhancement.
  • Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services.
  • Participate in on-call rotations, responding to and mitigating service interruptions and technical challenges.

🪄 Skills: AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, Grafana, Prometheus, CI/CD, DevOps, Terraform

Posted 5 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Platform.sh | 👥 251-500 | 💰 $140,000,000 Series D almost 3 years ago | Internet, Open Source, PaaS, Cloud Management, Software

  • DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
  • Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
  • Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
  • Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
  • Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
  • Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
  • Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.
  • Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
  • Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
  • Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
  • Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
  • Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
  • Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
  • Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted 5 days ago

📍 United States

💸 140,000 - 160,000 USD per year

  • Expertise with multi-region deployments in public cloud environments
  • Demonstrable production Kubernetes experience (managed Kubernetes, Helm, kubectl, kOps, etc.)
  • Strong background in Reliability Engineering, DevOps, Software Engineering
  • Fluency with at least one programming language, such as C#, Python, or Go
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi)
  • Proficiency using source control such as Git
  • Ability to maintain discretion, handle sensitive information, and improve security best practices
  • Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
  • Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
  • Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
  • Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
  • Architectural designs and engineering operations at scale
  • Active participation in code reviews, learning and spreading technical knowledge
  • Contribute to and mature incident management/escalation processes
  • Collaborate with cross-functional teams to refine priorities and deliverables
  • Ongoing engagement with product owners to align SLIs, SLOs, and SLAs
  • Evaluate and identify opportunities for new initiatives to support organizational needs
  • Evolve and influence Bitwarden's SDLC as we scale
  • Provide mentorship to teammates

🪄 Skills: AWS, Docker, Python, Cloud Computing, Git, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Software Engineering, SaaS

Posted 5 days ago

📍 USA

🧭 Full-Time

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page | 👥 1000-5000

  • 5+ years of experience building, iterating upon, and maintaining corporate IAM systems
  • 5+ years of experience with operational procedures and application development
  • Deep domain knowledge of prominent cloud identity providers: Okta, Duo, Google Workspace, Azure AD, Ping, etc.
  • Demonstrated success developing and implementing tooling that solves problems related to identity lifecycle and provisioning, SSO, MFA, ABAC, RBAC, directory services, zero trust networking, PAM, PIM, and secrets management
  • Experience configuring and implementing modern open source tooling such as Terraform, Ansible, Kubernetes, and Docker
  • Fluency in a modern programming language (Golang, Python, Ruby, Java, C#, etc.)
  • Strong experience using and managing AWS, GCP, Azure, or other cloud environments with IaC
  • Strong understanding of CI/CD workflows, automation frameworks, and best practices
  • Clear communication: demonstrated ability to explain technical concepts simply
  • Self-starter with a continuous learning mindset
  • Demonstrate critical thinking under pressure
  • Engage in a dynamic role that combines traditional operations responsibilities and active contributions to the development and deployment of cloud-native applications, fostering a DevOps culture that emphasizes collaboration and automation
  • Partner across Coinbase to design, implement, and maintain performant, reliable, and secure system architectures
  • Provide corporate IAM and DevOps tooling subject matter expertise to adjacent IT, Security, and Engineering teams
  • Implement automation tooling and scripts to eliminate manual, repetitive tasks and reduce inefficiencies in system operations
  • Create comprehensive documentation and runbooks that detail system configurations, operational procedures, and troubleshooting steps across system lifecycle
  • Build and maintain CI/CD pipelines for integrating changes and deploying to production in progressively tested environments
  • Deliver configurations and maintain state using configuration management tools
  • Facilitate incident response, and conduct root cause analysis and blameless retrospectives
  • Define metrics and bolster monitoring/observability across corporate IAM systems
  • Participate in regular on-call rotation to ensure 24x7 uptime for critical systems

🪄 Skills: AWS, Docker, Python, Cloud Computing, Kubernetes, LDAP, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted 6 days ago

📍 France

🏢 Company: Sinch | 👥 1001-5000 | 💰 $48,845,918 Post-IPO Debt 7 months ago | Messaging, SaaS, Telecommunications, Mobile, Software

  • Background in infrastructure, operations, or software engineering.
  • Experience with cloud providers such as GCP.
  • Proficiency in configuration management tools such as Terraform and Ansible.
  • Hands-on proficiency with modern monitoring tools like Prometheus and Grafana.
  • Experience with distributed data stores such as Cassandra, PostgreSQL, and ElasticSearch.
  • Experience with Python and Bash is beneficial.
  • Strong technical skills across various infrastructure technologies.
  • Strong communication skills.
  • Experience operating and maintaining production systems in a Linux and public cloud environment.
  • Partner with product engineering teams to identify system requirements.
  • Build and support our cloud-based infrastructure.
  • Automate routine processes and remediation tasks.
  • Develop, monitor and track Service Level Objectives (SLOs) for the systems under management.
  • Proactively troubleshoot, resolve, and plan for issues that typically come from support staff, other engineering teams, and our automated monitoring system.
  • Ensure our datastores are healthy and operate at optimal performance levels.
  • Contribute to the growth and culture of our engineering team.

🪄 Skills: Docker, PostgreSQL, Python, Bash, ElasticSearch, GCP, Kubernetes, Cassandra, Grafana, Prometheus, Linux, Terraform, Ansible

Posted 6 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros | 👥 11-50 | 💰 $17,000,000 about 2 years ago | Cryptocurrency

  • An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
  • Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
  • A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
  • Knowledge and experience in managing configuration at scale.
  • Experience with CI/CD pipelines and version control best practices.
  • Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
  • Strong knowledge of cloud security and IAM policies is required.
  • SIEM and threat management experience.
  • Must know how to secure Mac and Linux endpoints.
  • Python and Bash experience is a must.
  • Participate in the on-call roster to support our trading operations.
  • Maintain and improve our global infrastructure with high performance and reliability requirements.
  • Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
  • Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
  • Develop internal tools and automation to accomplish the team’s goals.
  • Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
  • Active participation in various trading and infrastructure projects.
  • Work closely with developers, traders and other staff to accomplish our firm’s goals.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

Posted 6 days ago

📍 United States

🧭 Full-Time

💸 175,000 - 220,000 USDC per year

🔍 Software Development

🏢 Company: Orca | 👥 11-50 | 💰 $18,000,000 Series A over 3 years ago | Cryptocurrency, Blockchain, Online Portals, Information Technology

  • A strong track record of working on high-performance, scalable systems with expertise in release engineering, infrastructure, and operations.
  • Extensive experience with AWS services (e.g., ECS, Copilot, CloudWatch) and the ability to troubleshoot and optimize cloud-based systems.
  • Hands-on experience with tools like GitHub Actions for reliable and efficient deployment workflows.
  • Familiarity with tools like Datadog to build actionable monitoring and alerting systems.
  • Proficiency in infrastructure-as-code tools like Terraform, and containerization tools like Docker. Experience with orchestrators like Kubernetes or Airflow is a plus.
  • Comfortable working independently in an async environment while collaborating effectively with a team. You understand trade-offs and advocate for pragmatic solutions.
  • Familiarity with Decentralized Finance (DeFi) concepts, AMMs, and the Solana ecosystem is a plus but not required.
  • Design, manage, and optimize AWS infrastructure with a focus on scalability, reliability, and cost efficiency.
  • Triage and resolve critical infrastructure issues proactively.
  • Build and refine CI/CD processes using modern tools, ensuring seamless, secure, and efficient deployments.
  • Develop robust monitoring, logging, and alerting systems using tools like Datadog or Grafana to improve visibility and system performance.
  • Architect systems that handle growth effortlessly, minimize downtime, and maintain high performance.
  • Implement effective alerting mechanisms to prioritize and address critical issues proactively.
  • Optimize and document infrastructure processes, leveraging tools like Terraform, Docker, and Airflow to create scalable and maintainable systems.
  • Partner with engineering teams to design and refine infrastructure that powers features like real-time monitoring, automated transaction execution, and analytics.

🪄 Skills: AWS, Docker, PostgreSQL, Kubernetes, Airflow, Grafana, Rust, CI/CD, Linux, DevOps, Terraform

Posted 10 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 150,000 USD per year

🔍 Software Development

🏢 Company: Echo360 Inc

  • 5+ years of experience as a Site Reliability Engineer or similar role.
  • Strong understanding of AWS cloud services, including DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, EKS, ECS and EC2.
  • Experience with infrastructure automation tools like Ansible, Terraform, or CloudFormation.
  • Experience with monitoring and alerting tools like CloudWatch, Datadog, Prometheus, Grafana, Kibana, and PagerDuty.
  • Experience with GitHub Actions, CI/CD pipelines, and deployment strategies.
  • Strong problem-solving and analytical skills.
  • Excellent communication and collaboration skills.
  • Ability to work independently and take ownership of complex tasks.
  • Passion for technology and a desire to learn and grow.
  • Experience with Jenkins, PostgreSQL, and MongoDB.
  • Experience with cloud cost optimization, security best practices and tools.
  • Experience working in a fast-paced, agile environment.
  • Experience with Rancher, Cattleprod, and TeamCity is a plus.
  • Ensure service reliability and SLO/SLA adherence in production, preventing incidents by proactively conducting failure testing.
  • Implement automated monitoring and alerting systems for early detection of potential problems.
  • Collaborate with development teams to perform deployments and rollbacks with minimal disruption.
  • Optimize the performance and scalability of our AWS infrastructure, including RDS, DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, SES, EKS, ECS, and EC2.
  • Automate infrastructure provisioning and deployment processes using Terraform, CI/CD pipelines, and configuration management tools.
  • Proactively identify and address potential security vulnerabilities to maintain compliance, IAM best practices, and secrets management.
  • Participate in incident response and post-mortem analysis activities to identify root causes and prevent future occurrences.
  • Help onboard and mentor junior team members, sharing your knowledge and expertise.
  • Stay up to date on the latest cloud technologies and best practices for SRE.
  • Participate in a well-structured on-call rotation with other Site Reliability Engineers.
  • Explore new technologies and innovative solutions to improve service quality and speed to market.
  • Participate in technical discussions and deep dives with the other engineering and product teams.

🪄 Skills: AWS, PostgreSQL, DynamoDB, Jenkins, Kafka, Kibana, MongoDB, MySQL, Grafana, Prometheus, CI/CD, Agile methodologies, Linux, DevOps, Terraform, Microservices, Ansible

Posted 10 days ago