Site Reliability Engineer

Posted 5 days ago

Requirements:
  • Supporting and troubleshooting.
  • Using automation and configuration management tools (Octopus, TeamCity, Terraform)
  • AWS Cloud infrastructure, CDNs, and various other systems running in multiple data centres and environments
  • Cloud Application Load Balancer, preferably with experience on AWS ALB
  • Cloud DNS support such as AWS Route 53, GCP Cloud DNS, or Azure DNS
  • Serverless Computing such as AWS Lambda
  • Cloud Firewall such as AWS WAF
  • Server virtualisation such as VMware, and IaaS and PaaS clouds such as AWS and Azure
  • Open-source monitoring and alerting tools (Prometheus, Loki, Grafana and Jaeger)
  • Scripting in Python, Bash, PowerShell or others
  • Microsoft SQL databases: using stored procedures, locking/unlocking tables, and running SELECT statements to assess impact and diagnose problems
Responsibilities:
  • Being the first point of technical escalation of issues within our infrastructure both in cloud and on-prem.
  • Participating in stand-ups with the development teams and informing your squad of updates and changes to our platform.
  • Automating everything (workflows and tools), such as deployments of distributed applications and infrastructure, using various scripting languages so that our 24/7 Incident Engineers can mitigate incidents without escalation.
  • Analysing, diagnosing and solving issues in the production environment with a minimal number of escalations to 3rd-level support teams.
  • Participating in the Change Management process by reviewing RFCs to ensure the “Definition of Done” is met, as well as executing and supporting software and hardware deployments.
  • Developing and documenting ways of working between the LiveOps (NOC) team and the development teams to improve efficiency in diagnostics and impact mitigation.

Related Jobs

🔥 Mid Site Reliability Engineer
Posted about 8 hours ago

📍 Brazil, Mexico, Peru

🧭 Full-Time

🔍 Software Development

🏢 Company: Zipdev 👥 11-50 | Web Development, Web Design, Software

  • 3-4 years of proven professional experience as a Site Reliability Engineer.
  • Experience with one or more general-purpose programming/scripting languages including but not limited to: Python, Bash, Perl or Go.
  • Fundamental knowledge of technologies across a broad range of disciplines: virtualization, storage, networking, servers, and security
  • Demonstrable knowledge of Unix, TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures.
  • Experience in analyzing logs and troubleshooting large-scale distributed systems.
  • Build systems and infrastructure to monitor complex, large-scale distributed systems
  • Identify stability/performance issues and collaborate with developers to triage critical issues in production systems.
  • Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
  • Devise ways to actively monitor system throughput, capacity and reliability.
  • Ability to debug complex systems and evolve a running environment without downtime.
  • Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
  • Drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization.

AWS, Docker, Python, Bash, Cloud Computing, ElasticSearch, Kubernetes, *Nix, Zabbix, Algorithms, Data Structures, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Networking, Troubleshooting, JSON, Ansible, Scripting, Debugging


📍 United States

🧭 Full-Time

💸 72,700 - 145,400 USD per year

🔍 Software Development

🏢 Company: careers

  • 3 years of experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer
  • Experience with AWS, Azure, or GCP cloud infrastructure
  • Experience with PHP and JavaScript/TypeScript
  • Bachelor's degree in business / information technology / computer science, or equivalent qualifications or experience.
  • Be the escalation point for problems and incidents for our Customer Support teams.
  • Triage problems and incidents, and either resolve them, or escalate them to the technical specialists that can resolve them.
  • Internally communicate the status of problems and incidents.
  • Generate Root Cause Analysis (RCA) statements for internal and external use.
  • Champion corrective and preventative actions internally to ensure similar problems and incidents don't happen again.
  • Design, implement, maintain, and continuously improve our monitoring and alerting mechanisms.
  • Proactively explore and drive improvements to the overall quality and reliability of our software platform.
  • Measure and report on the overall quality of service of the software platform, including incidents, actions, and SLA metrics.

AWS, PHP, Cloud Computing, GCP, JavaScript, TypeScript, Azure, CI/CD, DevOps

Posted 1 day ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

  • 5+ years of Ruby/Rails experience
  • 3+ years of AWS experience
  • Kubernetes experience
  • Experience with profiling and benchmarking source code
  • Effective at code review, and identifying potential performance problems before they reach production
  • Experience with Datadog or other APM tools
  • Excellent written and verbal communication skills
  • Proactively identify, triage, and resolve performance issues
  • Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
  • Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
  • Optimize performance through instance configuration and monitoring
  • Collaborate with other SREs to proactively identify and address performance bottlenecks
  • Lead database capacity planning and upgrade initiatives
  • Manage the database-specific components of disaster recovery planning and execution
  • Oversee backup systems and pre-production databases
  • Create and maintain infrastructure and operations documentation
  • Participate in the on-call rotation

AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

Posted 5 days ago

🏢 Company: DeepSource Technologies

  • Expertise in Google Cloud networking, Compute Engine, Kubernetes (GKE), Cloud Functions, and Cloud Storage.
  • Strong knowledge of Terraform, Ansible, or other Infrastructure as Code (IaC) tools.
  • Experience with Google Kubernetes Engine (GKE), microservices, and container orchestration.
  • Hands-on experience with FinOps tools and cost optimization strategies in cloud environments.
  • Familiarity with monitoring and logging solutions such as Google Cloud Operations Suite (formerly Stackdriver), Prometheus, and Grafana.
  • Experience with CI/CD pipelines, automation, and GitOps best practices.
  • Strong understanding of SRE principles, SLAs, SLOs, and error budgets.
  • Manage and maintain GCP infrastructure, ensuring high availability, scalability, and system reliability.
  • Monitor and forecast resource utilization, performance trends, and infrastructure scaling needs to optimize cloud costs and efficiency.
  • Design and implement highly available, fault-tolerant, and resilient cloud architectures, leveraging Infrastructure as Code (IaC) tools such as Terraform and Ansible.
  • Utilize Google Cloud Monitoring, Cloud Logging, and third-party tools to proactively detect and resolve performance issues.
  • Analyze and optimize cloud spending, implement cost controls, recommend rightsizing strategies, and ensure efficient resource allocation.
  • Implement best practices for IAM, network security, encryption, and compliance frameworks (SOC2, ISO 27001, NIST).
  • Collaborate with DevOps teams to streamline deployment processes, automate workflows, and optimize application performance.
  • Design and implement disaster recovery (DR) plans, backup strategies, and failover mechanisms to ensure business continuity.
  • Maintain comprehensive documentation of infrastructure, best practices, and optimization strategies while working closely with cross-functional teams.
Posted 5 days ago

📍 United States

🧭 Full-Time

💸 95,000 - 160,000 USD per year

🔍 Cybersecurity

🏢 Company: crowdstrikecareers

  • 5-7+ years of experience in Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure roles.
  • Experience managing Virtual Desktop Infrastructure (VDI) solutions such as Citrix, VMware Horizon, or AWS WorkSpaces.
  • Hands-on experience with AWS GovCloud (Azure/GCP is a plus).
  • Strong expertise in Infrastructure as Code (Terraform, CloudFormation).
  • Experience with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK, Datadog, Splunk).
  • Expertise in IAM and PAM solutions such as Okta, CyberArk, or AWS IAM.
  • Strong scripting and automation skills (Python, Bash, PowerShell).
  • Experience with CI/CD pipelines and DevOps workflows.
  • Familiarity with FedRAMP, NIST 800-53, DoD IL 4/5 compliance standards.
  • Hands-on experience with VDI management, performance tuning, and security hardening.
  • Architect, deploy, and maintain highly available, scalable, and secure systems in AWS GovCloud (Azure and GCP experience is a plus).
  • Automate infrastructure provisioning, scaling, and failover using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
  • Implement SLOs, SLIs, and error budgets to drive reliability improvements.
  • Optimize cloud infrastructure for performance, cost-efficiency, and resilience while adhering to compliance requirements.
  • Manage and optimize Virtual Desktop Infrastructure (VDI) solutions to ensure seamless user experience, performance, and security.
  • Deploy and manage monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, Datadog, Splunk, ELK).
  • Implement automated self-healing mechanisms and proactive monitoring solutions.
  • Lead incident response, postmortems, and root cause analysis (RCA) to prevent future system disruptions.
  • Ensure 24/7 system uptime through on-call rotation and escalation handling.
  • Implement Identity and Access Management (IAM) best practices, including SSO, MFA, and RBAC across cloud environments.
  • Automate IAM governance and Privileged Access Management (PAM) to enforce the principle of least privilege.
  • Ensure audit readiness by maintaining accurate security configurations, logs, and compliance reports.
  • Work with security teams to align IAM and Zero Trust Architecture (ZTA) strategies with organizational policies.
  • Develop and maintain CI/CD pipelines for automated deployments and configuration management.
  • Use Python, Bash, or PowerShell to automate routine SRE workflows and security compliance checks.
  • Implement immutable infrastructure and support DevSecOps best practices.
  • Manage and optimize VDI environments, ensuring seamless DevOps integration for development and operational teams.
  • Contribute to chaos engineering and failure injection testing to enhance system resiliency.
  • Work closely with DevOps, IT Security, and Compliance teams to ensure system integrity and uptime.
  • Provide mentorship to junior engineers and contribute to knowledge-sharing initiatives.
  • Participate in architectural discussions and help drive improvements in cloud reliability and security posture.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Grafana, Prometheus, CI/CD, Linux, DevOps, Terraform, Compliance, Ansible, Scripting

Posted 6 days ago

🧭 Full-Time

💸 159,000 - 215,000 USD per year

🔍 Software Development

🏢 Company: Kentik 👥 101-250 💰 $40,000,000 Series C over 3 years ago | Cloud Data Services, Information Technology, Network Security, Software

  • 5+ years of experience in cloud-based Systems Administration, IT and/or SRE related projects
  • Strong experience with public cloud, container and orchestration technologies including AWS, GCP, Azure, Kubernetes, and Docker
  • Solid programming and automation skills (Bash, Python, Go), including experience working with configuration management (infrastructure as code) platforms such as Terraform, Ansible, and Puppet
  • Experience working with *nix system command line (e.g. ssh, grep, awk)
  • Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS)
  • Network administration experience: concepts such as routing, firewalls (iptables), and peering sound familiar
  • A passion for documenting code, processes, and infrastructure in runbooks and wikis
  • Experience with metrics monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry
  • Experience creating and managing tickets with third party vendors and owning cloud vendor partner relationships
  • Ensure our real-time, scalable infrastructure is set up for growth and working efficiently.
  • Work on tools and processes to better monitor our platform and ensure its stability through our rapid growth
  • Deep-dive into diverse topics, from firewalls and IP routing, to database replication strategies or automating build processes
  • Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective
  • Assist with expanding our cloud deployments across the major cloud providers
  • Contribute code, code reviews and tools or patches to all kinds of existing code
  • Write design documents or collaborate on colleagues’ docs to introduce new features or changes into our infrastructure
  • Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team
Posted 6 days ago

📍 United Kingdom

🔍 Blockchain

🏢 Company: IO Global

  • Proficiency in Python, Bash, Terraform, Nix for DevOps services.
  • Extensive experience with AWS, specifically with services like EKS and RDS.
  • Familiarity with Container orchestration (e.g. Kubernetes) is essential.
  • Hands-on experience with PostgreSQL and its deployment on RDS.
  • Knowledge of monitoring tools (e.g., Prometheus, Grafana, Loki).
  • Solid troubleshooting and performance tuning capabilities.
  • Exceptional communication skills and team collaboration ethic.
  • Experience with CI/CD (e.g. GitHub Actions, Hydra, Earthly).
  • Design, write, and deliver tools and software primarily using Python, Bash, Terraform or Nix to improve the availability, scalability, and efficiency of our services.
  • Engage in and refine the whole lifecycle of services, from inception and design, through deployment, operation, and continuous improvement.
  • Practice sustainable incident response and promote blameless postmortems.
  • Collaborate with the development teams to ensure that solutions are designed with customer experience, scalability, and performance in mind.
  • Analyze system performance and reliability, offering recommendations for enhancement.
  • Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services.
  • Participate in on-call rotations, responding to and mitigating service interruptions and technical challenges.

AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, Grafana, Prometheus, CI/CD, DevOps, Terraform

Posted 6 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Platform.sh 👥 251-500 💰 $140,000,000 Series D almost 3 years ago | Internet, Open Source, PaaS, Cloud Management, Software

  • DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
  • Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
  • Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
  • Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
  • Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
  • Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
  • Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.
  • Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
  • Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
  • Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
  • Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
  • Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
  • Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
  • Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted 6 days ago

📍 United States

💸 140,000 - 160,000 USD per year

  • Expertise with multi-region deployments in public cloud environments
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc)
  • Strong background in Reliability Engineering, DevOps, Software Engineering
  • Fluency with at least one programming language, such as C#, Python, or Go
  • Experience with cloud deployment and automation tools/methodologies (e.g. GitOps, Terraform, Pulumi)
  • Proficiency using source control such as Git.
  • Ability to maintain discretion, handle sensitive information, and improve security best practices
  • Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
  • Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
  • Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
  • Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
  • Architectural designs and engineering operations at scale
  • Active participation in code reviews, learning and spreading technical knowledge
  • Contribute to and mature incident management/escalation processes
  • Collaborate with cross functional teams to refine priorities and deliverables
  • Ongoing engagement with product owners to align SLIs/SLOs/SLAs
  • Evaluate and identify opportunities for new initiatives to support organizational needs
  • Evolve and influence Bitwarden's SDLC as we scale
  • Provide mentorship to teammates

AWS, Docker, Python, Cloud Computing, Git, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Software Engineering, SaaS

Posted 6 days ago

📍 France, Germany, Spain, the United Kingdom, West Coast in the United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Remote Woman

  • A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
  • Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
  • Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
  • Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
  • Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
  • Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
  • Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.
  • Refine Monitoring and Observability
  • Automate Deployments and Workflows
  • Optimize CI/CD Pipelines
  • Cloud Infrastructure Management
  • Incident Response and Post-Mortem
  • Collaborate with Cross-Functional Teams
  • Drive Technical Innovation

AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, Linux, DevOps, Terraform, Ansible, Scripting

Posted 7 days ago