Site Reliability Engineer

Posted 5 days ago

Requirements:

Supporting and troubleshooting.

Using automation and configuration management tools (Octopus, TeamCity, Terraform)

AWS Cloud infrastructure, CDNs, and other various systems running in multiple data centres and environments

Cloud Application Load Balancer, preferably with experience on AWS ALB

Cloud DNS support such as AWS Route 53, GCP Cloud DNS, or Azure DNS

Serverless Computing such as AWS Lambda

Cloud Firewall such as AWS WAF

Server virtualisation such as VMware, and IaaS and PaaS clouds such as AWS and Azure

Open-source monitoring and alerting tools (Prometheus, Loki, Grafana and Jaeger)

Scripting in Python, Bash, PowerShell or others

Microsoft SQL Server databases: using stored procedures, locking/unlocking tables, and running SELECT statements to assess impact and diagnose problems

Responsibilities:

Being the first point of technical escalation for issues within our infrastructure, both in the cloud and on-prem.

Participating in stand-ups with the development teams and informing your squad of updates and changes to our platform.

Automating everything: workflow and tool automation, such as deployments of distributed applications and infrastructure, using various scripting languages so that our 24/7 Incident Engineers can mitigate incidents without escalation.

Analysing, diagnosing and solving issues in the production environment with a minimal number of escalations to 3rd-level support teams.

Participating in the Change Management process by reviewing RFCs against the “Definition of Done”, as well as executing and supporting software and hardware deployments.

Developing and documenting ways of working between the LiveOps (NOC) team and the development teams to improve efficiency in diagnostics and impact mitigation.

Related Jobs

🔥 Mid Site Reliability Engineer

Posted about 8 hours ago

📍 Brazil, Mexico, Peru

🧭 Full-Time

🔍 Software Development

🏢 Company: Zipdev 👥 11-50 (Web Development, Web Design, Software)

🔧 Requirements

3-4 years of proven professional experience as a Site Reliability Engineer.
Experience with one or more general-purpose programming/scripting languages including but not limited to: Python, Bash, Perl or Go.
Fundamental knowledge of technologies across a broad range of disciplines: virtualization, storage, networking, servers, and security.
Demonstrable knowledge of Unix, TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures.
Experience in analyzing logs and troubleshooting large-scale distributed systems.

💡 Responsibilities

Build systems and infrastructure to monitor complex, large-scale distributed systems
Identify stability/performance issues and collaborate with developers to triage critical issues in production systems.
Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
Devise ways to actively monitor system throughput, capacity and reliability.
Debug complex systems and evolve a running environment without downtime.
Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
Drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization.

AWS, Docker, Python, Bash, Cloud Computing, ElasticSearch, Kubernetes, *Nix, Zabbix, Algorithms, Data Structures, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Networking, Troubleshooting, JSON, Ansible, Scripting, Debugging

🔥 Site Reliability Engineer (CST or EST Remote)

Posted 1 day ago

📍 United States

🧭 Full-Time

💸 72,700 - 145,400 USD per year

🔍 Software Development

🏢 Company: careers

🔧 Requirements

3 years of experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer
Experience with AWS, Azure, or GCP cloud infrastructure
Experience with PHP and JavaScript/TypeScript
Bachelor's degree in business / information technology / computer science, or equivalent qualifications or experience.

💡 Responsibilities

Be the escalation point for problems and incidents for our Customer Support teams.
Triage problems and incidents, and either resolve them, or escalate them to the technical specialists that can resolve them.
Internally communicate the status of problems and incidents.
Generate Root Cause Analysis (RCA) statements for internal and external use.
Champion corrective and preventative actions internally to ensure similar problems and incidents don't happen again.
Design, implement, maintain, and continuously improve our monitoring and alerting mechanisms.
Proactively explore and drive improvements to the overall quality and reliability of our software platform.
Measure and report on the overall quality of service of the software platform, including incidents, actions, and SLA metrics.

AWS, PHP, Cloud Computing, GCP, JavaScript, TypeScript, Azure, CI/CD, DevOps

🔥 Senior Site Reliability Engineer, Performance

Posted 5 days ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

🔧 Requirements

5+ years of Ruby/Rails experience
3+ years of AWS experience
Kubernetes experience
Experience with profiling and benchmarking source code
Effective at code review, and identifying potential performance problems before they reach production
Experience with Datadog or other APM tools
Excellent written and verbal communication skills

💡 Responsibilities

Proactively identify, triage, and resolve performance issues
Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
Optimize performance through instance configuration and monitoring
Collaborate with other SREs to proactively identify and address performance bottlenecks
Lead database capacity planning and upgrade initiatives
Manage the database-specific components of disaster recovery planning and execution
Oversee backup systems and pre-production databases
Create and maintain infrastructure and operations documentation
Participate in the on-call rotation

AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

🔥 Expert Site Reliability Engineer - GCP

Posted 5 days ago

🏢 Company: DeepSource Technologies

🔧 Requirements

Expertise in Google Cloud networking, Compute Engine, Kubernetes (GKE), Cloud Functions, and Cloud Storage.
Strong knowledge of Terraform, Ansible, or other Infrastructure as Code (IaC) tools.
Experience with Google Kubernetes Engine (GKE), microservices, and container orchestration.
Hands-on experience with FinOps tools and cost optimization strategies in cloud environments.
Familiarity with monitoring and logging solutions such as Google Cloud Operations Suite (formerly Stackdriver), Prometheus, and Grafana.
Experience with CI/CD pipelines, automation, and GitOps best practices.
Strong understanding of SRE principles, SLAs, SLOs, and error budgets.

💡 Responsibilities

Manage and maintain GCP infrastructure, ensuring high availability, scalability, and system reliability.
Monitor and forecast resource utilization, performance trends, and infrastructure scaling needs to optimize cloud costs and efficiency.
Design and implement highly available, fault-tolerant, and resilient cloud architectures, leveraging Infrastructure as Code (IaC) tools such as Terraform and Ansible.
Utilize Google Cloud Monitoring, Cloud Logging, and third-party tools to proactively detect and resolve performance issues.
Analyze and optimize cloud spending, implement cost controls, recommend rightsizing strategies, and ensure efficient resource allocation.
Implement best practices for IAM, network security, encryption, and compliance frameworks (SOC2, ISO 27001, NIST).
Collaborate with DevOps teams to streamline deployment processes, automate workflows, and optimize application performance.
Design and implement disaster recovery (DR) plans, backup strategies, and failover mechanisms to ensure business continuity.
Maintain comprehensive documentation of infrastructure, best practices, and optimization strategies while working closely with cross-functional teams.

🔥 Sr. Site Reliability Engineer - GovCloud (Remote)

Posted 6 days ago

📍 United States

🧭 Full-Time

💸 95,000 - 160,000 USD per year

🔍 Cybersecurity

🏢 Company: crowdstrikecareers

🔧 Requirements

5-7+ years of experience in Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure roles.
Experience managing Virtual Desktop Infrastructure (VDI) solutions such as Citrix, VMware Horizon, or AWS WorkSpaces.
Hands-on experience with AWS GovCloud (Azure/GCP is a plus).
Strong expertise in Infrastructure as Code (Terraform, CloudFormation).
Experience with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK, Datadog, Splunk).
Expertise in IAM and PAM solutions such as Okta, CyberArk, or AWS IAM.
Strong scripting and automation skills (Python, Bash, PowerShell).
Experience with CI/CD pipelines and DevOps workflows.
Familiarity with FedRAMP, NIST 800-53, DoD IL 4/5 compliance standards.
Hands-on experience with VDI management, performance tuning, and security hardening.

💡 Responsibilities

Architect, deploy, and maintain highly available, scalable, and secure systems in AWS GovCloud (Azure and GCP experience is a plus).
Automate infrastructure provisioning, scaling, and failover using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
Implement SLOs, SLIs, and error budgets to drive reliability improvements.
Optimize cloud infrastructure for performance, cost-efficiency, and resilience while adhering to compliance requirements.
Manage and optimize Virtual Desktop Infrastructure (VDI) solutions to ensure seamless user experience, performance, and security.
Deploy and manage monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, Datadog, Splunk, ELK).
Implement automated self-healing mechanisms and proactive monitoring solutions.
Lead incident response, postmortems, and root cause analysis (RCA) to prevent future system disruptions.
Ensure 24/7 system uptime through on-call rotation and escalation handling.
Implement Identity and Access Management (IAM) best practices, including SSO, MFA, and RBAC across cloud environments.
Automate IAM governance and Privileged Access Management (PAM) to enforce the principle of least privilege.
Ensure audit readiness by maintaining accurate security configurations, logs, and compliance reports.
Work with security teams to align IAM and Zero Trust Architecture (ZTA) strategies with organizational policies.
Develop and maintain CI/CD pipelines for automated deployments and configuration management.
Use Python, Bash, or PowerShell to automate routine SRE workflows and security compliance checks.
Implement immutable infrastructure and support DevSecOps best practices.
Manage and optimize VDI environments, ensuring seamless DevOps integration for development and operational teams.
Contribute to chaos engineering and failure injection testing to enhance system resiliency.
Work closely with DevOps, IT Security, and Compliance teams to ensure system integrity and uptime.
Provide mentorship to junior engineers and contribute to knowledge-sharing initiatives.
Participate in architectural discussions and help drive improvements in cloud reliability and security posture.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Grafana, Prometheus, CI/CD, Linux, DevOps, Terraform, Compliance, Ansible, Scripting

🔥 Sr Site Reliability Engineer, Cloud

Posted 6 days ago

🧭 Full-Time

💸 159,000 - 215,000 USD per year

🔍 Software Development

🏢 Company: Kentik 👥 101-250 💰 $40,000,000 Series C over 3 years ago (Cloud Data Services, Information Technology, Network Security, Software)

🔧 Requirements

5+ years of experience in cloud-based Systems Administration, IT and/or SRE related projects
Strong experience with public cloud, container and orchestration technologies including AWS, GCP, Azure, Kubernetes, and Docker
Solid programming and automation skills (Bash, Python, Go) including experience working with configuration management (infrastructure as code) platforms such as Terraform, Ansible, and Puppet
Experience working with *nix system command line (e.g. ssh, grep, awk)
Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS)
Network administration experience: concepts such as routing, firewalls (iptables), and peering sound familiar
A passion for documenting code, processes, and infrastructure in runbooks and wikis
Experience with metrics monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry
Experience creating and managing tickets with third party vendors and owning cloud vendor partner relationships

💡 Responsibilities

Ensure our real-time, scalable infrastructure is set up for growth and working efficiently.
Work on tools and processes to better monitor our platform and ensure its stability through our rapid growth
Deep-dive into diverse topics, from firewalls and IP routing, to database replication strategies or automating build processes
Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective
Assist with expanding our cloud deployments across the major cloud providers
Contribute code, code reviews and tools or patches to all kinds of existing code
Write design documents or collaborate on colleagues’ docs to introduce new features or changes into our infrastructure
Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team

🔥 Site Reliability Engineer IOE: Cardano

Posted 6 days ago

📍 United Kingdom

🔍 Blockchain

🏢 Company: IO Global

🔧 Requirements

Proficiency in Python, Bash, Terraform, and Nix for DevOps services.
Extensive experience with AWS, specifically with services like EKS and RDS.
Familiarity with Container orchestration (e.g. Kubernetes) is essential.
Hands-on experience with PostgreSQL and its deployment on RDS.
Knowledge of monitoring tools (e.g., Prometheus, Grafana, Loki).
Solid troubleshooting and performance tuning capabilities.
Exceptional communication skills and team collaboration ethic.
Experience with CI/CD (e.g. GitHub Actions, Hydra, Earthly).

💡 Responsibilities

Design, write, and deliver tools and software primarily using Python, Bash, Terraform or Nix to improve the availability, scalability, and efficiency of our services.
Engage in and refine the whole lifecycle of services, from inception and design, through deployment, operation, and continuous improvement.
Practice sustainable incident response and promote blameless postmortems.
Collaborate with the development teams to ensure that solutions are designed with customer experience, scalability, and performance in mind.
Analyze system performance and reliability, offering recommendations for enhancement.
Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services.
Participate in on-call rotations, responding to and mitigating service interruptions and technical challenges.

AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, Grafana, Prometheus, CI/CD, DevOps, Terraform

🔥 Site Reliability Engineer

Posted 6 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Platform.sh 👥 251-500 💰 $140,000,000 Series D almost 3 years ago (Internet, Open Source, PaaS, Cloud Management, Software)

🔧 Requirements

DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice to have.
Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer

Posted 6 days ago

📍 United States

💸 140,000 - 160,000 USD per year

🔧 Requirements

Expertise with multi-region deployments in public cloud environments
Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc)
Strong background in Reliability Engineering, DevOps, Software Engineering
Fluency with at least one programming language, such as C#, Python, or Go
Experience with cloud deployment and automation tools/methodologies (e.g. GitOps, Terraform, Pulumi)
Proficiency using source control such as Git.
Ability to maintain discretion, handle sensitive information, and improve security best-practices

💡 Responsibilities

Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
Architectural designs and engineering operations at scale
Active participation in code reviews, learning and spreading technical knowledge
Contribute to and mature incident management/escalation processes
Collaborate with cross functional teams to refine priorities and deliverables
Ongoing engagement with product owners to align SLIs/SLOs/SLAs
Evaluate and identify opportunities for new initiatives to support organizational needs
Evolve and influence Bitwarden's SDLC as we scale
Provide mentorship to teammates

AWS, Docker, Python, Cloud Computing, Git, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Software Engineering, SaaS

🔥 Site Reliability Engineer

Posted 7 days ago

📍 France, Germany, Spain, the United Kingdom, West Coast in the United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Remote Woman

🔧 Requirements

A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice to have.
Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability
Automate Deployments and Workflows
Optimize CI/CD Pipelines
Cloud Infrastructure Management
Incident Response and Post-Mortem
Collaborate with Cross-Functional Teams
Drive Technical Innovation

AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, Linux, DevOps, Terraform, Ansible, Scripting
