Site Reliability Engineer

Posted 13 days ago

💎 Seniority level: Senior

🔍 Industry: Fintech

🏢 Company: Pleo

👥 501-1000

💰 $42,922,001 Debt Financing 11 months ago

🫂 Last layoff over 2 years ago

Mobile Payments, Financial Services, Payments, Information Technology, FinTech

🗣️ Languages: English

Requirements:

Experience solving complex technical challenges at scale.

Ensure a high bar for quality and reliability within your team.

Coach others to help them develop as engineers.

Advocate for a more thorough code review process than a quick scroll to the bottom of the page and a “LGTM!”

Help to design the overall solution.

Be sought after within your team for help in solving challenging problems.

Be a force multiplier within your team - your work enables other engineers to do even better.

Able to raise and describe technical debt faced by your team, and then propose a solution or path forward.

Responsibilities:

Related Jobs

🔥 Senior Site Reliability Engineer

Posted about 2 hours ago

📍 USA

🧭 Full-Time

💸 140,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Juniper Square

🔧 Requirements

5+ years of experience in SRE, DevOps, or Infrastructure Engineering with a proven track record of ownership and initiative.
Strong experience with Kubernetes, Helm, and CNIs, including networking and security.
Proficiency in AWS services such as RDS, Aurora, IAM, VPC, EKS, and EC2.
Experience in PostgreSQL administration, including performance tuning and high availability in RDS/Aurora.
Hands-on experience with GitHub Actions and ArgoCD for secure and scalable CI/CD automation.
Strong background in Infrastructure as Code (IaC) with Crossplane and Terraform.
Deep understanding of observability and monitoring with Datadog.
Experience with Kyverno for Kubernetes policy-based security enforcement.
Proficiency in Python and Bash scripting for automation and system management.
Strong understanding of CI/CD security best practices and ability to implement controls for securing deployments.

💡 Responsibilities

Own reliability and scalability initiatives—identify, prioritize, and implement solutions before issues escalate.
Participate in an on-call rotation, responding to incidents, performing root cause analysis, and driving long-term fixes.
Design, deploy, and manage Kubernetes clusters using Helm charts, Cilium, and Karpenter to optimize performance and cost.
Architect and maintain AWS infrastructure with a focus on RDS/Aurora PostgreSQL, networking, and scaling best practices.
Implement GitHub Actions CI/CD pipelines, integrating security best practices and automation.
Define and enforce policy-based security for Kubernetes using Kyverno.
Automate infrastructure provisioning with Crossplane and Terraform to ensure consistency and scalability.
Enhance observability and monitoring using Datadog to proactively detect and resolve issues.
Improve security and reliability by identifying risks in CI/CD, cloud environments, and Kubernetes, then implementing necessary safeguards.
Lead post-incident reviews, drive lessons learned into long-term improvements, and document best practices in Confluence.

AWS, PostgreSQL, Python, Bash, Kubernetes, CI/CD, DevOps, Terraform

🔥 Site Reliability Engineer

Posted 1 day ago

🏢 Company: ABBYY

👥 1001-5000

💰 almost 4 years ago

Communications Infrastructure, Analytics, Data Visualization, Software

🔧 Requirements

1 - 3 years of experience in an Infrastructure, SRE, DevOps, CloudOps role
Experience programming in one or more of the following: C#, Java, Python, .NET, Node.js, Go
Experience with Terraform, Ansible, or any similar infrastructure-as-code tool
Experience with at least one cloud technology - AWS or Azure. Preferably Azure
Experience with cloud-performant microservices and event-driven architectures
Experience with Kubernetes administration is an added advantage
Understanding of information security concepts and terminology
Distributed monitoring experience: logging, metrics, tracing, etc.
Strong knowledge of software development methodologies and passion for creating high-standard tool sets for infrastructure-as-code
Ability to analyze problems quickly and find suitable solutions based on available resources
A proactive and open-minded individual with a clear client focus and structured approach
Experience in leading and managing a team

💡 Responsibilities

Co-own critical production service designs to ensure high reliability is achievable and measurable
Drive reliability and observability improvements in the services within the engineering verticals
Using monitoring and telemetry data, help teams make informed decisions on where reliability challenges may exist and help design and build solutions to improve them
Build and improve internal tools and automation software to make maintaining production services easier and safer
Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others
Develop Infrastructure as Code
Build SRE dashboards from SLIs to measure SLO adherence
Define (from design to implementation details) necessary auto-healing and fault-tolerant systems
Act as the point of contact for production application issues, working closely with engineering leadership

🔥 Senior Site Reliability Engineer

Posted 1 day ago

🔍 Software Development

🏢 Company: ABBYY

👥 1001-5000

💰 almost 4 years ago

Communications Infrastructure, Analytics, Data Visualization, Software

🔧 Requirements

3 - 6+ years of experience in an Infrastructure, SRE, DevOps, CloudOps role
Experience programming in one or more of the following: C#, Java, Python, .NET, Node.js, Go
Experience with Terraform, Ansible, or any similar infrastructure-as-code tool
Experience with at least one cloud technology - AWS or Azure. Preferably Azure
Experience with cloud-performant microservices and event-driven architectures
Experience with Kubernetes administration is an added advantage.
Understanding of information security concepts and terminology
Distributed monitoring experience: logging, metrics, tracing, etc.
Strong knowledge of software development methodologies and passion for creating high-standard tool sets for infrastructure-as-code
Ability to analyze problems quickly and find suitable solutions based on available resources
A proactive and open-minded individual with a clear client focus and structured approach
Experience in leading and managing a team

💡 Responsibilities

Co-own critical production service designs to ensure high reliability is achievable and measurable
Drive reliability and observability improvements in the services within the engineering verticals
Using monitoring and telemetry data, help teams make informed decisions on where reliability challenges may exist and help design and build solutions to improve them
Build and improve internal tools and automation software to make maintaining production services easier and safer
Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others
Develop Infrastructure as Code
Build SRE dashboards from SLIs to measure SLO adherence
Define (from design to implementation details) necessary auto-healing and fault-tolerant systems
Act as the point of contact for production application issues, working closely with engineering leadership

🔥 Mid Site Reliability Engineer

Posted 2 days ago

📍 Brazil, Mexico, Peru

🧭 Full-Time

🔍 Software Development

🏢 Company: Zipdev

👥 11-50

Web Development, Web Design, Software

🔧 Requirements

3-4 years of proven professional experience as a Site Reliability Engineer.
Experience with one or more general-purpose programming/scripting languages including but not limited to: Python, Bash, Perl or Go.
Fundamental knowledge of technologies across a broad range of disciplines: virtualization, storage, networking, servers, and security.
Demonstrable knowledge of Unix, TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures.
Experience in analyzing logs and troubleshooting large-scale distributed systems.

💡 Responsibilities

Build systems and infrastructure to monitor complex, large-scale distributed systems
Identify stability/performance issues and collaborate with developers to triage critical issues in production systems.
Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
Devise ways to actively monitor system throughput, capacity and reliability.
Debug complex systems and evolve a running environment without downtime.
Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
Drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization.

AWS, Docker, Python, Bash, Cloud Computing, ElasticSearch, Kubernetes, *Nix, Zabbix, Algorithms, Data Structures, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Networking, Troubleshooting, JSON, Ansible, Scripting, Debugging

🔥 Site Reliability Engineer (CST or EST Remote)

Posted 3 days ago

📍 United States

🧭 Full-Time

💸 72,700 - 145,400 USD per year

🔍 Software Development

🏢 Company: careers

🔧 Requirements

3 years of experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer
Experience with AWS, Azure, or GCP cloud infrastructure
Experience with PHP and JavaScript/TypeScript
Bachelor's degree in business / information technology / computer science, or equivalent qualifications or experience.

💡 Responsibilities

Be the escalation point for problems and incidents for our Customer Support teams.
Triage problems and incidents, and either resolve them, or escalate them to the technical specialists that can resolve them.
Internally communicate the status of problems and incidents.
Generate Root Cause Analysis (RCA) statements for internal and external use.
Champion corrective and preventative actions internally to ensure similar problems and incidents don't happen again.
Design, implement, maintain, and continuously improve our monitoring and alerting mechanisms.
Proactively explore and drive improvements to the overall quality and reliability of our software platform.
Measure and report on the overall quality of service of the software platform, including incidents, actions, and SLA metrics.

AWS, PHP, Cloud Computing, GCP, JavaScript, TypeScript, Azure, CI/CD, DevOps

🔥 Site Reliability Engineer

Posted 7 days ago

🏢 Company: Betsson Group

👥 1001-5000

Internet, Gaming, Gambling, Online Games

🔧 Requirements

Supporting and troubleshooting.
Using automation and configuration management tools (Octopus, TeamCity, Terraform)
AWS Cloud infrastructure, CDNs, and other various systems running in multiple data centres and environments
Cloud Application Load Balancer, preferably with experience on AWS ALB
Cloud DNS support such as AWS Route 53, GCP Cloud DNS, or Azure DNS
Serverless Computing such as AWS Lambda
Cloud Firewall such as AWS WAF
Server virtualisation such as VMware, IaaS and PaaS cloud such as AWS and Azure
Open-source monitoring and alerting tools (Prometheus, Loki, Grafana and Jaeger)
Scripting in Python, Bash, PowerShell or others
Microsoft SQL databases via Stored Procedures, Locking/Unlocking tables and running select statements to assess impact and diagnose problems

💡 Responsibilities

Being the first point of technical escalation of issues within our infrastructure both in cloud and on-prem.
Participating in stand-ups with the development teams and informing your squad of updates and changes to our platform.
Automating everything – Workflow and tool automation - such as deployments of distributed applications and infrastructure using various scripting languages to allow our 24/7 Incident Engineers to mitigate incidents without escalation.
Analysing, diagnosing and solving issues in the production environment with a minimal number of escalations to 3rd Level support teams.
Participating in the Change Management process by reviewing RFCs to ensure “Definition of Done”, as well as executing and supporting software and hardware deployments.
Developing and Documenting ways-of-working between the LiveOps(NOC) Team and the development teams to improve efficiencies in diagnostics and impact mitigation.

🔥 Senior Site Reliability Engineer, Performance

Posted 7 days ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

🔧 Requirements

5+ years of Ruby/Rails experience
3+ years of AWS experience
Kubernetes experience
Experience with profiling and benchmarking source code
Effective at code review, and identifying potential performance problems before they reach production
Experience with Datadog or other APM tools
Excellent written and verbal communication skills

💡 Responsibilities

Proactively identify, triage, and resolve performance issues
Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
Optimize performance through instance configuration and monitoring
Collaborate with other SREs to proactively identify and address performance bottlenecks
Lead database capacity planning and upgrade initiatives
Manage the database-specific components of disaster recovery planning and execution
Oversee backup systems and pre-production databases
Create and maintain infrastructure and operations documentation
Participate in the on-call rotation

AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

🔥 Expert Site Reliability Engineer - GCP

Posted 7 days ago

🏢 Company: DeepSource Technologies

🔧 Requirements

Expertise in Google Cloud networking, Compute Engine, Kubernetes (GKE), Cloud Functions, and Cloud Storage.
Strong knowledge of Terraform, Ansible, or other Infrastructure as Code (IaC) tools.
Experience with Google Kubernetes Engine (GKE), microservices, and container orchestration.
Hands-on experience with FinOps tools and cost optimization strategies in cloud environments.
Familiarity with monitoring and logging solutions such as Google Cloud Operations Suite (formerly Stackdriver), Prometheus, and Grafana.
Experience with CI/CD pipelines, automation, and GitOps best practices.
Strong understanding of SRE principles, SLAs, SLOs, and error budgets.

💡 Responsibilities

Manage and maintain GCP infrastructure, ensuring high availability, scalability, and system reliability.
Monitor and forecast resource utilization, performance trends, and infrastructure scaling needs to optimize cloud costs and efficiency.
Design and implement highly available, fault-tolerant, and resilient cloud architectures, leveraging Infrastructure as Code (IaC) tools such as Terraform and Ansible.
Utilize Google Cloud Monitoring, Cloud Logging, and third-party tools to proactively detect and resolve performance issues.
Analyze and optimize cloud spending, implement cost controls, recommend rightsizing strategies, and ensure efficient resource allocation.
Implement best practices for IAM, network security, encryption, and compliance frameworks (SOC2, ISO 27001, NIST).
Collaborate with DevOps teams to streamline deployment processes, automate workflows, and optimize application performance.
Design and implement disaster recovery (DR) plans, backup strategies, and failover mechanisms to ensure business continuity.
Maintain comprehensive documentation of infrastructure, best practices, and optimization strategies while working closely with cross-functional teams.

🔥 Sr Site Reliability Engineer, Cloud

Posted 8 days ago

🧭 Full-Time

💸 159,000 - 215,000 USD per year

🔍 Software Development

🏢 Company: Kentik

👥 101-250

💰 $40,000,000 Series C over 3 years ago

Cloud Data Services, Information Technology, Network Security, Software

🔧 Requirements

5+ years of experience in cloud-based systems administration, IT, and/or SRE-related projects
Strong experience with public cloud, container and orchestration technologies including AWS, GCP, Azure, Kubernetes, and Docker
Solid programming and automation skills (Bash, Python, Go) including experience working with configuration management (infrastructure as code) platforms such as Terraform, Ansible, and Puppet
Experience working with *nix system command line (e.g. ssh, grep, awk)
Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS)
Networking administration experience: concepts such as routing, firewalls (iptables), and peering should sound familiar
A passion for documenting code, processes, and infrastructure in runbooks and wikis
Experience with metrics monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry
Experience creating and managing tickets with third party vendors and owning cloud vendor partner relationships

💡 Responsibilities

Ensure our real-time, scalable infrastructure is set up for growth and working efficiently
Work on tools and processes to better monitor our platform and ensure its stability through our rapid growth
Deep-dive into diverse topics, from firewalls and IP routing, to database replication strategies or automating build processes
Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective
Assist with expanding our cloud deployments across the major cloud providers
Contribute code, code reviews and tools or patches to all kinds of existing code
Write design documents or collaborate on colleagues’ docs to introduce new features or changes into our infrastructure
Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team

🔥 Site Reliability Engineer IOE: Cardano

Posted 8 days ago

📍 United Kingdom

🔍 Blockchain

🏢 Company: IO Global

🔧 Requirements

Proficiency in Python, Bash, Terraform, and Nix for DevOps services.
Extensive experience with AWS, specifically with services like EKS and RDS.
Familiarity with container orchestration (e.g. Kubernetes) is essential.
Hands-on experience with PostgreSQL and its deployment on RDS.
Knowledge of monitoring tools (e.g., Prometheus, Grafana, Loki).
Solid troubleshooting and performance tuning capabilities.
Exceptional communication skills and team collaboration ethic.
Experience with CI/CD (e.g. GitHub Actions, Hydra, Earthly).

💡 Responsibilities

Design, write, and deliver tools and software primarily using Python, Bash, Terraform or Nix to improve the availability, scalability, and efficiency of our services.
Engage in and refine the whole lifecycle of services, from inception and design, through deployment, operation, and continuous improvement.
Practice sustainable incident response and promote blameless postmortems.
Collaborate with the development teams to ensure that solutions are designed with customer experience, scalability, and performance in mind.
Analyze system performance and reliability, offering recommendations for enhancement.
Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services.
Participate in on-call rotations, responding to and mitigating service interruptions and technical challenges.

AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, Grafana, Prometheus, CI/CD, DevOps, Terraform

Related Articles

Why remote work is such a nice opportunity?

Posted about 1 month ago

Why is remote work so nice? Let's try to see!

Remote Job Certifications and Courses to Boost Your Career

Posted 8 months ago

Insights into the evolving landscape of remote work in 2024 reveal the importance of certifications and continuous learning. This article breaks down emerging trends, sought-after certifications, and provides practical solutions for enhancing your employability and expertise. What skills will be essential for remote job seekers, and how can you navigate this dynamic market to secure your dream role?

How to Balance Work and Life While Working Remotely

Posted 8 months ago

Explore the challenges and strategies of maintaining work-life balance while working remotely. Learn about unique aspects of remote work, associated challenges, historical context, and effective strategies to separate work and personal life.

Weekly Digest: Remote Jobs News and Trends (August 11 - August 18, 2024)

Posted 8 months ago

Google is gearing up to expand its remote job listings, promising more opportunities across various departments and regions. Find out how this move can benefit job seekers and impact the market.

How to Onboard Remote Employees Successfully

Posted 8 months ago

Learn about the importance of pre-onboarding preparation for remote employees, including checklist creation, documentation, tools and equipment setup, communication plans, and feedback strategies. Discover how proactive pre-onboarding can enhance job performance, increase retention rates, and foster a sense of belonging from day one.
