Site Reliability Engineer

Posted 13 days ago

💎 Seniority level: Senior

🔍 Industry: Fintech

🏢 Company: Pleo | 👥 501-1000 | 💰 $42,922,001 Debt Financing 11 months ago | 🫂 Last layoff over 2 years ago | Mobile Payments, Financial Services, Payments, Information Technology, FinTech

🗣️ Languages: English

Requirements:
  • Experience solving complex technical challenges at scale.
  • Ensure a high bar for quality and reliability within your team.
  • Coach others to help them develop as engineers.
  • Advocate for a more thorough code review process than a quick scroll to the bottom of the page and a “LGTM!”
  • Help to design the overall solution.
  • Be sought after within your team for help in solving challenging problems.
  • Be a force multiplier within your team - your work enables other engineers to do even better.
  • Able to raise and describe technical debt faced by your team, and then propose a solution or path forward.
Responsibilities:
Not stated

Related Jobs

📍 USA

🧭 Full-Time

💸 140,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Juniper Square

Requirements:
  • 5+ years of experience in SRE, DevOps, or Infrastructure Engineering with a proven track record of ownership and initiative.
  • Strong experience with Kubernetes, Helm, and CNIs, including networking and security.
  • Proficiency in AWS services such as RDS, Aurora, IAM, VPC, EKS, and EC2.
  • Experience in PostgreSQL administration, including performance tuning and high availability in RDS/Aurora.
  • Hands-on experience with GitHub Actions and ArgoCD for secure and scalable CI/CD automation.
  • Strong background in Infrastructure as Code (IaC) with Crossplane and Terraform.
  • Deep understanding of observability and monitoring with Datadog.
  • Experience with Kyverno for Kubernetes policy-based security enforcement.
  • Proficiency in Python and Bash scripting for automation and system management.
  • Strong understanding of CI/CD security best practices and ability to implement controls for securing deployments.
Responsibilities:
  • Own reliability and scalability initiatives: identify, prioritize, and implement solutions before issues escalate.
  • Participate in an on-call rotation, responding to incidents, performing root cause analysis, and driving long-term fixes.
  • Design, deploy, and manage Kubernetes clusters using Helm charts, Cilium, and Karpenter to optimize performance and cost.
  • Architect and maintain AWS infrastructure with a focus on RDS/Aurora PostgreSQL, networking, and scaling best practices.
  • Implement GitHub Actions CI/CD pipelines, integrating security best practices and automation.
  • Define and enforce policy-based security for Kubernetes using Kyverno.
  • Automate infrastructure provisioning with Crossplane and Terraform to ensure consistency and scalability.
  • Enhance observability and monitoring using Datadog to proactively detect and resolve issues.
  • Improve security and reliability by identifying risks in CI/CD, cloud environments, and Kubernetes, then implementing necessary safeguards.
  • Lead post-incident reviews, drive lessons learned into long-term improvements, and document best practices in Confluence.

AWS, PostgreSQL, Python, Bash, Kubernetes, CI/CD, DevOps, Terraform

Posted about 2 hours ago

🏢 Company: ABBYY | 👥 1001-5000 | 💰 almost 4 years ago | Communications Infrastructure, Analytics, Data Visualization, Software

Requirements:
  • 1-3 years of experience in an Infrastructure, SRE, DevOps, or CloudOps role
  • Experience programming in one or more of the following: C#, Java, Python, .NET, Node.js, Go
  • Experience with Terraform, Ansible, or any similar infrastructure-as-code tool
  • Experience with at least one cloud technology (AWS or Azure), preferably Azure
  • Experience with cloud-performant microservices and event-driven architectures
  • Experience with Kubernetes administration is an added advantage
  • Understanding of information security concepts and terminology
  • Distributed monitoring experience: logging, metrics, tracing, etc.
  • Strong knowledge of software development methodologies and passion for creating high-standard tool sets for infrastructure-as-code
  • Ability to analyze problems quickly and find suitable solutions based on available resources
  • A proactive and open-minded individual with a clear client focus and structured approach
  • Experience in leading and managing a team
Responsibilities:
  • Co-own critical production service designs to ensure high reliability is achievable and measurable
  • Drive reliability and observability improvements in the services within the engineering verticals
  • Using monitoring and telemetry data, help teams make informed decisions on where reliability challenges may exist and help design and build solutions to improve them
  • Build and improve internal tools and automation software to make maintaining production services easier and safer
  • Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others
  • Develop infrastructure as code
  • Build SRE dashboards from SLIs to measure SLO adherence
  • Define (from design to implementation details) necessary auto-healing and fault-tolerant systems
  • Serve as the point of contact for production application issues, working closely with engineering leadership
Posted 1 day ago

🔍 Software Development

🏢 Company: ABBYY | 👥 1001-5000 | 💰 almost 4 years ago | Communications Infrastructure, Analytics, Data Visualization, Software

Requirements:
  • 3-6+ years of experience in an Infrastructure, SRE, DevOps, or CloudOps role
  • Experience programming in one or more of the following: C#, Java, Python, .NET, Node.js, Go
  • Experience with Terraform, Ansible, or any similar infrastructure-as-code tool
  • Experience with at least one cloud technology (AWS or Azure), preferably Azure
  • Experience with cloud-performant microservices and event-driven architectures
  • Experience with Kubernetes administration is an added advantage
  • Understanding of information security concepts and terminology
  • Distributed monitoring experience: logging, metrics, tracing, etc.
  • Strong knowledge of software development methodologies and passion for creating high-standard tool sets for infrastructure-as-code
  • Ability to analyze problems quickly and find suitable solutions based on available resources
  • A proactive and open-minded individual with a clear client focus and structured approach
  • Experience in leading and managing a team
Responsibilities:
  • Co-own critical production service designs to ensure high reliability is achievable and measurable
  • Drive reliability and observability improvements in the services within the engineering verticals
  • Using monitoring and telemetry data, help teams make informed decisions on where reliability challenges may exist and help design and build solutions to improve them
  • Build and improve internal tools and automation software to make maintaining production services easier and safer
  • Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others
  • Develop infrastructure as code
  • Build SRE dashboards from SLIs to measure SLO adherence
  • Define (from design to implementation details) necessary auto-healing and fault-tolerant systems
  • Serve as the point of contact for production application issues, working closely with engineering leadership
Posted 1 day ago

📍 Brazil, Mexico, Peru

🧭 Full-Time

🔍 Software Development

🏢 Company: Zipdev | 👥 11-50 | Web Development, Web Design, Software

Requirements:
  • 3-4 years of proven professional experience as a Site Reliability Engineer.
  • Experience with one or more general-purpose programming/scripting languages, including but not limited to Python, Bash, Perl, or Go.
  • Fundamental knowledge of technologies across a broad range of disciplines: virtualization, storage, networking, servers, and security.
  • Demonstrable knowledge of Unix, TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures.
  • Experience in analyzing logs and troubleshooting large-scale distributed systems.
Responsibilities:
  • Build systems and infrastructure to monitor complex, large-scale distributed systems.
  • Identify stability/performance issues and collaborate with developers to triage critical issues in production systems.
  • Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
  • Devise ways to actively monitor system throughput, capacity and reliability.
  • Ability to debug complex systems and evolve a running environment without downtime.
  • Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
  • Drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization.

AWS, Docker, Python, Bash, Cloud Computing, ElasticSearch, Kubernetes, *Nix, Zabbix, Algorithms, Data Structures, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Networking, Troubleshooting, JSON, Ansible, Scripting, Debugging

Posted 2 days ago

📍 United States

🧭 Full-Time

💸 72,700 - 145,400 USD per year

🔍 Software Development

🏢 Company: careers

Requirements:
  • 3 years of experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer
  • Experience with AWS, Azure, or GCP cloud infrastructure
  • Experience with PHP and JavaScript/TypeScript
  • Bachelor's degree in business / information technology / computer science, or equivalent qualifications or experience.
Responsibilities:
  • Be the escalation point for problems and incidents for our Customer Support teams.
  • Triage problems and incidents, and either resolve them, or escalate them to the technical specialists that can resolve them.
  • Internally communicate the status of problems and incidents.
  • Generate Root Cause Analysis (RCA) statements for internal and external use.
  • Champion corrective and preventative actions internally to ensure similar problems and incidents don't happen again.
  • Design, implement, maintain, and continuously improve our monitoring and alerting mechanisms.
  • Proactively explore and drive improvements to the overall quality and reliability of our software platform.
  • Measure and report on the overall quality of service of the software platform, including incidents, actions, and SLA metrics.

AWS, PHP, Cloud Computing, GCP, JavaScript, TypeScript, Azure, CI/CD, DevOps

Posted 3 days ago

🏢 Company: Betsson Group | 👥 1001-5000 | Internet, Gaming, Gambling, Online Games

Requirements:
  • Supporting and troubleshooting.
  • Using automation and configuration management tools (Octopus, TeamCity, Terraform)
  • AWS Cloud infrastructure, CDNs, and other various systems running in multiple data centres and environments
  • Cloud Application Load Balancer, preferably with experience on AWS ALB
  • Cloud DNS support such as AWS Route 53, GCP Cloud DNS, or Azure DNS
  • Serverless Computing such as AWS Lambda
  • Cloud Firewall such as AWS WAF
  • Server virtualisation such as VMware, IaaS and PaaS cloud such as AWS and Azure
  • Open-source monitoring and alerting tools (Prometheus, Loki, Grafana and Jaeger)
  • Scripting in Python, Bash, PowerShell or others
  • Microsoft SQL databases via Stored Procedures, Locking/Unlocking tables and running select statements to assess impact and diagnose problems
Responsibilities:
  • Being the first point of technical escalation of issues within our infrastructure both in cloud and on-prem.
  • Participating in stand-ups with the development teams and informing your squad of updates and changes to our platform.
  • Automating everything – Workflow and tool automation - such as deployments of distributed applications and infrastructure using various scripting languages to allow our 24/7 Incident Engineers to mitigate incidents without escalation.
  • Able to analyse, diagnose and solve issues in the production environment with minimal number of escalations to supporting 3rd Level support teams.
  • Participate in the Change Management process via review of RFCs to ensure “Definition of Done” as well as executing and supporting software and hardware deployments.
  • Developing and documenting ways of working between the LiveOps (NOC) team and the development teams to improve efficiencies in diagnostics and impact mitigation.
Posted 7 days ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

Requirements:
  • 5+ years of Ruby/Rails experience
  • 3+ years of AWS experience
  • Kubernetes experience
  • Experience with profiling and benchmarking source code
  • Effective at code review and at identifying potential performance problems before they reach production
  • Experience with Datadog or other APM tools
  • Excellent written and verbal communication skills
Responsibilities:
  • Proactively identify, triage, and resolve performance issues
  • Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
  • Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
  • Optimize performance through instance configuration and monitoring
  • Collaborate with other SREs to proactively identify and address performance bottlenecks
  • Lead database capacity planning and upgrade initiatives
  • Manage the database-specific components of disaster recovery planning and execution
  • Oversee backup systems and pre-production databases
  • Create and maintain infrastructure and operations documentation
  • Participate in the on-call rotation

AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

Posted 7 days ago

🏢 Company: DeepSource Technologies

Requirements:
  • Expertise in Google Cloud networking, Compute Engine, Kubernetes (GKE), Cloud Functions, and Cloud Storage.
  • Strong knowledge of Terraform, Ansible, or other Infrastructure as Code (IaC) tools.
  • Experience with Google Kubernetes Engine (GKE), microservices, and container orchestration.
  • Hands-on experience with FinOps tools and cost optimization strategies in cloud environments.
  • Familiarity with monitoring and logging solutions such as Google Operations Suite (formerly Stackdriver), Prometheus, Grafana.
  • Experience with CI/CD pipelines, automation, and GitOps best practices.
  • Strong understanding of SRE principles, SLAs, SLOs, and error budgets.
Responsibilities:
  • Manage and maintain GCP infrastructure, ensuring high availability, scalability, and system reliability.
  • Monitor and forecast resource utilization, performance trends, and infrastructure scaling needs to optimize cloud costs and efficiency.
  • Design and implement highly available, fault-tolerant, and resilient cloud architectures, leveraging Infrastructure as Code (IaC) tools such as Terraform and Ansible.
  • Utilize Google Cloud Monitoring, Cloud Logging, and third-party tools to proactively detect and resolve performance issues.
  • Analyze and optimize cloud spending, implement cost controls, recommend rightsizing strategies, and ensure efficient resource allocation.
  • Implement best practices for IAM, network security, encryption, and compliance frameworks (SOC2, ISO 27001, NIST).
  • Collaborate with DevOps teams to streamline deployment processes, automate workflows, and optimize application performance.
  • Design and implement disaster recovery (DR) plans, backup strategies, and failover mechanisms to ensure business continuity.
  • Maintain comprehensive documentation of infrastructure, best practices, and optimization strategies while working closely with cross-functional teams.
Posted 7 days ago

🧭 Full-Time

💸 159,000 - 215,000 USD per year

🔍 Software Development

🏢 Company: Kentik | 👥 101-250 | 💰 $40,000,000 Series C over 3 years ago | Cloud Data Services, Information Technology, Network Security, Software

Requirements:
  • 5+ years of experience in cloud-based Systems Administration, IT and/or SRE related projects
  • Strong experience with public cloud, container and orchestration technologies including AWS, GCP, Azure, Kubernetes, and Docker
  • Solid programming and automation skills (Bash, Python, Go) including experience working with configuration management (infrastructure as code) platforms such as Terraform, Ansible, and Puppet
  • Experience working with *nix system command line (e.g. ssh, grep, awk)
  • Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS)
  • Networking administration experience: concepts such as routing, firewalls (iptables), and peering sound familiar
  • A passion for documenting code, processes, and infrastructure in runbooks and wikis
  • Experience with metrics monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry
  • Experience creating and managing tickets with third party vendors and owning cloud vendor partner relationships
Responsibilities:
  • Ensure our real-time, scalable infrastructure is set up for growth and working efficiently.
  • Work on tools and processes to better monitor our platform and ensure its stability through our rapid growth
  • Deep-dive into diverse topics, from firewalls and IP routing, to database replication strategies or automating build processes
  • Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective
  • Assist with expanding our cloud deployments across the major cloud providers
  • Contribute code, code reviews and tools or patches to all kinds of existing code
  • Write design documents or collaborate on colleagues’ docs to introduce new features or changes into our infrastructure
  • Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team
Posted 8 days ago

📍 United Kingdom

🔍 Blockchain

🏢 Company: IO Global

Requirements:
  • Proficiency in Python, Bash, Terraform, and Nix for DevOps services.
  • Extensive experience with AWS, specifically with services like EKS and RDS.
  • Familiarity with container orchestration (e.g. Kubernetes) is essential.
  • Hands-on experience with PostgreSQL and its deployment on RDS.
  • Knowledge of monitoring tools (e.g., Prometheus, Grafana, Loki).
  • Solid troubleshooting and performance tuning capabilities.
  • Exceptional communication skills and team collaboration ethic.
  • Experience with CI/CD (e.g. GitHub Actions, Hydra, Earthly).
Responsibilities:
  • Design, write, and deliver tools and software primarily using Python, Bash, Terraform or Nix to improve the availability, scalability, and efficiency of our services.
  • Engage in and refine the whole lifecycle of services, from inception and design, through deployment, operation, and continuous improvement.
  • Practice sustainable incident response and promote blameless postmortems.
  • Collaborate with the development teams to ensure that solutions are designed with customer experience, scalability, and performance in mind.
  • Analyze system performance and reliability, offering recommendations for enhancement.
  • Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services.
  • Participate in on-call rotations, responding to and mitigating service interruptions and technical challenges.

AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, Grafana, Prometheus, CI/CD, DevOps, Terraform

Posted 8 days ago