Senior System Reliability Engineer

Lirio
Technology/Software, Healthcare
US, Full-Time, Senior
Salary: 130,000 - 150,000 USD per year

Job Details

Experience
5-7 years of related experience
Required Skills
AWS, Python, SQL, Git, Java, Kafka, Kubernetes, TypeScript, Azure, Groovy, Linux, Terraform, Datadog, CloudFormation

Requirements

  • 5-7 years of related experience
  • Bachelor's degree in a related field
  • Linux systems and networking fundamentals (DNS, TCP/IP, TLS)
  • Distributed systems debugging and failure analysis
  • Load, stress, and fault-injection testing
  • CI/CD tools and processes
  • Version control (e.g., Git)
  • Cloud platforms (e.g., AWS, Azure)
  • Containers and orchestration (Kubernetes)
  • Kafka (messaging/streaming)
  • Scripting and programming languages (e.g., Java, TypeScript, Groovy, Python)
  • Agile methodologies (e.g., Scrum, XP, SAFe)
  • Databases/SQL
  • Observability/monitoring tools (e.g., Datadog)

Responsibilities

  • Architect, implement, and maintain automated solutions for deployment, monitoring, alerting, and incident response using Lirio’s technology stack (AWS, Azure, Kubernetes, Kafka, Java, TypeScript, Groovy, Databases/SQL).
  • Develop and manage infrastructure as code (e.g., Terraform, AWS CloudFormation).
  • Build and optimize CI/CD pipelines for seamless, reliable delivery.
  • Define, implement, and continuously refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical services.
  • Identify and reduce operational toil through automation, platform improvements, and architectural changes.
  • Ensure high availability and scalability of services through proactive engineering, load testing, and capacity planning across multi-tenant and client-specific environments.
  • Review infrastructure changes, automation scripts, and reliability-impacting code changes to ensure production readiness.
  • Collaborate with software engineers to embed reliability, security, and operational best practices into development workflows.
  • Participate in a defined on-call rotation supporting production systems, with clear escalation paths and expectations.
  • Lead incident response, root cause analysis, and postmortems for production issues.
  • Mentor and coach engineers on reliability engineering principles, operational ownership, and incident response best practices.
  • Stay current with industry trends in reliability engineering, cloud operations, and automation.