Senior System Reliability Engineer
US, Full-Time, Senior
Salary: 130,000 - 150,000 USD per year
Job Details
- Experience: 5-7 years related experience
- Required Skills: AWS, Python, SQL, Git, Java, Kafka, Kubernetes, TypeScript, Azure, Grafana, Groovy, Prometheus, Linux, Terraform, Datadog, CloudFormation
Requirements
- 5-7 years related experience
- Bachelor's Degree in related field
- Linux systems and networking fundamentals (DNS, TCP/IP, TLS)
- Distributed systems debugging and failure analysis
- Load, stress, and fault-injection testing
- CI/CD tools and processes
- Version control (e.g., Git)
- Cloud platforms (e.g., AWS, Azure)
- Containers and orchestration (Kubernetes)
- Kafka (messaging/streaming)
- Scripting and programming languages (e.g., Java, TypeScript, Groovy, Python)
- Agile methodologies (e.g., Scrum, XP, SAFe)
- Databases/SQL
- Observability/monitoring tools (Datadog)
Responsibilities
- Architect, implement, and maintain automated solutions for deployment, monitoring, alerting, and incident response using Lirio’s technology stack (AWS, Azure, Kubernetes, Kafka, Java, TypeScript, Groovy, Databases/SQL).
- Develop and manage infrastructure as code (e.g., Terraform, AWS CloudFormation).
- Build and optimize CI/CD pipelines for seamless, reliable delivery.
- Define, implement, and continuously refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical services.
- Monitor system health using modern observability tools (e.g., Prometheus, Grafana, Datadog).
- Participate in a defined on-call rotation supporting production systems, with clear escalation paths and expectations.
- Lead incident response, root cause analysis, and postmortems for production issues.
- Mentor and coach engineers on reliability engineering principles, operational ownership, and incident response best practices.
- Review infrastructure changes, automation scripts, and reliability-impacting code changes to ensure production readiness.
- Collaborate with software engineers to embed reliability, security, and operational best practices into development workflows.