Senior Reliability Engineer
Posted 4 months ago
Requirements:
- Operationally focused with expertise in incident management and live production issue resolution.
- Strong debugging and troubleshooting skills, particularly in performance optimization of large-scale applications.
- Proven experience in building and maintaining monitoring and alerting systems.
- 7+ years of experience with .NET Framework (C#), with a focus on production stability.
- Strong knowledge of Kubernetes, Docker, and cloud platforms like GCP.
- Proficiency with monitoring tools such as Prometheus, Grafana, and Kibana.
- Experience with incident ticketing and documentation tools such as Freshdesk and Confluence.
- Critical thinking skills to identify system weaknesses and devise innovative solutions.
- Strong project management skills focused on scalability and stability.
- ITIL Service Management certification (or equivalent) is highly desired.
- Experience with Power BI, web scraping, or Golang is a plus.
Responsibilities:
- Provide live operational support for multiple client software applications, ensuring rapid restoration of services.
- Develop and maintain code to quickly resolve production issues.
- Own and resolve incidents, adhering to client SLA and internal SLO timelines.
- Troubleshoot complex incidents and implement solutions to prevent recurrence.
- Utilize data-driven approaches to prepare detailed analyses and reports.
- Conduct deep technical analyses of product deficiencies and address client pain points.
- Develop monitoring systems and implement robust alert mechanisms.
- Provide guidance on improving operational system stability.
- Lead initiatives that automate processes for operational efficiency.
- Facilitate postmortem meetings following incidents.
- Collaborate with cross-functional teams for rapid resolution of production issues.
- Lead and motivate project teams to ensure quality standards.
- Mentor reliability engineers and track their progress.
- Participate in after-hours on-call support.