Software Engineer, Site Reliability (Senior or Staff)

Posted 5 days ago

💎 Seniority level: Senior, 10-12+ years

📍 Location: United States, Canada

🔍 Industry: Software Development

🏢 Company: BioRender | 👥 101-250 | 💰 $15,319,133 Series A (almost 2 years ago) | Life Science Graphic Design Software

🗣️ Languages: English

⏳ Experience: 10-12+ years

🪄 Skills: AWS, Docker, Leadership, Python, SQL, Bash, Cloud Computing, JavaScript, Kubernetes, Software Architecture, TypeScript, Go, Grafana, Prometheus, REST API, Communication Skills, Analytical Skills, CI/CD, Problem Solving, Agile methodologies, RESTful APIs, Mentoring, Linux, DevOps, Terraform, Written communication, Microservices, Networking, Adaptability, Teamwork, Troubleshooting, Active listening, Strong work ethic, Strong communication skills, Ansible, Software Engineering, Debugging

Requirements:

10-12+ years of experience in the software/DevOps/SRE realm
Strong programming skills in two or more of these languages: JavaScript, TypeScript, Python, Go
Ability to troubleshoot complex distributed systems at scale
Experience with database performance monitoring and best practices
Comfortable innovating and establishing new practices, processes, and tooling
Strong analytical, system design, and architecture skills for cloud applications
CI/CD, configuration management, monitoring, and automation expertise
Advanced knowledge of observability and best practices (ELK, Datadog, OpenTelemetry, Prometheus, Grafana)
Deployment and orchestration via AWS ECS, Kubernetes (k8s), Cloud Run, etc.
Understanding of Linux, virtualization, networking, VPCs, firewalls, security groups
Hands-on knowledge of AWS and resources provisioning via CLI/API/IaC
Bachelor's degree in Computer Science or a similar technical field of study, or equivalent practical experience

Responsibilities:

Enhance Platform Resilience: Continuously seek ways to improve the platform's reliability, scalability, and release efficiency.
Develop Robust Observability and Monitoring Solutions: Define, build, deploy, maintain, and extend advanced observability and monitoring tools to bolster system reliability and availability.
Define and Monitor Performance Metrics: Play a key role in formulating and tracking Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to establish precise benchmarks for system performance.
Solve Complex Issues and Conduct Root Cause Analysis: Swiftly respond to escalated incidents, troubleshoot intricate system and application problems, and conduct thorough root cause analyses to implement corrective measures.
Thought Leadership and Innovation: Stay up to date with the latest industry trends and emerging technologies, and iterate on best practices to increase the quality and velocity of development and deliverables.
Architect Scalable and Reliable Systems: Lead in the design and architecture of scalable, distributed, fault-tolerant systems that uphold performance and reliability standards.
Mentorship and Evangelism: Champion the adoption of new technologies, disseminate best practices, and advocate for architectural patterns. Mentor and guide fellow engineers in the organization.
