Sr. Site Reliability Engineer

Posted 6 months ago

💎 Seniority level: Senior

📍 Location: United States

🪄 Skills: Leadership, Software Development, Agile, Frontend Development, HTML, CSS, Java, JavaScript, SCRUM, Communication Skills, Analytical Skills, Collaboration, CI/CD

Integrated with a scrum team 50% of the time
Monitoring performance and proactively collaborating with other SREs
Conducting initiatives not related to team work
Ensuring correct logging, keeping synthetics updated, and implementing fail safes
Troubleshooting front end JavaScript performance issues
Formulating alerts based on log analysis

Design and implement highly available and scalable infrastructure solutions
Develop and maintain automated deployment, configuration, and monitoring processes
Collaborate with cross-functional teams to ensure the reliability, security, and performance of systems
Identify and resolve performance and availability issues through proactive monitoring and alerting
Participate in incident response and troubleshooting efforts
Implement and improve disaster recovery and business continuity strategies
Maintain documentation and keep up-to-date with industry best practices
Stay current with emerging technologies and trends in the field of SRE
Lead and mentor junior members of the SRE team

Posted 24 days ago

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🔧 Requirements

Extensive experience with enterprise scale continuous delivery environments
5+ years of experience with a DevOps or SRE job title
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
Experience with APM, observability, and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
Background in Linux Systems Engineering
Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

Posted about 1 month ago

📍 U.S., EU

🧭 Full-Time

🔍 Software Development

🔧 Requirements

Proven experience as a Site Reliability Engineer or in a similar role.
Strong understanding of networking, operating systems, and cloud infrastructure.
Experience with Site Reliability Engineering, System Design, and Distributed Computing.
Experience in various programming languages — we currently have SDKs for NodeJS, Java, Python, Ruby, and Go.
Experience with containerization technologies such as Docker and Kubernetes.
Knowledge of infrastructure-as-code tools like Terraform and Pulumi.
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Experience with lower-level implementation details of relational databases (bonus if you have experience with distributed SQL databases like Google Cloud Spanner or CockroachDB).
Experience working with Git and GitHub.
Experience with continuous integration and deployment systems.
Strong problem-solving and troubleshooting skills.
Excellent communication and collaboration abilities.

💡 Responsibilities

Design, implement, and maintain highly available and scalable infrastructure solutions for our projects, products, and customers.
Monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
Automate infrastructure deployment and configuration management processes.
Continuously improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
Troubleshoot and resolve complex infrastructure and application issues in production and test environments.
Collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
Participate in on-call rotation and respond to production incidents in a timely manner.
Document system configurations, troubleshooting procedures, and operational guidelines.

Docker, Python, SQL, Cloud Computing, Git, Java, Kubernetes, Go, Grafana, Prometheus, CI/CD, Problem Solving, Linux, Terraform, Networking, Troubleshooting, NodeJS, Scripting
