
Sr. Site Reliability Engineer

Posted 6 months ago


💎 Seniority level: Senior

📍 Location: United States

🏢 Company: ARFA Solutions, LLC

🪄 Skills: Leadership, Software Development, Agile, Frontend Development, HTML, CSS, Java, JavaScript, SCRUM, Communication Skills, Analytical Skills, Collaboration, CI/CD

Requirements:
  • Integrated with a scrum team 50% of the time
  • Monitoring performance and proactively collaborating with other SREs
  • Conducting initiatives not related to team work
  • Ensuring correct logging, keeping synthetics updated, and implementing fail safes
  • Troubleshooting front-end JavaScript performance issues
  • Formulating alerts based on log analysis
Responsibilities:
  • Design and implement highly available and scalable infrastructure solutions
  • Develop and maintain automated deployment, configuration, and monitoring processes
  • Collaborate with cross-functional teams to ensure the reliability, security, and performance of systems
  • Identify and resolve performance and availability issues through proactive monitoring and alerting
  • Participate in incident response and troubleshooting efforts
  • Implement and improve disaster recovery and business continuity strategies
  • Maintain documentation and keep up-to-date with industry best practices
  • Stay current with emerging technologies and trends in the field of SRE
  • Lead and mentor junior members of the SRE team

Related Jobs


📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl

👥 251-500

💰 $150,000,000 Series D almost 3 years ago

Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise scale continuous delivery environments
  • 5+ years of experience with a DevOps or SRE job title
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with configuration management tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
  • Experience with APM, observability, and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
  • Background in Linux Systems Engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

Posted 24 days ago
🔥 Sr. Site Reliability Engineer
Posted about 1 month ago

📍 U.S., EU

🧭 Full-Time

🔍 Software Development

🏢 Company: AuthZed

👥 11-50

💰 $12,000,000 Series A 9 months ago

Information Technology, Cyber Security, Software

  • Proven experience as a Site Reliability Engineer or in a similar role.
  • Strong understanding of networking, operating systems, and cloud infrastructure.
  • Experience with Site Reliability Engineering, System Design, and Distributed Computing.
  • Experience in various programming languages — we currently have SDKs for NodeJS, Java, Python, Ruby, and Go.
  • Experience with containerization technologies such as Docker and Kubernetes.
  • Knowledge of infrastructure-as-code tools like Terraform and Pulumi.
  • Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Experience with lower-level implementation details of relational databases (bonus if you have experience with distributed SQL databases like Google Cloud Spanner or CockroachDB).
  • Experience working with Git and GitHub.
  • Experience with continuous integration and deployment systems.
  • Strong problem-solving and troubleshooting skills.
  • Excellent communication and collaboration abilities.
  • Design, implement, and maintain highly available and scalable infrastructure solutions for our projects, products, and customers.
  • Monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
  • Automate infrastructure deployment and configuration management processes.
  • Continuously improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
  • Troubleshoot and resolve complex infrastructure and application issues in production and test environments.
  • Collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
  • Participate in on-call rotation and respond to production incidents in a timely manner.
  • Document system configurations, troubleshooting procedures, and operational guidelines.

Docker, Python, SQL, Cloud Computing, Git, Java, Kubernetes, Go, Grafana, Prometheus, CI/CD, Problem Solving, Linux, Terraform, Networking, Troubleshooting, NodeJS, Scripting
