Sr. Site Reliability Engineer

Posted 15 days ago

💎 Seniority level: Middle, 4+ years

📍 Location: United States

💸 Salary: 138,380 - 284,900 USD per year

🔍 Industry: Software Development

🗣️ Languages: English

⏳ Experience: 4+ years

🪄 Skills: Docker, Python, SQL, Cloud Computing, ElasticSearch, Hadoop, Kafka, Kubernetes, MySQL, Nginx, Go, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Ansible, Scripting

Requirements:
  • 4+ years of industry experience building and operating large-scale, high-performance distributed systems
  • Experience programming with Python or Go
  • Strong knowledge of Linux/Unix/BSD internals and experience working with open source software (e.g. MySQL, Hadoop, Envoy, HAProxy, Nginx)
  • Experience with technologies such as ElasticSearch, ZooKeeper, HBase, Hadoop, Memcache and Kafka with a focus on reliability, automation, operability and performance
  • Infrastructure-as-code experience is a plus (e.g. Terraform, Puppet, Chef, Ansible, Salt, Fabric, Docker)
  • Bonus points if experienced with deploying web apps to cloud infrastructure (AWS, etc.) and working with distributed, service-oriented architecture
Responsibilities:
  • Develop software solutions to enable reliability and operability of large scale distributed systems handling petabytes of data and serving
  • Build a deep understanding of how Pinterest’s systems behave, scale, interact and fail, and use that insight to identify risks and opportunities for remediation
  • Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes and best practices to be used across Pinterest Engineering
  • Build meaningful, insightful and actionable SLIs
  • Automate critical portions of Pinterest’s engineering processes, to minimize risk and maximize the speed of innovation
  • Manage capacity and performance to help scale our infrastructure both on public and private clouds around the world

Related Jobs

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D almost 3 years ago | Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise scale continuous delivery environments
  • 5+ years of experience with a DevOps or SRE job title
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
  • Experience with APM, observability, and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
  • Background in Linux Systems Engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil through creative innovation and automation
  • On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

Posted 4 days ago

📍 U.S., EU

🧭 Full-Time

🔍 Software Development

🏢 Company: AuthZed | 👥 11-50 | 💰 $12,000,000 Series A 9 months ago | Information Technology, Cyber Security, Software

  • Proven experience as a Site Reliability Engineer or in a similar role.
  • Strong understanding of networking, operating systems, and cloud infrastructure.
  • Experience with Site Reliability Engineering, System Design, and Distributed Computing.
  • Experience in various programming languages — we currently have SDKs for NodeJS, Java, Python, Ruby, and Go.
  • Experience with containerization technologies such as Docker and Kubernetes.
  • Knowledge of infrastructure-as-code tools like Terraform and Pulumi.
  • Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Experience with lower-level implementation details of relational databases (bonus if you have experience with distributed SQL databases like Google Cloud Spanner or CockroachDB).
  • Experience working with Git and GitHub.
  • Experience with continuous integration and deployment systems.
  • Strong problem-solving and troubleshooting skills.
  • Excellent communication and collaboration abilities.
  • Design, implement, and maintain highly available and scalable infrastructure solutions for our projects, products, and customers.
  • Monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
  • Automate infrastructure deployment and configuration management processes.
  • Continuously improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
  • Troubleshoot and resolve complex infrastructure and application issues in production and test environments.
  • Collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
  • Participate in on-call rotation and respond to production incidents in a timely manner.
  • Document system configurations, troubleshooting procedures, and operational guidelines.

Docker, Python, SQL, Cloud Computing, Git, Java, Kubernetes, Go, Grafana, Prometheus, CI/CD, Problem Solving, Linux, Terraform, Networking, Troubleshooting, NodeJS, Scripting

Posted 13 days ago

📍 United States

🧭 Full-Time

💸 170,000 - 240,000 USD per year

🔍 Software Development

🏢 Company: HashiCorp | 👥 1001-5000 | 💰 Secondary Market about 4 years ago | 🫂 Last layoff almost 2 years ago | Private Cloud, DevOps, Information Technology, Cyber Security, Software, Cloud Infrastructure

  • Extensive experience in AWS networking (VPCs, Transit Gateway, PrivateLink, Route 53, ALBs/NLBs)
  • Proficient in service networking and service discovery using Consul
  • Experience implementing network security best practices
  • Experienced in network automation and infrastructure as code using Terraform
  • Solid background in troubleshooting and optimizing network performance
  • Track record of leading large-scale AWS networking initiatives
  • Ability to identify issues and make data-driven decisions
  • Enthusiastic about mentorship in a distributed team environment
  • Experience working in a high-scale, AWS-native environment
  • Design, implement, and maintain AWS networking infrastructure
  • Lead large-scale networking projects
  • Define and implement network reliability strategies
  • Optimize AWS networking configurations
  • Improve service networking and service discovery
  • Enhance network security
  • Develop automation and infrastructure-as-code workflows
  • Identify, diagnose, and address complex network issues
  • Provide technical leadership and mentorship
  • Advocate for improvements to technical roadmaps
  • Stay current with AWS networking advancements
  • Participate in technical hiring efforts

AWS, Terraform, Networking

Posted about 1 month ago

📍 United States

🧭 Full-Time

💸 186,000 - 251,000 USD per year

🔍 Network observability

🏢 Company: Kentik | 👥 101-250 | 💰 $40,000,000 Series C over 3 years ago | Cloud Data Services, Information Technology, Network Security, Software

  • 5+ years of experience in Systems Administration, Datacenter/IT and/or SRE related projects.
  • Experience with *nix system command line (e.g., ssh, grep, awk).
  • Detailed understanding of major internet protocols such as TCP/IP, DNS, HTTP, and TLS.
  • Experience with or desire to learn about microservices, containers, and orchestration.
  • Network administration experience, including concepts like routing and firewalls.
  • Passion for documenting code, processes, and infrastructure in runbooks and wikis.
  • Strong collaboration and communication skills for a fully remote environment.
  • Experience with configuration management (e.g., Ansible, Puppet, Chef).
  • Familiarity with metrics monitoring solutions (e.g., Grafana, Prometheus).
  • Automation skills in coding languages like Bash, Python, Ruby, or Go.
  • Experience with public cloud services (AWS, GCP, Azure) and Terraform.
  • Ensure our real-time, scalable, microservices-based infrastructure is set up for growth and working efficiently.
  • Work on tools and processes to better monitor our platform and ensure its stability through rapid growth.
  • Deep dive into diverse topics including NetFlow, IP routing, database replication strategies, or HTTP optimization.
  • Collaborate with engineering and infrastructure teams on operational solutions.
  • Contribute code, engage in code reviews, and write design documents for new features or changes.
  • Provide valuable feedback on team goals, projects, and processes for continuous improvement.

Docker, Python, Bash, Cloud Computing, Kafka, *Nix, Go, Grafana, gRPC, Postgres, Prometheus, Redis, Terraform, Microservices

Posted 3 months ago

📍 United States

🏢 Company: ARFA Solutions, LLC

  • Integrated with scrum team 50% of time
  • Monitoring performance and proactively collaborating with other SREs
  • Conducting initiatives outside regular team work
  • Ensuring correct logging, keeping synthetics updated, and implementing fail-safes
  • Troubleshooting front-end JavaScript performance issues
  • Formulating alerts based on log analysis
  • Design and implement highly available and scalable infrastructure solutions
  • Develop and maintain automated deployment, configuration, and monitoring processes
  • Collaborate with cross-functional teams to ensure the reliability, security, and performance of systems
  • Identify and resolve performance and availability issues through proactive monitoring and alerting
  • Participate in incident response and troubleshooting efforts
  • Implement and improve disaster recovery and business continuity strategies
  • Maintain documentation and keep up-to-date with industry best practices
  • Stay current with emerging technologies and trends in the field of SRE
  • Lead and mentor junior members of the SRE team

Leadership, Software Development, Agile, Frontend Development, HTML, CSS, Java, JavaScript, SCRUM, Communication Skills, Analytical Skills, Collaboration, CI/CD

Posted 5 months ago