Sr. Site Reliability Engineer

Posted 15 days ago

💎 Seniority level: Middle, 4+ years

📍 Location: United States

💸 Salary: 138,380 - 284,900 USD per year

🔍 Industry: Software Development

🗣️ Languages: English

⏳ Experience: 4+ years

🪄 Skills: Docker, Python, SQL, Cloud Computing, ElasticSearch, Hadoop, Kafka, Kubernetes, MySQL, Nginx, Go, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Ansible, Scripting

Requirements:

4+ years of industry experience building and operating large-scale, high-performance distributed systems
Experience programming with Python or Go
Strong knowledge of Linux/Unix/BSD internals and experience working with open source software (e.g. MySQL, Hadoop, Envoy, HAProxy, Nginx)
Experience with technologies such as ElasticSearch, ZooKeeper, HBase, Hadoop, Memcache and Kafka with a focus on reliability, automation, operability and performance
Infrastructure as code is a plus (e.g. Terraform, Puppet, Chef, Ansible, Salt, Fabric, Docker, etc.)
Bonus points if experienced with deploying web apps to cloud infrastructure (AWS, etc.) and working with distributed, service-oriented architecture

Responsibilities:

Develop software solutions to enable reliability and operability of large scale distributed systems handling petabytes of data and serving
Build a deep understanding of how Pinterest’s systems behave, scale, interact and fail, and use that insight to identify risks and opportunities for remediation
Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes and best practices to be used across Pinterest Engineering
Build meaningful, insightful and actionable SLIs
Automate critical portions of Pinterest’s engineering processes, to minimize risk and maximize the speed of innovation
Manage capacity and performance to help scale our infrastructure both on public and private clouds around the world

Related Jobs

🔥 Sr Site Reliability Engineer (SRE)

Posted 4 days ago

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D almost 3 years ago | Real Time, Big Data, Information Technology, Software

🔧 Requirements

Extensive experience with enterprise-scale continuous delivery environments
5+ years of experience with a DevOps or SRE job title
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
Background in Linux Systems Engineering
Experience with incident response tools, for instance PagerDuty, FireHydrant, Blameless, etc.

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

🔥 Sr. Site Reliability Engineer

Posted 13 days ago

📍 U.S., EU

🧭 Full-Time

🔍 Software Development

🏢 Company: AuthZed | 👥 11-50 | 💰 $12,000,000 Series A 9 months ago | Information Technology, Cyber Security, Software

🔧 Requirements

Proven experience as a Site Reliability Engineer or in a similar role.
Strong understanding of networking, operating systems, and cloud infrastructure.
Experience with Site Reliability Engineering, System Design, and Distributed Computing.
Experience in various programming languages (we currently have SDKs for NodeJS, Java, Python, Ruby, and Go).
Experience with containerization technologies such as Docker and Kubernetes.
Knowledge of infrastructure-as-code tools like Terraform and Pulumi.
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Experience with lower-level implementation details of relational databases (bonus if you have experience with distributed SQL databases like Google Cloud Spanner or CockroachDB).
Experience working with Git and GitHub.
Experience with continuous integration and deployment systems.
Strong problem-solving and troubleshooting skills.
Excellent communication and collaboration abilities.

💡 Responsibilities

Design, implement, and maintain highly available and scalable infrastructure solutions for our projects, products, and customers.
Monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
Automate infrastructure deployment and configuration management processes.
Continuously improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
Troubleshoot and resolve complex infrastructure and application issues in production and test environments.
Collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
Participate in on-call rotation and respond to production incidents in a timely manner.
Document system configurations, troubleshooting procedures, and operational guidelines.

Docker, Python, SQL, Cloud Computing, Git, Java, Kubernetes, Go, Grafana, Prometheus, CI/CD, Problem Solving, Linux, Terraform, Networking, Troubleshooting, NodeJS, Scripting

🔥 Sr. Site Reliability Engineer II - Network, Infrastructure Services

Posted about 1 month ago

📍 United States

🧭 Full-Time

💸 170,000 - 240,000 USD per year

🔍 Software Development

🏢 Company: HashiCorp | 👥 1001-5000 | 💰 Secondary Market about 4 years ago | 🫂 Last layoff almost 2 years ago | Private Cloud, DevOps, Information Technology, Cyber Security, Software, Cloud Infrastructure

🔧 Requirements

Extensive experience in AWS networking (VPCs, Transit Gateway, PrivateLink, Route 53, ALBs/NLBs)
Proficient in service networking and service discovery using Consul
Experience implementing network security best practices
Experienced in network automation and infrastructure as code using Terraform
Solid background in troubleshooting and optimizing network performance
Track record of leading large-scale AWS networking initiatives
Ability to identify issues and make data-driven decisions
Enthusiastic about mentorship in a distributed team environment
Experience working in a high-scale, AWS-native environment

💡 Responsibilities

Design, implement, and maintain AWS networking infrastructure
Lead large-scale networking projects
Define and implement network reliability strategies
Optimize AWS networking configurations
Improve service networking and service discovery
Enhance network security
Develop automation and infrastructure-as-code workflows
Identify, diagnose, and address complex network issues
Provide technical leadership and mentorship
Advocate for improvements to technical roadmaps
Stay current with AWS networking advancements
Participate in technical hiring efforts

AWS, Terraform, Networking

🔥 Sr Site Reliability Engineer, Platform Engineering

Posted 3 months ago

📍 United States

🧭 Full-Time

💸 186,000 - 251,000 USD per year

🔍 Network observability

🏢 Company: Kentik | 👥 101-250 | 💰 $40,000,000 Series C over 3 years ago | Cloud Data Services, Information Technology, Network Security, Software

🔧 Requirements

5+ years of experience in Systems Administration, Datacenter/IT and/or SRE related projects.
Experience with *nix system command line (e.g., ssh, grep, awk).
Detailed understanding of major internet protocols such as TCP/IP, DNS, HTTP, and TLS.
Experience with or desire to learn about microservices, containers, and orchestration.
Networking administration experience with concepts like routing and firewalls.
Passion for documenting code, processes, and infrastructure in runbooks and wikis.
Strong collaboration and communication skills for a fully remote environment.
Experience with configuration management (e.g., Ansible, Puppet, Chef).
Familiarity with metrics monitoring solutions (e.g., Grafana, Prometheus).
Automation skills in coding languages like Bash, Python, Ruby, or Go.
Experience with public cloud services (AWS, GCP, Azure) and Terraform.

💡 Responsibilities

Ensure our real-time, scalable, microservices-based infrastructure is set up for growth and working efficiently.
Work on tools and processes to better monitor our platform and ensure its stability through rapid growth.
Deep dive into diverse topics including NetFlow, IP routing, database replication strategies, or HTTP optimization.
Collaborate with engineering and infrastructure teams on operational solutions.
Contribute code, engage in code reviews, and write design documents for new features or changes.
Provide valuable feedback on team goals, projects, and processes for continuous improvement.

Docker, Python, Bash, Cloud Computing, Kafka, *Nix, Go, Grafana, gRPC, Postgres, Prometheus, Redis, Terraform, Microservices

🔥 Sr. Site Reliability Engineer

Posted 5 months ago

📍 United States

🏢 Company: ARFA Solutions, LLC

🔧 Requirements

Integrated with scrum team 50% of time
Monitoring performance and proactively collaborating with other SREs
Conducting initiatives not related to team work
Ensuring correct logging, keeping synthetics updated, and implementing fail-safes
Troubleshooting front-end JavaScript performance issues
Formulating alerts based on log analysis

💡 Responsibilities

Design and implement highly available and scalable infrastructure solutions
Develop and maintain automated deployment, configuration, and monitoring processes
Collaborate with cross-functional teams to ensure the reliability, security, and performance of systems
Identify and resolve performance and availability issues through proactive monitoring and alerting
Participate in incident response and troubleshooting efforts
Implement and improve disaster recovery and business continuity strategies
Maintain documentation and keep up-to-date with industry best practices
Stay current with emerging technologies and trends in the field of SRE
Lead and mentor junior members of the SRE team

Leadership, Software Development, Agile, Frontend Development, HTML, CSS, Java, JavaScript, SCRUM, Communication Skills, Analytical Skills, Collaboration, CI/CD
