Site Reliability Engineer

Posted about 5 hours ago

💎 Seniority level: Senior, 5+ years

📍 Location: United States

🔍 Industry: Software Development

🏢 Company: TCGPlayer_External_Career

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, Docker, Cloud Computing, GCP, Kubernetes, Azure, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

Requirements:

5+ years of experience in Site Reliability Engineering or related roles
Experience with an enterprise monitoring solution (New Relic, Scalyr, Datadog, etc.)
Experience managing Linux and/or Windows environments
Experience with IaaS and PaaS solutions (e.g. AWS, GCP, Azure)
Experience with Infrastructure as Code (Terraform or Helm)
Knowledge of Kubernetes / ECS orchestration, and containerization (e.g. Docker)
Demonstrable expertise around specifying, designing and/or implementing system health, performance monitoring tools and software management tools for 24x7 environments
Proficiency in writing code / scripts to automate tasks
Excellent critical thinking and problem-solving skills

Responsibilities:

Innovate, build, and evangelize the practice of site reliability so that TCGPlayer can deliver excellent customer experiences.
Define and measure key performance metrics, such as SLAs and Mean Time Between Failures (MTBF), using those metrics to identify trends and measure the impact on the business.
Develop and maintain up-to-date operational procedures, including runbooks, to adapt to evolving needs.
Anticipate system failures through practices like chaos engineering and tabletop exercises, and establish processes to learn from operational incidents.
Foster strong relationships within the team and across departments while cultivating a communicative, supportive, and results-oriented culture.

Related Jobs

🔥 Senior Site Reliability Engineer - Midnight

Posted about 14 hours ago

📍 United States

🔍 Blockchain

🏢 Company: IO Global

🔧 Requirements

7+ years of experience in SRE, DevOps, or a related role.
Understanding of SRE best practices, architectures, and methods.
Good knowledge of resiliency patterns and cloud security.
Strong programming proficiency in Python, Golang, or JavaScript.
Demonstrated experience with AWS and modern cloud architectures.
Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD.
Hands-on experience with Kubernetes/EKS and GitOps methodologies.
Proven track record with monitoring tools such as Prometheus and OpenTelemetry, and familiarity with the LGTM stack or comparable tools.
Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
Ability to engage in technical discussions and be part of the decision making process
Strong problem-solving skills and capability to work on complex systems
Experience in working within an Agile environment
Experience in working with a distributed team
Strong communication and collaboration abilities to work seamlessly across different teams.
A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.

💡 Responsibilities

Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
Leverage GitOps principles to automate deployments and manage container orchestration.
Implement and manage CI/CD pipelines that ensure seamless, high-quality deployments: find and remove bottlenecks, improve performance, and work alongside teams to refine feedback loops and automate away toil.
Develop automation tools and scripts to improve operational efficiency.
Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
Collaborate with dev teams to define and implement SLOs/SLIs
Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
Evaluate and adopt new technologies to keep our systems at the cutting edge; blockchain experience is a particular advantage for candidates.
Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
Balance effective delivery of goals with a measurably high standard for those goals. Always apply a layer of polish and due diligence when delivering.

AWS, Docker, Python, Agile, Blockchain, Cloud Computing, JavaScript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

🔥 Sr Site Reliability Engineer (SRE)

Posted 3 days ago

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl

👥 251-500

💰 $150,000,000 Series D almost 3 years ago

Real Time, Big Data, Information Technology, Software

🔧 Requirements

Extensive experience with enterprise scale continuous delivery environments
5+ years of experience with a DevOps or SRE job title
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
Background in Linux Systems Engineering
Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

🔥 Snr. Site Reliability Engineer (Remote)

Posted 13 days ago

📍 United States

🧭 Full-Time

💸 130,000 - 165,000 USD per year

🔍 Software Development

🏢 Company: KnowBe4

👥 1001-5000

💰 $300,000,000 Post-IPO Equity almost 2 years ago

Computer Security, Cyber Security, Network Security, Software

🔧 Requirements

BS/MS/Ph.D. or equivalent plus 5 years of experience
Proficient in authoring scripts in one or more programming languages (e.g. Python, Ruby, JavaScript).
Experience designing and operating high-scale patterns in AWS
Experience building and designing repeatable workflows for continuous integration and continuous deployment (CI/CD) - GitLab is preferred
Excellent communication skills
Able to manage your time effectively across competing projects
Ability to quickly understand and debug complex distributed systems

💡 Responsibilities

Work with other Site Reliability Engineers to build highly scalable and resilient applications and infrastructure in AWS
Maintain and improve extensible infrastructure-as-code using Terraform
Learn, maintain, and improve our existing deployment strategies
Deliver effective observability, monitoring, and alerting patterns for KnowBe4’s applications and infrastructure
Act as an escalation point for identifying and resolving the root cause of production incidents
Provide assistance designing globally distributed systems and processes for the organization
Identify deficiencies in our current applications and infrastructure and correct them when found
Define new approaches and tailored solutions to complex technical problems
Act as a project leader with other Site Reliability Engineers and ensure progress is communicated effectively to project stakeholders

AWS, Docker, Python, SQL, AWS EKS, Cloud Computing, DynamoDB, Kubernetes, Algorithms, Data Structures, REST API, Rust, CI/CD, Problem Solving, Linux, DevOps, Terraform, Microservices, Excellent communication skills, Scripting, Debugging

🔥 Site Reliability Engineer - Remote

Posted 15 days ago

📍 United States

🧭 Full-Time

💸 185,000 - 200,000 USD per year

🔍 Software Development

🔧 Requirements

Linux System Administration
Experience supporting production environments running Ruby on Rails applications.
Proficient with cloud platforms such as AWS, GCP, or Azure.
Experience with EC2, RDS, VPCs, and security groups is essential.
Ansible or equivalent experience for managing large fleets of EC2 or similar servers.
Expert in using Terraform for infrastructure as code.
Strong experience with Kubernetes and Docker, including deployment, scaling, and management of containerized applications.
Extensive experience with monitoring and observability tools like Datadog, Prometheus, Grafana, ELK stack, or Splunk.
Ability to work with other Engineering team members on troubleshooting, support, and projects for both production and lower-level environments.
Deep understanding of DevOps principles, practices, and tools to drive continuous improvement in the software development lifecycle.

💡 Responsibilities

Support our EC2 infrastructure to ensure it’s properly configured, reliable, and monitored, while also helping us modernize it towards more automation and containerization.
Build and maintain our Ansible (and legacy Puppet) configuration management, while helping us increase our automation and reduce toil.
Deploy, manage, and optimize Kubernetes clusters and containerized applications using Docker.
Implement best practices for container orchestration and management.
Develop and maintain comprehensive monitoring and observability solutions using Datadog.
Create, enhance, and maintain continuous integration and continuous deployment pipelines using GitLab CI.
Implement security best practices and ensure compliance with industry standards.
Work closely with development teams to ensure reliability and scalability of new features and services.
Provide technical support and guidance on infrastructure-related issues.
Participate in an on-call rotation to address production issues and collaborate in incident response efforts.

AWS, Docker, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Linux, DevOps, Terraform, Microservices, Ansible

🔥 Site Reliability Engineer - Assistant Vice President

Posted 17 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Financial Technology

🏢 Company: iCapital

👥 51-100

Business Intelligence

🔧 Requirements

5+ years of SRE or related experience with 3+ years in AWS
Strong experience with Kubernetes
Working knowledge of MongoDB, Postgres, DynamoDB
Experience defining and implementing SLOs/SLIs
Skills in IaC (Terraform preferred) and programming languages (Python, Ruby, Java)
Experience with modern observability practices (Prometheus, Grafana, etc.)
Strong incident response skills
Excellent problem-solving abilities

💡 Responsibilities

Design, implement, and maintain service level objectives (SLOs)
Develop observability strategies
Architect scalable infrastructure solutions
Drive automation initiatives
Champion reliability best practices
Design and operate the Kubernetes environment
Lead incident response and postmortems
Participate in on-call rotations

AWS, PostgreSQL, Python, DynamoDB, Kubernetes, MongoDB, Grafana, Prometheus, Terraform

🔥 Senior Site Reliability Engineer

Posted 21 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert

👥 11-50

💰 $20,149,993 Seed 8 months ago

Data Management, SaaS, Application Performance Management

🔧 Requirements

Not stated

💡 Responsibilities

Design, build, and maintain scalable and secure cloud infrastructure as code
Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
Enable cost transparency and optimize infrastructure spending
Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
Build and maintain robust CI/CD pipelines that accelerate time from code to customer
Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
Lead and continuously improve our Incident Management process
Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
Develop and maintain incident response playbooks and post-mortem practices

AWS, Docker, CI/CD, Linux, Terraform

🔥 Lead Site Reliability Engineer

Posted 22 days ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Neon Inc.

🔧 Requirements

2+ years in an Engineering Management role, plus 5+ years of hands-on coding experience.
Strong background in leading/building teams that build cloud services or platforms.
Proven ability to lead and scale distributed teams across multiple time zones.
Strong mentoring skills, high emotional intelligence, and exceptional prioritization abilities.
Experience planning, shipping, and iterating on complex infrastructure projects with predictability.
Cloud: Azure and/or AWS experience
Infrastructure: Kubernetes (multi-cluster, multi-cloud), Linux environments
Monitoring: Prometheus ecosystem (Grafana, Loki, Tempo, VictoriaMetrics)
Scalable & Repeatable Infrastructure: Focus on efficiency and automation
Debugging & Innovation: Love solving challenges with no easy answers
Native or near-native verbal and written skills.

💡 Responsibilities

Manage a high-performing distributed team (5+ engineers across the EU), creating a culture of growth, collaboration, and innovation.
Remove Roadblocks: Identify and eliminate obstacles to maximize productivity and efficiency.
Coach & Mentor: Spend significant time helping engineers grow, supporting career development, and evaluating performance.
Optimize & Scale: Work closely with tech leads and product managers to refine processes, tackle tech debt, and ensure fast, high-quality delivery.
Enhance Communication: Foster strong collaboration within the team and across departments.
Drive Strategic Impact: Align infrastructure projects with broader business goals to maximize effectiveness.
Maintain Reliability: Ensure a healthy and scalable on-call process for the team.
Grow the Team: Expand our impact by recruiting and hiring top-tier Software Engineers.

AWS, Kubernetes, Azure, Go, Grafana, Postgres, Prometheus, Linux

🔥 Site Reliability Engineer

Posted 24 days ago

📍 United States

🧭 Full-Time

🔍 AI Infrastructure

🏢 Company: Voltage Park

👥 1-10

💰 $500,000,000 over 1 year ago

Cloud Computing, Machine Learning

🔧 Requirements

8+ years working with Linux
5+ years experience with AWS
2+ years experience with Kubernetes
Experience with Terraform and Ansible
Experience with network attached storage management

💡 Responsibilities

Design and build new platforms
Deploy updates to support internal and customer use cases
Collaborate with network engineering, software development, and customer support
Participate in the SRE on-call rotation

AWS, Python, Bash, Kubernetes, Go, Prometheus, Linux, Terraform, Networking, Ansible

🔥 Senior Site Reliability Engineer

Posted 26 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: AssuredCloud Data Services B2B Cloud Security Cyber Security

🔧 Requirements