Site Reliability Engineer

Posted about 5 hours ago

💎 Seniority level: Senior, 5+ years

📍 Location: United States

🔍 Industry: Software Development

🏢 Company: TCGPlayer_External_Career

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, Docker, Cloud Computing, GCP, Kubernetes, Azure, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

Requirements:
  • 5+ years of experience in Site Reliability Engineering or related roles
  • Experience with an enterprise monitoring solution (New Relic, Scalyr, Datadog, etc.)
  • Experience managing Linux and/or Windows environments
  • Experience with IaaS and PaaS solutions (e.g. AWS, GCP, Azure)
  • Experience with Infrastructure as Code (Terraform or Helm)
  • Knowledge of Kubernetes / ECS orchestration, and containerization (e.g. Docker)
  • Demonstrable expertise around specifying, designing and/or implementing system health, performance monitoring tools and software management tools for 24x7 environments
  • Proficiency in writing code / scripts to automate tasks
  • Excellent critical thinking and problem-solving skills
Responsibilities:
  • Innovate, build, and evangelize the practice of site reliability so that TCGPlayer can deliver excellent customer experiences.
  • Define and measure key performance metrics, such as SLAs and Mean Time Between Failures (MTBF), using those metrics to identify trends and measure the impact on the business.
  • Develop and maintain up-to-date operational procedures, including runbooks, to adapt to evolving needs.
  • Anticipate system failures through practices like chaos engineering and tabletop exercises, and establish processes to learn from operational incidents.
  • Foster strong relationships within the team and across departments while cultivating a communicative, supportive, and results-oriented culture.
Related Jobs

📍 United States

🔍 Blockchain

🏢 Company: IO Global

  • 7+ years of experience in SRE, DevOps, or a related role.
  • Understanding of SRE best practices, architectures, and methods.
  • Good knowledge of resiliency patterns and cloud security.
  • Strong programming proficiency in Python, Golang, or JavaScript.
  • Demonstrated experience with AWS and modern cloud architectures.
  • Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD
  • Hands-on experience with Kubernetes/EKS and GitOps methodologies.
  • Proven track record with monitoring tools such as Prometheus, OpenTelemetry, as well as familiarity with the LGTM stack, or other comparable tools
  • Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
  • Ability to engage in technical discussions and be part of the decision making process
  • Strong problem-solving skills and capability to work on complex systems
  • Experience in working within an Agile environment
  • Experience in working with a distributed team
  • Strong communication and collaboration abilities to work seamlessly across different teams.
  • A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.
  • Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
  • Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
  • Leverage GitOps principles to automate deployments and manage container orchestration.
  • Implement and manage CI/CD pipelines ensuring seamless, high-quality deployments, finding and removing bottlenecks, improving performance and working alongside teams to refine feedback loops and automate toil away.
  • Develop automation tools and scripts to improve operational efficiency.
  • Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
  • Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
  • Collaborate with dev teams to define and implement SLOs/SLIs
  • Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
  • Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
  • Evaluate and adopt new technologies, with a special advantage for candidates with blockchain experience, to keep our systems at the cutting edge.
  • Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
  • Balance effective delivery of goals with a measurably high standard for the results. Always apply a layer of polish and due diligence when delivering.

AWS, Docker, Python, Agile, Blockchain, Cloud Computing, Javascript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

Posted about 14 hours ago

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl 👥 251-500 💰 $150,000,000 Series D almost 3 years ago | Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise scale continuous delivery environments
  • 5+ years of experience with a DevOps or SRE job title
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
  • Experience with APM, observability, and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
  • Background in Linux Systems Engineering
  • Experience with incident response tools, for instance PagerDuty, FireHydrant, Blameless, etc.
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, Javascript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

Posted 3 days ago

📍 United States

🧭 Full-Time

💸 130,000 - 165,000 USD per year

🔍 Software Development

🏒 Company: KnowBe4πŸ‘₯ 1001-5000πŸ’° $300,000,000 Post-IPO Equity almost 2 years agoComputerSecurityCyber SecurityNetwork SecuritySoftware

  • BS/MS/Ph.D. or equivalent plus 5 years of experience
  • Proficient in authoring scripts in one or more programming languages (e.g. Python, Ruby, JavaScript).
  • Experience designing and operating high-scale patterns in AWS
  • Experience building and designing repeatable workflows for continuous integration and continuous deployment (CI/CD) - GitLab is preferred
  • Excellent communication skills
  • Effectively able to self-manage your time across competing projects
  • Ability to quickly understand and debug complex distributed systems
  • Work with other Site Reliability Engineers to build highly scalable and resilient applications and infrastructure in AWS
  • Maintain and improve extensible infrastructure-as-code using Terraform
  • Learn, maintain, and improve our existing deployment strategies
  • Deliver effective observability, monitoring, and alerting patterns for KnowBe4's applications and infrastructure
  • Act as an escalation point for identifying and resolving the root cause for production incidents
  • Provide assistance designing globally distributed systems and processes for the organization
  • Identify deficiencies in our current applications and infrastructure and correct them when found
  • Define new approaches and tailored solutions to complex technical problems
  • Act as a project leader with other Site Reliability Engineers and ensure progress is communicated effectively to project stakeholders

AWS, Docker, Python, SQL, AWS EKS, Cloud Computing, DynamoDB, Kubernetes, Algorithms, Data Structures, REST API, Rust, CI/CD, Problem Solving, Linux, DevOps, Terraform, Microservices, Excellent communication skills, Scripting, Debugging

Posted 13 days ago

📍 United States

🧭 Full-Time

💸 185,000 - 200,000 USD per year

🔍 Software Development

  • Linux System Administration
  • Experience supporting production environments running Ruby on Rails applications.
  • Proficient with cloud platforms such as AWS, GCP, or Azure.
  • Experience with EC2, RDS, VPCs, and security groups is essential.
  • Ansible or equivalent experience for managing large fleets of EC2 or similar servers.
  • Expert in using Terraform for infrastructure as code.
  • Strong experience with Kubernetes and Docker, including deployment, scaling, and management of containerized applications.
  • Extensive experience with monitoring and observability tools like Datadog, Prometheus, Grafana, ELK stack, or Splunk.
  • Ability to work with other Engineering team members on troubleshooting, support, and projects both for Production and lower level environments.
  • Deep understanding of DevOps principles, practices, and tools to drive continuous improvement in the software development lifecycle.
  • Support our EC2 infrastructure to ensure it's properly configured, reliable, and monitored, while also helping us modernize it towards more automation and containerization.
  • Build and maintain our Ansible (and legacy Puppet) configuration management, while helping us increase our automation and reduce toil.
  • Deploy, manage, and optimize Kubernetes clusters and containerized applications using Docker.
  • Implement best practices for container orchestration and management.
  • Develop and maintain comprehensive monitoring and observability solutions using Datadog.
  • Create, enhance, and maintain continuous integration and continuous deployment pipelines using GitLab CI.
  • Implement security best practices and ensure compliance with industry standards.
  • Work closely with development teams to ensure reliability and scalability of new features and services.
  • Provide technical support and guidance on infrastructure-related issues.
  • Participate in an on-call rotation to address production issues and collaborate in incident response efforts.

AWS, Docker, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Linux, DevOps, Terraform, Microservices, Ansible

Posted 15 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Financial Technology

🏢 Company: iCapital 👥 51-100 | Business Intelligence

  • 5+ years of SRE or related experience with 3+ years in AWS
  • Strong experience with Kubernetes
  • Working knowledge of MongoDB, Postgres, DynamoDB
  • Experience defining and implementing SLOs/SLIs
  • Skills in IaC (Terraform preferred) and programming languages (Python, Ruby, Java)
  • Experience with modern observability practices (Prometheus, Grafana, etc.)
  • Strong incident response skills
  • Excellent problem-solving abilities
  • Design, implement, and maintain service level objectives (SLOs)
  • Develop observability strategies
  • Architect scalable infrastructure solutions
  • Drive automation initiatives
  • Champion reliability best practices
  • Design and operate Kubernetes environment
  • Lead incident response and postmortems
  • Participate in on-call rotations

AWS, PostgreSQL, Python, DynamoDB, Kubernetes, MongoDB, Grafana, Prometheus, Terraform

Posted 17 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert 👥 11-50 💰 $20,149,993 Seed 8 months ago | Data Management, SaaS, Application Performance Management

NOT STATED
  • Design, build, and maintain scalable and secure cloud infrastructure as code
  • Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
  • Enable cost transparency and optimize infrastructure spending
  • Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
  • Build and maintain robust CI/CD pipelines that accelerate time from code to customer
  • Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
  • Lead and continuously improve our Incident Management process
  • Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
  • Develop and maintain incident response playbooks and post-mortem practices

AWS, Docker, CI/CD, Linux, Terraform

Posted 21 days ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Neon Inc.

  • 2+ years in an Engineering Management role, plus 5+ years of hands-on coding experience.
  • Strong background in leading/building teams that build cloud services or platforms.
  • Proven ability to lead and scale distributed teams across multiple time zones.
  • Strong mentoring skills, high emotional intelligence, and exceptional prioritization abilities.
  • Experience planning, shipping, and iterating on complex infrastructure projects with predictability.
  • Cloud: Azure and/or AWS experience
  • Infrastructure: Kubernetes (multi-cluster, multi-cloud), Linux environments
  • Monitoring: Prometheus ecosystem (Grafana, Loki, Tempo, VictoriaMetrics)
  • Scalable & Repeatable Infrastructure: Focus on efficiency and automation
  • Debugging & Innovation: Love solving challenges with no easy answers
  • Native or near-native verbal and written skills.
  • Manage a high-performing distributed team (5+ engineers across the EU), creating a culture of growth, collaboration, and innovation.
  • Remove Roadblocks: Identify and eliminate obstacles to maximize productivity and efficiency.
  • Coach & Mentor: Spend significant time helping engineers grow, supporting career development, and evaluating performance.
  • Optimize & Scale: Work closely with tech leads and product managers to refine processes, tackle tech debt, and ensure fast, high-quality delivery.
  • Enhance Communication: Foster strong collaboration within the team and across departments.
  • Drive Strategic Impact: Align infrastructure projects with broader business goals to maximize effectiveness.
  • Maintain Reliability: Ensure a healthy and scalable on-call process for the team.
  • Grow the Team: Expand our impact by recruiting and hiring top-tier Software Engineers.

AWS, Kubernetes, Azure, Go, Grafana, Postgres, Prometheus, Linux

Posted 22 days ago

📍 United States

🧭 Full-Time

🔍 AI Infrastructure

🏢 Company: Voltage Park 👥 1-10 💰 $500,000,000 over 1 year ago | Cloud Computing, Machine Learning

  • 8+ years working with Linux
  • 5+ years experience with AWS
  • 2+ years experience with Kubernetes
  • Experience with Terraform and Ansible
  • Experience with network attached storage management
  • Design and build new platforms
  • Deploy updates to support internal and customer use cases
  • Collaborate with network engineering, software development, and customer support
  • Participate in the SRE on-call rotation

AWS, Python, Bash, Kubernetes, Go, Prometheus, Linux, Terraform, Networking, Ansible

Posted 24 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏒 Company: AssuredCloud Data ServicesB2BCloud SecurityCyber Security

  • Experience in a start-up environment
  • Design and maintain highly available database solutions, ideally PostgreSQL
  • Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
  • Strong engineering background
  • Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)
  • Provision infrastructure and tooling
  • Create automated tooling to maintain the platform
  • Build methods for monitoring and scaling services
  • Implement security compliance strategies
  • Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

Posted 26 days ago

📍 United States, Canada

🧭 Full-Time

🔍 FinTech

  • Proficiency in Golang
  • Experience with AWS services
  • Familiarity with Kubernetes and Docker
  • Understanding of infrastructure as code with Terraform
  • Experience with observability tools like Prometheus and Grafana
  • Develop robust alerting systems
  • Refine incident management process
  • Design and maintain observability platform

AWS, Docker, Kubernetes, Grafana, Prometheus, Terraform

Posted 29 days ago