Site Reliability Engineer

Posted 2024-10-23

Location: United States

Company: Jahnel Group

Languages: English

Skills: AWS, Docker, Python, Cybersecurity, Communication Skills

Requirements:
  • Strong written and verbal communication skills.
  • Exceptional planning, organizational, and problem-solving abilities.
  • Ability to thrive in a fast-paced environment.
  • Advanced understanding of Windows (Windows 11, Windows Server 2022) and Linux systems (Red Hat, SSH).
  • Strong scripting skills (Python) and Infrastructure as Code (IaC) experience with Terraform and Docker.
  • Deep knowledge of AWS infrastructure, including CloudFormation, AWS CDK, and Terraform.
  • Advanced understanding of enterprise networking, VPN, 802.1X authentication, and cybersecurity tools like NGAV and EDR.
  • Knowledge of security standards such as NIST and STIG.
  • Certifications like MCSE, CompTIA Server+, or Red Hat Certified System Administrator are a plus.
Responsibilities:
  • Monitor the health of servers, databases, networks, and security.
  • Optimize cybersecurity tools like antivirus and spam filtering.
  • Manage security patches and updates, including sandbox testing.
  • Plan and execute upgrades and security compliance projects.
  • Oversee vendor relationships and manage software licenses.
  • Scale servers/services for changing loads.
  • Analyze and resolve security and vulnerability threats.
  • Respond to and resolve server monitoring alerts.
  • Create hardened images for servers.
  • Install and update servers, cloud services, and applications.
  • Review configurations for services like antivirus, VPN, and MFA.
  • Manage server and email certificates.
  • Resolve security issues from penetration tests.
  • Conduct fire drills for incidents and ensure compliance with PCI-DSS v4 and NIST standards.

Related Jobs

Location: United States

Employment type: Full-Time

Industry: Legal technology

Company: Ramp Talent

Requirements:
  • Curiosity, willingness to learn, and passion for continuous improvement.
  • Proficiency in all skills expected of SRE IIs.
  • Bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience.
  • A minimum of 8 years of experience in hands-on technical roles.
  • A minimum of 2 years of Site Reliability Engineering experience.
  • Experience building autonomous systems that manage software operational details without human intervention.
Responsibilities:
  • Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products.
  • Being the voice of Reliability on your team throughout the SDLC.
  • Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events.
  • Improving the CI/CD pipeline.
  • Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time.
  • Identifying and fixing gaps in the availability of systems.
  • Improving and defending the security of software and systems.
  • Documenting and diagramming processes, procedures, and best practices.
  • Finding, learning, improving, or creating new tools that are reliable, usable, and helpful.
  • Mentoring, training, and reviewing more junior engineers.
  • Participating in an on-call rotation for 24/7 production reliability support.

Skills: Leadership, CI/CD, Mentoring

Posted 2024-11-21

Location: United States

Employment type: Full-Time

Salary: 102,600 - 120,323 USD per year

Industry: Recycling technology

Company: AMP Sortation

Requirements:
  • Strong technical communication skills for ticket escalations.
  • Strong interpersonal skills for communicating with individuals impacted by downtime.
  • Experience troubleshooting Linux systems.
  • Demonstrated coding experience in C++ or Rust.
  • Desire to learn professional software engineering practices.
  • Proficiency in managing tasks under sprint or kanban methodology.
  • Passion for green technology and emissions reduction.
Responsibilities:
  • Triage and respond to tickets during core working hours.
  • Troubleshoot operating system, hardware, networking, and application issues.
  • Maintain documentation for engineering support.
  • Define improvements to the Jira ticketing system.
  • Develop and support AMP's observability stack.

Skills: C++, Jira, Grafana, Prometheus, Rust, Communication Skills, Linux, Documentation

Posted 2024-11-21

Location: USA

Salary: 170,000 - 190,000 USD per year

Industry: Email Security

Company: Valimail

Requirements:
  • 5+ years of experience building and maintaining highly available relational databases.
  • Ability to work collaboratively with cross-functional teams.
  • Valuing team success over individual success.
  • Applying industry and engineering best practices and promoting them to others.
  • Passion for reliable, scalable, and performant datastores, with a strong sense of ownership.
  • Experience building and supporting highly performant and highly reliable datastores.
  • Deep experience working with Postgres.
  • Expertise in database fundamentals, SQL, and PL/pgSQL (or other).
  • Experience with NoSQL datastores and caching solutions.
  • Working knowledge of AWS or Azure cloud providers.
  • Experience with Infrastructure-as-Code tools, such as Terraform.
Responsibilities:
  • Evangelizing standard methodologies for building and operating highly reliable data storage systems.
  • Serving as the subject matter expert in datastore design and performance.
  • Building and supporting Valimail’s mission-critical datastores.
  • Conducting timely post-mortems of production datastore incidents.
  • Collaboratively designing systems with other engineers to meet reliability, scalability, and performance requirements.
  • Providing assistance to teams working with datastores.
  • Automating routine database tasks.
  • Participating in on-call rotation and incident response.
  • Upgrading data storage systems as necessary.

Skills: AWS, SQL, Azure, Postgres, NoSQL, Terraform

Posted 2024-11-20

Location: U.S.

Employment type: Full-Time

Salary: 140,000 - 160,000 USD per year

Industry: Cybersecurity / Open source software

Requirements:
  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise with multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency with at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi).
  • Proficiency using source control such as Git.
  • Ability to maintain discretion and handle sensitive information.
  • Staying current with trends and new technologies.
  • Collaborative and adaptable mindset.
  • Excellent communication skills.
  • Strong problem-solving skills.
Responsibilities:
  • Take ownership of the Bitwarden cloud infrastructure, focusing on quality.
  • Evaluate infrastructure regularly, making recommendations for reliability, security, availability, scalability, and cost management.
  • Implement site reliability tools and observability systems.
  • Respond to outages and participate in a 24x7 support strategy.
  • Contribute to architectural designs and engineering operations at scale.
  • Engage in code reviews and spread technical knowledge.
  • Contribute to incident management processes.
  • Collaborate with teams to refine priorities and deliverables.
  • Align SLIs, SLOs, and SLAs with product owners.
  • Identify opportunities for new initiatives.
  • Influence the SDLC as Bitwarden scales.
  • Mentor team members.

Skills: Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Posted 2024-11-16

Location: United States

Employment type: Full-Time

Salary: 204,000 - 281,000 USD per year

Industry: Cybersecurity

Company: SentinelOne

Requirements:
  • Extensive SRE Experience: Proven experience in architecting and implementing SRE solutions at scale within a microservices or distributed systems environment.
  • 15+ years of progressive professional experience, with 5+ years of recent experience supporting enterprise SaaS environments.
  • Technical Expertise: Deep knowledge of incident management, alert correlation, automated triage, and SLO frameworks.
  • Proficiency in one or more programming languages (e.g., Python, Go, Java) with experience in automation and scripting.
  • Experience with machine learning and data analytics for real-time alert systems.
  • Expertise in cloud platforms (e.g., AWS, GCP, Azure) and container orchestration (e.g., Kubernetes).
  • Ability to make critical architectural decisions focused on business impact and system performance.
Responsibilities:
  • Design and guide the implementation of end-to-end alert correlation, auto-triage, and auto-remediation frameworks for a microservices SaaS architecture.
  • Ensure solutions align with business priorities and customer impact goals.
  • Define, implement, and monitor SLOs in collaboration with product and engineering teams.
  • Establish reliability standards to drive accountability around service performance.
  • Partner with software engineers, SREs, and data scientists to implement monitoring, alerting, and SLO solutions.
  • Lead initiatives promoting best practices across SentinelOne engineering.
  • Mentor engineers and contribute to a culture of reliability engineering excellence.

Skills: AWS, Leadership, Python, Data Analysis, GCP, Java, Kubernetes, Machine Learning, Azure, Go, Collaboration, Terraform, Microservices

Posted 2024-11-15

Location: United States

Salary: 192,000 - 288,000 USD per year

Industry: Frontend Cloud and web services

Company: Vercel

Requirements:
  • At least 3 years of experience in an SRE role, or at least 5 years of experience in an adjacent role (e.g., platform engineering), operating in a scaled environment.
  • Firm grasp of the SRE philosophy and mindset, with practical experience working on or directly with SRE teams that have proactively engaged in system design and improvement.
  • Strong sense of accountability and commitment to problem-solving, backed by curiosity to dig deep and identify root causes.
  • Willingness to proactively engage with development teams to influence the course of software design and operational practices.
  • Capability to manage risk, make decisions, and exhibit sound judgment.
  • Demonstrated ability to plan and deliver long-term projects.
  • Familiarity with networking protocols and application serving.
  • Experience deploying and operating systems on AWS infrastructure at scale.
  • Bonus: Experience working with Terraform, Kubernetes, Golang, and/or Lua.
Responsibilities:
  • Ensure that our products are built for reliability and scale by engaging in the end-to-end design, development, and deployment of new software.
  • Drive continuous risk mitigation and reduction through direct involvement in incident management, blameless postmortems, and follow-ups.
  • Drive measurable improvements to the reliability, performance, and efficiency of our production systems through instrumentation, analysis, and implementation of engineering improvements.
  • Devise repeatable, low-toil operational practices through the development of automated systems for software delivery, system failover, and capacity management.

Skills: AWS, Problem Solving

Posted 2024-11-13

Location: United States

Employment type: Full-Time

Salary: 147,100 - 207,600 USD per year

Industry: Cloud Infrastructure and Software Engineering

Company: HashiCorp

Requirements:
  • Professional experience designing or operating disaster recovery processes in a distributed cloud environment.
  • Professional experience with incident management in cloud environments.
  • Enjoyment of working across scopes spanning software engineering, cloud infrastructure, and SRE.
  • Experience contributing to efficiency improvements of software at scale.
  • Experience collaborating cross-functionally to deliver engineering culture change.
  • Experience on infrastructure teams in customer-centric and agile organizations, working with empathy and compassion.
  • Experience with SaaS or other managed software offerings.
  • Experience in one or more of the major public clouds.
Responsibilities:
  • Utilize software engineering experience to solve problems and build automation for incident lifecycle management.
  • Coordinate disaster recovery processes and identify strategic process improvements.
  • Drive incident management capabilities and culture.
  • Participate in incident command on-call rotation.
  • Support incident management tooling.
  • Build technical skills and relationships within a team of engineers and SREs.
  • Learn, teach, and collaborate cross-functionally.

Skills: Agile, Product Development, Strategy, Communication Skills, Collaboration

Posted 2024-11-12

Location: United States, Canada

Employment type: Full-Time

Industry: Security and fraud detection

Company: DataVisor

Requirements:
  • 5+ years of experience with production environments running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.
Responsibilities:
  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

Skills: Linux

Posted 2024-11-09

Location: CA, CO, CT, FL, GA, HI, IL, IN, IA, MD, MA, MI, MO, NJ, NM, NY, NC, OH, PA, TN, TX, UT, VA, WA

Employment type: Full-Time

Salary: 135,520 - 178,060 USD per year

Industry: Non-profit mental health support

Company: Crisis Text Line

Requirements:
  • Bachelor's degree in Computer Science, Engineering, or related field; Master’s preferred.
  • Proven experience as a Staff SRE or in a similar role.
  • Experience maintaining the reliability of online SaaS/PaaS.
  • Proficiency in AWS and infrastructure as code (Terraform, CloudFormation).
  • Strong scripting skills (Python) and knowledge of containerization (Docker, Kubernetes).
  • Experience in CI/CD pipelines and observability tools (GitHub Actions, Datadog).
  • Understanding of network protocols and security principles.
Responsibilities:
  • Helping to lead and mentor a team of 5 SREs.
  • Designing, implementing, and maintaining AWS infrastructure.
  • Collaborating with developers for performance optimization.
  • Developing monitoring, logging, and alerting systems.
  • Automating repetitive tasks to improve efficiency.
  • Responding to incidents to minimize downtime.
  • Supporting diversity on the engineering team.
  • Communicating expectations and progress clearly.
  • Providing mentorship and promoting technical best practices.
  • Participating in retrospectives to improve processes.
  • Conducting regular security audits.

Skills: AWS, Docker, GraphQL, PHP, Python, GCP, Kubernetes, Azure, Data Structures, Go, Next.js, Communication Skills, Collaboration, CI/CD, DevOps, Terraform, Compliance

Posted 2024-11-09

Location: US

Employment type: Full-Time

Salary: 198,000 - 220,000 USD per year

Industry: Blockchain, Cryptocurrency

Company: Uniswap Labs

Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in site reliability engineering, DevOps, or related fields.
  • Strong understanding of reliability engineering principles and tools.
  • Proficiency in monitoring tools like Prometheus, Grafana, Nagios.
  • Experience with cloud platforms (AWS, Azure, GCP) and container orchestration systems (Kubernetes, Docker).
  • Proficiency in scripting tools such as Python, Bash, Ansible, or Terraform.
Responsibilities:
  • Design, implement, and maintain systems for reliability, availability, and performance of services.
  • Develop and manage monitoring, alerting, and incident response strategies.
  • Conduct root cause analysis of failures.
  • Collaborate with cross-functional teams on reliability practices.
  • Drive improvements and innovations in systems and processes.

Skills: AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Collaboration, CI/CD, DevOps

Posted 2024-11-07