Senior Site Reliability Engineer

Posted 2024-10-21

💎 Seniority level: Senior, Minimum 5 years

📍 Location: AL, AZ, CA, CO, CT, FL, GA, ID, IL, IN, IA, KY, ME, MD, MA, MI, MN, MO, NV, NJ, NY, NC, OH, OR, PA, TN, TX, VA, WA, WI

💸 Salary: 110000 - 135000 USD per year

🔍 Industry: Childcare software

🏢 Company: Procare Solutions

🗣️ Languages: English

⏳ Experience: Minimum 5 years

🪄 Skills: AWS, Docker, Python, Bash, Elasticsearch, Jenkins, Kibana, Kubernetes, Go, Grafana, Prometheus, Communication Skills, Collaboration, CI/CD, Problem Solving

Requirements:
  • Minimum 5 years of hands-on experience with AWS services including EC2, S3, RDS, Lambda, and ECS/EKS.
  • Deep knowledge and extensive experience with Linux operating systems, including system administration and troubleshooting.
  • Familiarity with common SRE-related tools such as Kubernetes, Docker, Prometheus, Grafana, and the ELK stack.
  • Proficiency in infrastructure as code (IaC) tools like Terraform, Ansible, and CloudFormation.
  • Experience with monitoring solutions, including metrics setup and creating alerts.
  • Strong understanding of networking concepts, including DNS, load balancing, and firewalls.
  • Proficiency in at least one programming or scripting language such as Python, Go, or Bash.
  • Excellent problem-solving skills with a proactive and analytical approach.
  • Strong written and verbal communication skills, with the ability to collaborate effectively.
  • Experience in DevOps engineering, including CI/CD practices and tools.
Responsibilities:
  • Design, implement, and maintain scalable, reliable, and secure AWS infrastructure using best practices.
  • Develop and maintain monitoring, logging, and alerting solutions to ensure system health and performance.
  • Automate infrastructure provisioning, configuration, and deployment processes using tools like Terraform and Ansible.
  • Respond to production incidents, conduct root cause analysis, and implement corrective measures.
  • Continuously analyze system performance and implement tuning improvements.
  • Ensure systems comply with security best practices and manage IAM roles and policies.
  • Collaborate with development teams on reliability integration into the software development lifecycle.
  • Maintain comprehensive documentation of infrastructure and processes.

Related Jobs

📍 United States

🧭 Full-Time

🔍 Legal technology

🏢 Company: Ramp Talent

  • Curiosity, willingness to learn, and passion for continuous improvement.
  • Proficiency in all skills expected of an SRE II.
  • Bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience.
  • A minimum of 8 years of experience in hands-on technical roles.
  • A minimum of 2 years of Site Reliability Engineering experience.
  • Experience building autonomous systems that manage software operational details without human intervention.

  • Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products.
  • Being the voice of Reliability on your team throughout the SDLC.
  • Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events.
  • Improving the CI/CD pipeline.
  • Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time.
  • Identifying and fixing gaps in the availability of systems.
  • Improving and defending the security of software and systems.
  • Documenting and diagramming processes, procedures, and best practices.
  • Finding, learning, improving, or creating new tools that are reliable, usable, and helpful.
  • Mentoring, training, and reviewing more junior engineers.
  • Participating in an on-call rotation for 24/7 production reliability support.

🪄 Skills: Leadership, CI/CD, Mentoring

Posted 2024-11-21

📍 U.S.

🧭 Full-Time

💸 140000 - 160000 USD per year

🔍 Cybersecurity / Open source software

  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise with multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency with at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi).
  • Proficiency using source control such as Git.
  • Ability to maintain discretion and handle sensitive information.
  • Staying current with trends and new technologies.
  • Collaborative and adaptable mindset.
  • Excellent communication skills.
  • Strong problem-solving skills.

  • Take ownership of the Bitwarden cloud infrastructure, focusing on quality.
  • Evaluate infrastructure regularly, making recommendations for reliability, security, availability, scalability, and cost management.
  • Implement site reliability tools and observability systems.
  • Respond to outages and participate in a 24x7 support strategy.
  • Contribute to architectural designs and engineering operations at scale.
  • Engage in code reviews and spread technical knowledge.
  • Contribute to incident management processes.
  • Collaborate with teams to refine priorities and deliverables.
  • Align SLIs, SLOs, and SLAs with product owners.
  • Identify opportunities for new initiatives.
  • Influence the SDLC as Bitwarden scales.
  • Mentor team members.

🪄 Skills: Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Posted 2024-11-16

📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

  • 5+ years of experience with production environments running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.

  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

🪄 Skills: Linux

Posted 2024-11-09

📍 United States

🧭 Full-Time

💸 150000 - 230000 USD per year

🔍 Public safety technology

🏢 Company: Axon

  • This position involves handling of classified federal data; under federal regulations, it is open to US Citizens only.
  • 10+ years of applicable experience.
  • Experience managing cloud platforms such as Azure, AWS, or similar.
  • Experience operating in Kubernetes platforms like AKS, EKS, or similar.
  • Experience using managed languages such as Python, Go, C#, Java, or similar.
  • Experience utilizing CI/CD platforms to automate provisioning infrastructure, software builds, tests, and releases.
  • Experience using observability tools such as APM, logging, and metrics to assist with debugging issues.
  • Experience designing tooling to simplify the operational management of SaaS/PaaS systems.
  • Familiarity with building flexible and testable Infrastructure as Code modules.
  • Empathy to support the needs of software engineers.

  • Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, and securely.
  • Exemplify cloud-native site reliability best practices.
  • Write code that is performant, maintainable, clear, and concise.
  • Employ strong problem-solving skills to debug problems in cloud-native distributed systems.
  • Influence and educate the engineering organization to adopt new and improved architectural patterns.
  • Provide robust documentation for use by engineers to promote self-service.
  • Take calculated risks, champion new ideas, and cultivate your craft.

🪄 Skills: AWS, Python, Java, Kubernetes, C#, Azure, Go, CI/CD

Posted 2024-11-07

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

  • Proficiency in programming languages such as Python, Go, or JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.

  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

🪄 Skills: AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 2024-11-07

📍 CA, CO, CT, FL, GA, IL, IN, KY, MA, MI, MN, NC, NJ, NY, OH, OR, PA, SC, TN, TX, UT, VA, WA, WI

💸 145000 - 175000 USD per year

🔍 Benefits and employee experience

🏢 Company: Jellyvision

  • Demonstrated experience with cloud computing platforms, particularly AWS.
  • Proficient in programming languages including Ruby, Python, and JavaScript.
  • Experienced with configuration management tools such as Ansible, Packer, CloudFormation, and a strong emphasis on Terraform.
  • Skilled in container technologies and orchestration tools like Docker, ECS, and Kubernetes.
  • Experience with continuous integration tools such as GitLab, GitHub, and Jenkins.
  • Knowledge of best practices for monitoring and alerting to ensure system reliability.
  • Exceptional communication skills with various stakeholders.
  • Strong data-driven decision-making capabilities.

  • Design applications by advising development teams on best practices and architecting solutions for optimal performance.
  • Optimize CI/CD pipelines through strategic guidance, minimizing manual tasks, and enhancing operational efficiency.
  • Monitor systems by efficiently resolving alerts, participating in on-call rotations, and supporting application management.
  • Mentor team members by providing guidance, seeking continuous learning opportunities, and giving constructive feedback.

🪄 Skills: AWS, Docker, Python, Cloud Computing, JavaScript, Jenkins, Kubernetes, Ruby, Communication Skills, Collaboration, CI/CD

Posted 2024-10-21

📍 USA

💸 160000 - 195000 USD per year

🔍 Healthcare

🏢 Company: Garner Health

  • 5+ years of experience delivering software solutions.
  • 4+ years of hands-on production work with cloud infrastructure, containers, monitoring, and alerting.
  • 3+ years working in a security-conscious environment.
  • Expertise and experience leading cloud-first/only projects, preferably on AWS.
  • Expertise with Terraform.
  • Experience with Kubernetes.
  • Experience with Go and Python, particularly utilizing Kubernetes APIs.

  • Architect, operate, improve, and secure the platform the Garner Health app runs on.
  • Boost developer productivity.
  • Build systems to a high engineering standard and ensure others adhere to these standards.
  • Research and advocate for improved techniques, processes, and designs.
  • Collaborate with teammates on strategic platform initiatives.
  • Support the Garner platform in production.
  • Ensure security in production according to regulatory requirements.
  • Partner with stakeholders to maintain product availability and performance.

🪄 Skills: AWS, Python, Kubernetes, Go, Terraform

Posted 2024-10-21

📍 United States

💸 $161,000 - $180,000 per year

🔍 Adult entertainment

🏢 Company: Multi Media LLC

  • STEM degree and relevant experience as a Site Reliability Engineer
  • Exceptional problem-solving skills
  • High proficiency in one of the following: C, C++, Java, Python, Go, etc.
  • High proficiency in Unix/Linux environment, excellent knowledge of internals (e.g., filesystems, system calls)
  • Networking knowledge (e.g., routing, switching, TCP stack) for both metal and cloud (VPC, Security Groups) environments
  • Experience in database administration and configuration
  • Experience with DevOps tools such as Terraform, Ansible, Docker, Kubernetes
  • On-call availability to respond to monitoring and alerting of core website functions as needed

  • Performance analysis to identify sources of instability using data from APM and distributed telemetry data tools
  • Analyze complex systems to identify operational surprises and minimize downtime.
  • Software engineering and patching to incrementally improve performance, scalability, and reliability
  • Infrastructure modifications in both a data center metal environment with advanced routing/switching and in the public cloud
  • Predictive failure analysis and disaster planning
  • Author new tools and automation to streamline the DevOps pipeline
  • Collaborate with other engineering teams
  • Database and key-value store administration and configuration with a focus on uptime and performance
  • Incident response and postmortem reports

🪄 Skills: Docker, Python, Java, Kubernetes, Terraform

Posted 2024-10-05

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

  • Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
  • Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
  • Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
  • Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
  • Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
  • Enthusiasm for mentoring and sponsoring less-experienced engineers.

  • Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
  • Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
  • Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
  • Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
  • Participate in and continuously improve on-call and incident response processes.

🪄 Skills: AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React

Posted 2024-09-19

📍 North America

🧭 Full-Time

🔍 Incident Management Platform

🏢 Company: Rootly

  • You have 5+ years of experience in an SRE or Infrastructure Engineering role.
  • You have 5+ years of experience writing software in a SWE or software-heavy SRE role.
  • You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
  • You’ve supported web or RPC services at a significant scale.
  • You have experience solving infrastructure problems by writing software.
  • You have a big-picture perspective on systems and tools.
  • You can collaborate with other Engineering teams to understand their systems and help to improve them.

  • Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
  • Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
  • Build tools to support our processes.
  • Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
  • Work with other teams around Engineering to understand their systems and their challenges at the code level and identify improvements in Rootly Infrastructure to improve the services they own (contribute code where possible).

🪄 Skills: AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, CI/CD

Posted 2024-09-13