Senior Site Reliability Engineer

Posted 2024-09-19

💎 Seniority level: Senior, 5+ years

📍 Location: United States, Canada

💸 Salary: $139,000 - $218,000 per year

🔍 Industry: Web Development

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React

Requirements:
  • Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
  • Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
  • Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
  • Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
  • Experience contributing to full-stack applications built using tools like React, Node, and MongoDB.
  • Enthusiasm for mentoring and sponsoring less-experienced engineers.
Responsibilities:
  • Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
  • Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
  • Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
  • Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
  • Participate in and continuously improve on-call and incident response processes.

Related Jobs

📍 United States

🧭 Full-Time

🔍 Legal technology

🏢 Company: Ramp Talent

Requirements:
  • Curiosity, willingness to learn, and passion for continuous improvement.
  • Proficiency in all skills expected of SRE IIs.
  • Bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience.
  • A minimum of 8 years of experience in hands-on technical roles.
  • A minimum of 2 years of Site Reliability Engineering experience.
  • Experience building autonomous systems that manage software operational details without human intervention.

Responsibilities:
  • Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products.
  • Being the voice of Reliability on your team throughout the SDLC.
  • Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events.
  • Improving the CI/CD pipeline.
  • Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time.
  • Identifying and fixing gaps in the availability of systems.
  • Improving and defending the security of software and systems.
  • Documenting and diagramming processes, procedures, and best practices.
  • Finding, learning, improving, or creating new tools that are reliable, usable, and helpful.
  • Mentoring, training, and reviewing more junior engineers.
  • Participating in an on-call rotation for 24/7 production reliability support.

🪄 Skills: Leadership, CI/CD, Mentoring

Posted 2024-11-21

📍 Canada

🔍 Software Supply Chain Management

🏢 Company: FOSSA

Requirements:
  • Strong, demonstrated experience as a technical lead designing, building, and maintaining scalable infrastructure and tooling.
  • Strong knowledge of at least one cloud platform and maintaining managed services (we use AWS).
  • Strong experience implementing Infrastructure as Code using Terraform, Helm, and Kubernetes.
  • Experience building and maintaining build pipelines and deploying new services, and familiarity with CI/CD tools such as Buildkite, CircleCI, and GitHub Actions.
  • Experience with logging and monitoring tools such as Datadog, StatsD, Prometheus, and Grafana.
  • Experience with packaging and deploying services using Docker on Linux.
  • Ability to break down complex problems, troubleshoot, drive towards a solution, and communicate it with the team and stakeholders.
  • Willingness to accept feedback and incorporate it into work.
  • Experience with source control tooling and processes, including branching, merging, and rebasing (we use git).
  • Willingness to take part in an on-call rotation.

Responsibilities:
  • Scale cloud infrastructure to meet increasing demand.
  • Assist development teams in deploying new services.
  • Ensure platform security and adherence to best practices.
  • Improve development tools, CI/CD pipelines, monitoring, and release processes.
  • Help teams use Helm and Kubernetes, and shape best practices.
  • Build access control and secret management solutions.
  • Maintain deployments for on-premise customers.

🪄 Skills: AWS, Docker, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Terraform

Posted 2024-11-17

📍 U.S.

🧭 Full-Time

💸 $140,000 - $160,000 per year

🔍 Cybersecurity / Open source software

Requirements:
  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise with multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency with at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi).
  • Proficiency using source control such as Git.
  • Ability to maintain discretion and handle sensitive information.
  • Staying current with trends and new technologies.
  • Collaborative and adaptable mindset.
  • Excellent communication skills.
  • Strong problem-solving skills.

Responsibilities:
  • Take ownership of the Bitwarden cloud infrastructure, focusing on quality.
  • Evaluate infrastructure regularly, making recommendations for reliability, security, availability, scalability, and cost management.
  • Implement site reliability tools and observability systems.
  • Respond to outages and participate in a 24x7 support strategy.
  • Contribute to architectural designs and engineering operations at scale.
  • Engage in code reviews and spread technical knowledge.
  • Contribute to incident management processes.
  • Collaborate with teams to refine priorities and deliverables.
  • Align SLIs, SLOs, and SLAs with product owners.
  • Identify opportunities for new initiatives.
  • Influence the SDLC as Bitwarden scales.
  • Mentor team members.

🪄 Skills: Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Posted 2024-11-16

📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

Requirements:
  • 5+ years of experience with production environments running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.

Responsibilities:
  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

🪄 Skills: Linux

Posted 2024-11-09

📍 United States

🧭 Full-Time

💸 $150,000 - $230,000 per year

🔍 Public safety technology

🏢 Company: Axon

Requirements:
  • This position involves handling of classified federal data; under federal regulations, it is open to US Citizens only.
  • 10+ years of applicable experience.
  • Experience managing cloud platforms such as Azure, AWS, or similar.
  • Experience operating in Kubernetes platforms like AKS, EKS, or similar.
  • Experience using managed languages such as Python, Go, C#, Java, or similar.
  • Experience utilizing CI/CD platforms to automate provisioning infrastructure, software builds, tests, and releases.
  • Experience using observability tools such as APM, logging, and metrics to assist with debugging issues.
  • Experience designing tooling to simplify the operational management of SaaS/PaaS systems.
  • Familiarity with building flexible and testable Infrastructure as Code modules.
  • Empathy to support the needs of software engineers.

Responsibilities:
  • Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, and securely.
  • Exemplify cloud-native site reliability best practices.
  • Write code that is performant, maintainable, clear, and concise.
  • Employ strong problem-solving skills to debug problems in cloud-native distributed systems.
  • Influence and educate the engineering organization to adopt new and improved architectural patterns.
  • Provide robust documentation for use by engineers to promote self-service.
  • Take calculated risks, champion new ideas, and cultivate your craft.

🪄 Skills: AWS, Python, Java, Kubernetes, C#, Azure, Go, CI/CD

Posted 2024-11-07

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

Requirements:
  • Proficiency in programming languages such as Python, Go, and JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.

Responsibilities:
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

🪄 Skills: AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 2024-11-07

📍 CA, CO, CT, FL, GA, IL, IN, KY, MA, MI, MN, NC, NJ, NY, OH, OR, PA, SC, TN, TX, UT, VA, WA, WI

💸 $145,000 - $175,000 per year

🔍 Benefits and employee experience

🏢 Company: Jellyvision

Requirements:
  • Demonstrated experience with cloud computing platforms, particularly AWS.
  • Proficient in programming languages including Ruby, Python, and JavaScript.
  • Experienced with configuration management tools such as Ansible, Packer, and CloudFormation, with a strong emphasis on Terraform.
  • Skilled in container technologies and orchestration tools like Docker, ECS, and Kubernetes.
  • Experience with continuous integration tools such as GitLab, GitHub, and Jenkins.
  • Knowledge of best practices for monitoring and alerting to ensure system reliability.
  • Exceptional communication skills with various stakeholders.
  • Strong data-driven decision-making capabilities.

Responsibilities:
  • Design applications by advising development teams on best practices and architecting solutions for optimal performance.
  • Optimize CI/CD pipelines through strategic guidance, minimizing manual tasks, and enhancing operational efficiency.
  • Monitor systems by efficiently resolving alerts, participating in on-call rotations, and supporting application management.
  • Mentor team members by providing guidance, seeking continuous learning opportunities, and giving constructive feedback.

🪄 Skills: AWS, Docker, Python, Cloud Computing, JavaScript, Jenkins, Kubernetes, Ruby, Communication Skills, Collaboration, CI/CD

Posted 2024-10-21

📍 USA

💸 $160,000 - $195,000 per year

🔍 Healthcare

🏢 Company: Garner Health

Requirements:
  • 5+ years of experience delivering software solutions.
  • 4+ years of hands-on production work with cloud infrastructure, containers, monitoring, and alerting.
  • 3+ years working in a security-conscious environment.
  • Expertise and experience leading cloud-first/only projects, preferably on AWS.
  • Expertise with Terraform.
  • Experience with Kubernetes.
  • Experience with Go and Python, particularly utilizing Kubernetes APIs.

Responsibilities:
  • Architect, operate, improve, and secure the platform the Garner Health app runs on.
  • Boost developer productivity.
  • Build systems to a high engineering standard and ensure others adhere to these standards.
  • Research and advocate for improved techniques, processes, and designs.
  • Collaborate with teammates on strategic platform initiatives.
  • Support the Garner platform in production.
  • Ensure security in production according to regulatory requirements.
  • Partner with stakeholders to maintain product availability and performance.

🪄 Skills: AWS, Python, Kubernetes, Go, Terraform

Posted 2024-10-21

📍 Canada

🔍 Software development for small businesses

🏢 Company: Jobber

Requirements:
  • Demonstrated expertise in providing systems support within a cloud environment, preferably AWS and its various services.
  • Experience with IaC using Terraform.
  • Experience optimizing and improving continuous deployment performance.
  • Ability to juggle multiple projects and incident management.
  • Relevant experience in programming languages such as Ruby, Python, or Bash.
  • Deep passion for learning, reliability, automation, orchestration, and continuous improvement.
  • Strong commitment to problem-solving and keen interest in technology.
  • Exceptional interpersonal skills for collaboration in high-pressure situations.

Responsibilities:
  • Collaborate on the design, implementation, operation, and maintenance of AWS infrastructure.
  • Leverage Infrastructure-as-Code principles.
  • Develop and maintain local development tooling and observability tools, including deployment tools and interfaces into AWS, CircleCI, and other infrastructure components.
  • Participate in on-call rotation and contribute to enhancing the on-call experience for the team.

🪄 Skills: AWS, Python, Bash, Ruby, Ruby on Rails, React, Collaboration, Problem Solving, Terraform

Posted 2024-10-16

📍 United States

💸 $161,000 - $180,000 per year

🔍 Adult entertainment

🏢 Company: Multi Media LLC

Requirements:
  • STEM degree and relevant experience as a Site Reliability Engineer.
  • Exceptional problem-solving skills.
  • High proficiency in one of the following: C, C++, Java, Python, Go, etc.
  • High proficiency in Unix/Linux environments and excellent knowledge of internals (e.g., filesystems, system calls).
  • Networking knowledge (e.g., routing, switching, TCP stack) for both metal and cloud (VPC, Security Groups) environments.
  • Experience in database administration and configuration.
  • Experience with DevOps tools such as Terraform, Ansible, Docker, and Kubernetes.
  • On-call response to monitoring and alerting for core website functions, as needed.

Responsibilities:
  • Performance analysis to identify sources of instability using data from APM and distributed telemetry tools.
  • Analyze complex systems to identify operational surprises and minimize downtime.
  • Software engineering and patching to incrementally improve performance, scalability, and reliability.
  • Infrastructure modifications in both a data center metal environment with advanced routing/switching and in the public cloud.
  • Predictive failure analysis and disaster planning.
  • Author new tools and automation to streamline the DevOps pipeline.
  • Collaborate with other engineering teams.
  • Database and key-value store administration and configuration with a focus on uptime and performance.
  • Incident response and postmortem reports.

🪄 Skills: Docker, Python, Java, Kubernetes, Terraform

Posted 2024-10-05