Senior Site Reliability Engineer

Posted 2024-11-16

💎 Seniority level: Senior

📍 Location: U.S.

💸 Salary: 140000 - 160000 USD per year

🔍 Industry: Cybersecurity / Open source software

🪄 Skills: Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Requirements:
  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise with multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency with at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi).
  • Proficiency using source control such as Git.
  • Ability to maintain discretion and handle sensitive information.
  • Staying current with trends and new technologies.
  • Collaborative and adaptable mindset.
  • Excellent communication skills.
  • Strong problem-solving skills.
Responsibilities:
  • Take ownership of the Bitwarden cloud infrastructure, focusing on quality.
  • Evaluate infrastructure regularly, making recommendations for reliability, security, availability, scalability, and cost management.
  • Implement site reliability tools and observability systems.
  • Respond to outages and participate in a 24x7 support strategy.
  • Contribute to architectural designs and engineering operations at scale.
  • Engage in code reviews and spread technical knowledge.
  • Contribute to incident management processes.
  • Collaborate with teams to refine priorities and deliverables.
  • Align SLIs, SLOs, and SLAs with product owners.
  • Identify opportunities for new initiatives.
  • Influence the SDLC as Bitwarden scales.
  • Mentor team members.

Related Jobs

πŸ“ United States

🧭 Full-Time

πŸ” Legal technology

🏒 Company: Ramp Talent

  • Curiosity, willingness to learn, and passion for continuous improvement.
  • Proficiency in all skills expected of an SRE II.
  • Bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience.
  • A minimum of 8 years of experience in hands-on technical roles.
  • A minimum of 2 years of Site Reliability Engineering experience.
  • Experience building autonomous systems that manage software operational details without human intervention.

  • Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products.
  • Being the voice of Reliability on your team throughout the SDLC.
  • Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events.
  • Improving the CI/CD pipeline.
  • Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time.
  • Identifying and fixing gaps in the availability of systems.
  • Improving and defending the security of software and systems.
  • Documenting and diagramming processes, procedures, and best practices.
  • Finding, learning, improving, or creating new tools that are reliable, usable, and helpful.
  • Mentoring, training, and reviewing more junior engineers.
  • Participating in an on-call rotation for 24/7 production reliability support.

Leadership, CI/CD, Mentoring

Posted 2024-11-21

πŸ“ USA

πŸ’Έ 170000 - 190000 USD per year

πŸ” Email Security

🏒 Company: Valimail

  • 5+ years experience building and maintaining highly available relational databases.
  • Work collaboratively with cross-functional teams.
  • Value team success over individual success.
  • Put industry and engineering best practices into practice and promote them to others.
  • Passion for reliable, scalable, and performant datastores with a strong sense of ownership.
  • Experience building and supporting highly performant and highly reliable datastores.
  • Deep experience working with Postgres.
  • Expertise in database fundamentals, SQL, and PL/pgSQL (or similar).
  • Experience with NoSQL datastores and caching solutions.
  • Working knowledge of AWS or Azure cloud providers.
  • Experience with Infrastructure-as-Code tools, such as Terraform.

  • Evangelizing standard methodologies for building and operating highly reliable data storage systems.
  • Serving as the subject matter expert in datastore design and performance.
  • Building and supporting Valimail’s mission-critical datastores.
  • Conducting timely postmortems of production datastore incidents.
  • Collaboratively designing systems with other engineers to meet reliability, scalability, and performance requirements.
  • Providing assistance to teams working with datastores.
  • Automating routine database tasks.
  • Participating in on-call rotation and incident response.
  • Upgrading data storage systems as necessary.

AWS, SQL, Azure, Postgres, NoSQL, Terraform

Posted 2024-11-20

πŸ“ United States, Canada

🧭 Full-Time

πŸ” Security and fraud detection

🏒 Company: DataVisor

  • 5+ years of experience with production environments running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.

  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

Linux

Posted 2024-11-09

πŸ“ US

🧭 Full-Time

πŸ’Έ 198000 - 220000 USD per year

πŸ” Blockchain, Cryptocurrency

🏒 Company: Uniswap Labs

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in site reliability engineering, DevOps, or related fields.
  • Strong understanding of reliability engineering principles and tools.
  • Proficiency in monitoring tools like Prometheus, Grafana, Nagios.
  • Experience with cloud platforms (AWS, Azure, GCP) and container orchestration systems (Kubernetes, Docker).
  • Proficiency in scripting and automation tools such as Python, Bash, Ansible, or Terraform.

  • Design, implement, and maintain systems for reliability, availability, and performance of services.
  • Develop and manage monitoring, alerting, and incident response strategies.
  • Conduct root cause analysis of failures.
  • Collaborate with cross-functional teams on reliability practices.
  • Drive improvements and innovations in systems and processes.

AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Collaboration, CI/CD, DevOps

Posted 2024-11-07

πŸ“ United States

🧭 Full-Time

πŸ’Έ 150000 - 230000 USD per year

πŸ” Public safety technology

🏒 Company: Axon

  • This position involves handling of classified federal data; under federal regulations, it is open to US Citizens only.
  • 10+ years of applicable experience.
  • Experience managing cloud platforms such as Azure, AWS, or similar.
  • Experience operating in Kubernetes platforms like AKS, EKS, or similar.
  • Experience using managed languages such as Python, Go, C#, Java, or similar.
  • Experience utilizing CI/CD platforms to automate provisioning infrastructure, software builds, tests, and releases.
  • Experience using observability tools such as APM, logging, and metrics to assist with debugging issues.
  • Experience designing tooling to simplify the operational management of SaaS/PaaS systems.
  • Familiarity with building flexible and testable Infrastructure as Code modules.
  • Empathy to support the needs of software engineers.

  • Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, and securely.
  • Exemplify cloud-native site reliability best practices.
  • Write code that is performant, maintainable, clear, and concise.
  • Employ strong problem-solving skills to debug problems in cloud-native distributed systems.
  • Influence and educate the engineering organization to adopt new and improved architectural patterns.
  • Provide robust documentation for use by engineers to promote self-service.
  • Take calculated risks, champion new ideas, and cultivate your craft.

AWS, Python, Java, Kubernetes, C#, Azure, Go, CI/CD

Posted 2024-11-07

πŸ“ America

🧭 Contract

πŸ” Digital paper solutions and learning ecosystem

🏒 Company: Goodnotes

  • Strong experience working in AWS-hosted environments.
  • Experience supporting production workloads and firefighting.
  • Knowledge of SRE best practices and common issues.
  • Proficient with system monitoring tools.
  • Understanding and experience with distributed databases.
  • Background in Linux and networking fundamentals.
  • Experience in back-end development, including API usage and creation.
  • Knowledge of security for networks and containers.
  • Understanding of container orchestration, especially Kubernetes.
  • Experience managing relational and non-relational databases, including backup and restore operations.
  • Familiarity with automation/configuration management tools, preferably CDK and/or Terraform.

  • Design, build, and maintain the Goodnotes infrastructure according to Dickerson’s Hierarchy of Reliability.
  • Refine and execute new and existing playbooks.
  • Educate teams on SRE best practices including design and capacity planning.
  • Act as a higher-level escalation point for applications.
  • Optimize latency and error rates and improve SLAs.
  • Enhance system monitoring, health reporting, and logging.
  • Implement security practices and maintain information security.
  • Participate in the on-call rotation during Americas time zone hours.

Linux

Posted 2024-11-07

πŸ“ US, Portugal

🧭 Full-Time

πŸ” Health Technology

  • Proficiency in programming languages such as Python, Go, or JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.

  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 2024-11-07

πŸ“ AL, AZ, CA, CO, CT, FL, GA, ID, IL, IN, IA, KY, ME, MD, MA, MI, MN, MO, NV, NJ, NY, NC, OH, OR, PA, TN, TX, VA, WA, WI

🧭 Full-Time

πŸ’Έ 110000 - 135000 USD per year

πŸ” Childcare software

🏒 Company: Procare Solutions

  • Minimum of 5 years of hands-on experience with AWS services including EC2, S3, RDS, Lambda, ECS/EKS.
  • Deep knowledge and extensive experience with Linux operating systems, including system administration and troubleshooting.
  • Familiarity with common SRE-related tools such as Kubernetes, Docker, Prometheus, Grafana, and the ELK stack.
  • Proficiency in infrastructure as code (IaC) tools like Terraform, Ansible, and CloudFormation.
  • Experience with monitoring solutions, including metrics setup and creating alerts.
  • Strong understanding of networking concepts, including DNS, load balancing, and firewalls.
  • Proficiency in at least one programming or scripting language such as Python, Go, or Bash.
  • Excellent problem-solving skills with a proactive and analytical approach.
  • Strong written and verbal communication skills, with the ability to collaborate effectively.
  • Experience in DevOps engineering, including CI/CD practices and tools.

  • Design, implement, and maintain scalable, reliable, and secure AWS infrastructure using best practices.
  • Develop and maintain monitoring, logging, and alerting solutions to ensure system health and performance.
  • Automate infrastructure provisioning, configuration, and deployment processes using tools like Terraform and Ansible.
  • Respond to production incidents, conduct root cause analysis, and implement corrective measures.
  • Continuously analyze system performance and implement tuning improvements.
  • Ensure systems comply with security best practices and manage IAM roles and policies.
  • Collaborate with development teams on reliability integration into the software development lifecycle.
  • Maintain comprehensive documentation of infrastructure and processes.

AWS, Docker, Python, Bash, Elasticsearch, Jenkins, Kibana, Kubernetes, Go, Grafana, Prometheus, Communication Skills, Collaboration, CI/CD, Problem Solving

Posted 2024-10-21

πŸ“ CA, CO, CT, FL, GA, IL, IN, KY, MA, MI, MN, NC, NJ, NY, OH, OR, PA, SC, TN, TX, UT, VA, WA, WI

πŸ’Έ 145000 - 175000 USD per year

πŸ” Benefits and employee experience

🏒 Company: Jellyvision

  • Demonstrated experience with cloud computing platforms, particularly AWS.
  • Proficient in programming languages including Ruby, Python, and JavaScript.
  • Experienced with configuration management tools such as Ansible, Packer, CloudFormation, and a strong emphasis on Terraform.
  • Skilled in container technologies and orchestration tools like Docker, ECS, and Kubernetes.
  • Experience with continuous integration tools such as GitLab, GitHub, and Jenkins.
  • Knowledge of best practices for monitoring and alerting to ensure system reliability.
  • Exceptional communication skills with various stakeholders.
  • Strong data-driven decision-making capabilities.

  • Design applications by advising development teams on best practices and architecting solutions for optimal performance.
  • Optimize CI/CD pipelines through strategic guidance, minimizing manual tasks, and enhancing operational efficiency.
  • Monitor systems by efficiently resolving alerts, participating in on-call rotations, and supporting application management.
  • Mentor team members by providing guidance, seeking continuous learning opportunities, and giving constructive feedback.

AWS, Docker, Python, Cloud Computing, JavaScript, Jenkins, Kubernetes, Ruby, Communication Skills, Collaboration, CI/CD

Posted 2024-10-21

πŸ“ USA

πŸ’Έ 160000 - 195000 USD per year

πŸ” Healthcare

🏒 Company: Garner Health

  • 5+ years of experience delivering software solutions.
  • 4+ years of hands-on production work with cloud infrastructure, containers, monitoring, and alerting.
  • 3+ years working in a security-conscious environment.
  • Expertise and experience leading cloud-first/only projects, preferably on AWS.
  • Expertise with Terraform.
  • Experience with Kubernetes.
  • Experience with Go and Python, particularly utilizing Kubernetes APIs.

  • Architect, operate, improve, and secure the platform the Garner Health app runs on.
  • Boost developer productivity.
  • Build systems to a high engineering standard and ensure others adhere to these standards.
  • Research and advocate for improved techniques, processes, and designs.
  • Collaborate with teammates on strategic platform initiatives.
  • Support the Garner platform in production.
  • Ensure security in production according to regulatory requirements.
  • Partner with stakeholders to maintain product availability and performance.

AWS, Python, Kubernetes, Go, Terraform

Posted 2024-10-21