Senior Site Reliability Engineer

Posted 2024-09-05

💎 Seniority level: Senior, 5+ years

🔍 Industry: Fintech

🏢 Company: Upgrade, 👥 1001-5000, Consulting

⏳ Experience: 5+ years

🪄 Skills: DevOps, Terraform

Requirements:
  • 5+ years of production-level SRE/DevOps experience in a cloud-based environment.
  • In-depth knowledge and hands-on experience with AWS services.
  • Proficiency in programming/scripting languages such as PowerShell, Python, or Bash.
  • Experience with SQL Server databases and Windows Server environments.
  • Knowledge of Ansible (Chef/Puppet) or Terraform.
  • Strong understanding of systems, networks, troubleshooting techniques, and build pipeline automation.
  • Ability to operate in an agile, fast-paced, entrepreneurial start-up environment.
  • Experience providing SRE/DevOps support to development teams for debugging Java applications is a plus.
Responsibilities:
  • Build a resilient, secure, and efficient cloud-based platform.
  • Automate deployment, monitoring, management, and incident response.
  • Monitor and troubleshoot platform issues.
  • Build and scale technology infrastructure to meet increasing demand.
  • Manage cross-functional requirements and collaborate with various stakeholders.
  • Work with Development and QA to deploy new features and services.
  • Develop and improve operational practices and procedures.

Related Jobs

📍 United States

🧭 Full-Time

🔍 Legal technology

🏢 Company: Ramp Talent

  • Curiosity, willingness to learn, and passion for continuous improvement.
  • Proficiency in all skills expected of SRE IIs.
  • Bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience.
  • A minimum of 8 years of experience in hands-on technical roles.
  • A minimum of 2 years of Site Reliability Engineering experience.
  • Experience building autonomous systems that manage software operational details without human intervention.

  • Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products.
  • Being the voice of Reliability on your team throughout the SDLC.
  • Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events.
  • Improving the CI/CD pipeline.
  • Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time.
  • Identifying and fixing gaps in the availability of systems.
  • Improving and defending the security of software and systems.
  • Documenting and diagramming processes, procedures, and best practices.
  • Finding, learning, improving, or creating new tools that are reliable, usable, and helpful.
  • Mentoring, training, and reviewing more junior engineers.
  • Participating in an on-call rotation for 24/7 production reliability support.

Leadership, CI/CD, Mentoring

Posted 2024-11-21

📍 Poland

🔍 Software

Posted 2024-11-21

🧭 Full-Time

πŸ” Software / SaaS

  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
  • Experience in a cloud-based SaaS environment.
  • Strong expertise in designing and implementing disaster recovery solutions using industry-leading technologies and methodologies.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code (IaC) tools such as Terraform or CloudFormation.
  • Excellent communication skills with the ability to effectively collaborate with cross-functional teams and communicate technical concepts to non-technical stakeholders.

  • Design, implement, and maintain disaster recovery solutions for cloud-based SaaS environments.
  • Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
  • Conduct drills and exercises to test and validate the effectiveness of these plans.
  • Collaborate with engineering, operations, and security teams to identify and mitigate potential risks to system availability and data integrity.
  • Monitor system performance and health metrics; proactively identify areas for improvement.
  • Implement preventive measures to enhance system reliability and resilience.
  • Participate in incident response and post-incident reviews; analyze root causes of failures.
  • Implement corrective actions to prevent recurrence.
Posted 2024-11-21

📍 USA

💸 170000 - 190000 USD per year

🔍 Email Security

🏢 Company: Valimail

  • 5+ years of experience building and maintaining highly available relational databases
  • Work collaboratively with cross-functional teams
  • Value team success over individual success
  • Put industry and engineering best practices into practice and promote them to others
  • Passion for reliable, scalable, and performant datastores with a strong sense of ownership
  • Experience building and supporting highly performant and highly reliable datastores
  • Deep experience working with Postgres
  • Expert in database fundamentals, SQL, PL/pgSQL (or other)
  • Experience with NoSQL datastores and caching solutions
  • Working knowledge of AWS or Azure cloud providers
  • Experience with Infrastructure-as-Code tools, such as Terraform

  • Evangelizing standard methodologies for building and operating highly reliable data storage systems
  • Serving as the subject matter expert in datastore design and performance
  • Building and supporting Valimail's mission-critical datastores
  • Conducting timely post mortems of production datastore incidents
  • Collaboratively designing systems with other engineers to meet reliability, scalability, and performance requirements
  • Providing assistance to teams working with datastores
  • Automating routine database tasks
  • Participating in on-call rotation and incident response
  • Upgrading data storage systems as necessary

AWS, SQL, Azure, Postgres, NoSQL, Terraform

Posted 2024-11-20

🧭 Full-Time

πŸ” Software Development

  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
  • Strong expertise in designing and implementing disaster recovery solutions using leading technologies.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code (IaC) tools like Terraform or CloudFormation.
  • Excellent communication skills for collaboration with cross-functional teams and non-technical stakeholders.

  • Design, implement, and maintain disaster recovery solutions for a cloud-based SaaS environment.
  • Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
  • Conduct drills and exercises to validate the effectiveness of disaster recovery plans.
  • Collaborate with engineering, operations, and security teams to identify and mitigate risks.
  • Proactively monitor system performance and health metrics, and implement preventive measures.
  • Participate in incident response and post-incident reviews to analyze root causes and implement corrective actions.
Posted 2024-11-20

📍 Canada

🔍 Software Supply Chain Management

🏢 Company: FOSSA

  • Strong, demonstrated experience as a technical lead designing, building, and maintaining scalable infrastructure and tooling.
  • Strong knowledge of at least one cloud platform and maintaining managed services (we use AWS).
  • Strong experience implementing Infrastructure as Code using Terraform, Helm, and Kubernetes.
  • Experience building and maintaining build pipelines, deploying new services, and familiarity with CI/CD tools such as Buildkite, CircleCI, and GitHub Actions.
  • Experience with logging and monitoring tools such as Datadog, StatsD, Prometheus, and Grafana.
  • Experience with packaging and deploying services using Docker on Linux.
  • Ability to break down complex problems, troubleshoot, drive towards a solution, and communicate it with the team and stakeholders.
  • Willingness to accept feedback and incorporate it into work.
  • Experience with source control tooling and processes, including branching, merging, and rebasing (we use git).
  • Willingness to take part in an on-call rotation.

  • Scale cloud infrastructure to meet increasing demand.
  • Assist development teams in deploying new services.
  • Ensure platform security and adherence to best practices.
  • Improve development tools, CI/CD pipelines, monitoring, and release processes.
  • Help teams use Helm and Kubernetes, and shape best practices.
  • Build access control and secret management solutions.
  • Maintain deployments for on-premise customers.

AWS, Docker, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Terraform

Posted 2024-11-17

📍 Netherlands

🔍 Creative Technology

🏢 Company: Creative Fabrica

  • 4+ years operating and supporting a high-volume, high-performance, cloud-native distributed computing environment.
  • Proven experience with Terraform, containers, and monitoring solutions.
  • Experience with a wide array of AWS-based services (EC2, ECS/EKS, S3, RDS, ALB, MSK, DynamoDB, Redshift, etc.).
  • Experience supporting and deploying applications and microservices written in Go, Python, and PHP.
  • Experience with driving DevOps practices and developing automation solutions in a continuous deployment environment.
  • Experience with Kubernetes and Kafka is highly preferred.

  • Improve our site infrastructure to keep up with the company's fast growth and technology evolution.
  • Proactively monitor the infrastructure and propose improvements.
  • Lead the design and building of a fully automated, developer self-service platform.
  • Research, develop and implement infrastructure management standards across our cloud accounts (AWS).
  • Participate in pre-production and production site releases.
  • Participate in the on-call rotation and in the debugging of issues.

AWS, PHP, Python, DynamoDB, Kafka, Kubernetes, TypeScript, Go, DevOps, Terraform, Microservices

Posted 2024-11-16

📍 U.S.

🧭 Full-Time

💸 140000 - 160000 USD per year

🔍 Cybersecurity / Open source software

  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise with multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency with at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi).
  • Proficiency using source control such as Git.
  • Ability to maintain discretion and handle sensitive information.
  • Staying current with trends and new technologies.
  • Collaborative and adaptable mindset.
  • Excellent communication skills.
  • Strong problem-solving skills.

  • Take ownership of the Bitwarden cloud infrastructure, focusing on quality.
  • Evaluate infrastructure regularly, making recommendations for reliability, security, availability, scalability, and cost management.
  • Implement site reliability tools and observability systems.
  • Respond to outages and participate in a 24x7 support strategy.
  • Contribute to architectural designs and engineering operations at scale.
  • Engage in code reviews and spread technical knowledge.
  • Contribute to incident management processes.
  • Collaborate with teams to refine priorities and deliverables.
  • Align SLIs, SLOs, and SLAs with product owners.
  • Identify opportunities for new initiatives.
  • Influence the SDLC as Bitwarden scales.
  • Mentor team members.

Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Posted 2024-11-16

📍 Canada

🧭 Full-Time

🔍 InsurTech

  • Extensive experience in infrastructure security, monitoring, release engineering, and developer tooling.
  • Ability to coach and mentor less experienced professionals.
  • Demonstrated leadership skills in guiding teams and improving capabilities.

  • Work with the Engineering Department to develop and provide infrastructure security, monitoring, release engineering, and developer tooling based on group-level and department-level requirements.
  • Provide guidance and leadership to DevOps chapter representatives from teams across the Engineering Department.
  • Suggest, plan, guide, and assist with the development and implementation of infrastructure to support goals.
  • Coach and mentor lower-level professionals.
  • Assist the Engineering Leadership Team in continuously improving craft capabilities.

Leadership, Mentoring, DevOps, Coaching

Posted 2024-11-15

🧭 Contract

  • Minimum of 5-7 years of experience in Site Reliability Engineering or related fields.
  • Proven experience designing and implementing fault-tolerant, scalable systems.
  • Deep understanding of reliability methodologies like DFR, FMEA, and MTBF.
  • Proficiency with tools such as Datadog, PagerDuty, Marvin, and Backstage.
  • Strong coding skills in one or more programming languages relevant to SRE.
  • Exceptional analytical skills for complex issue investigation.
  • Willingness to learn new products and tools.
  • Excellent communication skills for a distributed team environment.

  • Identify and resolve complex bugs within the codebase.
  • Enhance system reliability, scalability, and performance through code maintenance.
  • Restart services and implement necessary code changes.
  • Investigate complex system issues and develop resolutions.
  • Design and build fault-tolerant, scalable systems for high availability.
  • Apply methodologies like DFR, FMEA, and MTBF.
  • Develop and maintain reliability standards and documentation.
Posted 2024-11-12