Site Reliability Engineer (SRE)

Posted 2024-11-07

💎 Seniority level: Strong experience as an SRE, DevOps Engineer, or Cloud Engineer.

🔍 Industry: Tech services

🏢 Company: Techie Talent

🗣️ Languages: Advanced English

⏳ Experience: Strong experience as an SRE, DevOps Engineer, or Cloud Engineer.

🪄 Skills: Terraform

Requirements:
  • Strong experience as a Site Reliability Engineer, DevOps Engineer, or Cloud Engineer focusing on observability, automation, and cloud infrastructure.
  • Proven experience with Terraform for cloud infrastructure management.
  • Experience with Azure Monitor, Azure Application Insights, and Log Analytics.
  • Proficient in Kusto Query Language (KQL) for data analysis and monitoring.
  • Ability to respond to alerts, triage incidents, and ensure timely resolution.
  • Experience developing and maintaining operational runbooks and automated playbooks.
  • Proven ability to work closely with development, operations, and architecture teams.
  • Excellent communication skills and stakeholder management.
  • Experience in automating CI/CD pipelines for long-term deployment efficiency is valued.
  • Advanced English level.
Responsibilities:
  • Develop and maintain software solutions in a varied technology stack.
  • Ensure that products are functional, efficient, reliable, and scalable.
  • Respond to alerts and triage incidents, ensuring timely resolution.
  • Create and maintain operational runbooks and automated playbooks.

Related Jobs

📍 Portugal

🔍 Vertical AI SaaS solutions

🏢 Company: Intapp

  • Hands-on experience in building fault-tolerant and scalable systems.
  • Experience with different database technologies such as SQL Server, Postgres, NoSQL.
  • Expertise in Configuration Management and CI/CD tools such as Ansible, Jenkins, and Azure DevOps.
  • Hands-on experience with Azure building and running production workloads.
  • Strong scripting abilities in Python, Perl, Go, or JVM-based languages.
  • Solid understanding of continuous integration, deployment and operations concepts.
  • Production experience managing Windows infrastructure running IIS workloads.
  • Passion for resolving reliability issues and strategies to mitigate future issues.
  • Automation mindset - if you can automate it, do it.

  • Work with Development and Product Management to design and deliver new functionality.
  • Perform deep dives into both systemic and latent reliability issues; partner with software engineers across the organization to produce and roll out fixes.
  • Drive standardization efforts across multiple disciplines and services in conjunction with SREs throughout the organization.
  • Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services.
  • Work in an agile operations framework, balancing sprint-based work with daily operations needs.
  • Participate in a 24x7 on-call rotation with 12-hour shifts.

Python, SQL, Agile, Jenkins, JVM, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps

Posted 2024-11-21

📍 Portugal

🔍 Vertical AI SaaS solutions

🏢 Company: Intapp

  • Hands-on experience in building fault-tolerant and scalable systems.
  • Experience with database technologies such as SQL Server, Postgres, and NoSQL.
  • Expertise in Configuration Management and CI/CD tools like Ansible, Jenkins, and Azure DevOps.
  • Hands-on experience with Azure in building and running production workloads.
  • Strong scripting abilities in languages like Python, Perl, Go, or JVM-based languages.
  • Solid understanding of continuous integration, deployment, and operations concepts.
  • Production experience managing Windows infrastructure running IIS workloads.
  • Passion for resolving reliability issues and automating processes.

  • Work with Development and Product Management to design and deliver new functionality.
  • Perform deep dives into systemic and latent reliability issues while collaborating with software engineers.
  • Drive standardization efforts across multiple disciplines and services with SREs.
  • Identify and drive opportunities to improve automation for deployment and management of services.
  • Work in an agile operations framework, balancing sprint-based work with daily operations needs.
  • Participate in a 24x7 on-call rotation.

Python, SQL, Agile, Jenkins, JVM, Product Management, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps

Posted 2024-11-21

🧭 Full-Time

πŸ” Software / SaaS

  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
  • Experience in a cloud-based SaaS environment.
  • Strong expertise in designing and implementing disaster recovery solutions using industry-leading technologies and methodologies.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code (IaC) tools such as Terraform or CloudFormation.
  • Excellent communication skills with the ability to effectively collaborate with cross-functional teams and communicate technical concepts to non-technical stakeholders.

  • Design, implement, and maintain disaster recovery solutions for cloud-based SaaS environments.
  • Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
  • Conduct drills and exercises to test and validate the effectiveness of these plans.
  • Collaborate with engineering, operations, and security teams to identify and mitigate potential risks to system availability and data integrity.
  • Monitor system performance and health metrics; proactively identify areas for improvement.
  • Implement preventive measures to enhance system reliability and resilience.
  • Participate in incident response and post-incident reviews; analyze root causes of failures.
  • Implement corrective actions to prevent recurrence.
Posted 2024-11-21

🧭 Full-Time

πŸ” Software Development

  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
  • Strong expertise in designing and implementing disaster recovery solutions using leading technologies.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code (IaC) tools like Terraform or CloudFormation.
  • Excellent communication skills for collaboration with cross-functional teams and non-technical stakeholders.

  • Design, implement, and maintain disaster recovery solutions for a cloud-based SaaS environment.
  • Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
  • Conduct drills and exercises to validate the effectiveness of disaster recovery plans.
  • Collaborate with engineering, operations, and security teams to identify and mitigate risks.
  • Proactively monitor system performance and health metrics, and implement preventive measures.
  • Participate in incident response and post-incident reviews to analyze root causes and implement corrective actions.
Posted 2024-11-20

🧭 Contract

  • At least 5-7 years of experience in Site Reliability Engineering or related fields.
  • Proven experience designing and implementing fault-tolerant, scalable systems.
  • Deep understanding of reliability methodologies like DFR, FMEA, and MTBF.
  • Proficiency with tools such as DataDog, PagerDuty, Marvin, Backstage.
  • Strong coding skills in one or more programming languages relevant to SRE.
  • Exceptional analytical skills for complex issue investigation.
  • Willingness to learn new products and tools.
  • Excellent communication skills for a distributed team environment.

  • Identify and resolve complex bugs within the codebase.
  • Enhance system reliability, scalability, and performance through code maintenance.
  • Restart services and implement necessary code changes.
  • Investigate complex system issues and develop resolutions.
  • Design and build fault-tolerant, scalable systems for high availability.
  • Apply methodologies like DFR, FMEA, and MTBF.
  • Develop and maintain reliability standards and documentation.
Posted 2024-11-12

📍 LATAM

🔍 AI developer tools

NOT STATED

  • Report to the Enterprise Engineering Manager.
  • Responsible for setting up and maintaining infrastructure standards.
  • Play a pivotal role in tool development externally and internally.
  • Enable deployment of software to enterprise customers.
  • Establish robust technical excellence for a diversified customer base.
  • Manage variances in infrastructure types and implement suitable solutions.
  • Provide high-quality solutions to customers.

Leadership, Cloud Computing, Git, Kubernetes, Cross-functional Team Leadership, Communication Skills, Analytical Skills

Posted 2024-11-10

📍 US

🧭 Full-Time

💸 198000 - 220000 USD per year

🔍 Blockchain, Cryptocurrency

🏢 Company: Uniswap Labs

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in site reliability engineering, DevOps, or related fields.
  • Strong understanding of reliability engineering principles and tools.
  • Proficiency in monitoring tools like Prometheus, Grafana, Nagios.
  • Experience with cloud platforms (AWS, Azure, GCP) and container and orchestration technologies (Docker, Kubernetes).
  • Proficiency in scripting and automation tools such as Python, Bash, Ansible, or Terraform.

  • Design, implement, and maintain systems for reliability, availability, and performance of services.
  • Develop and manage monitoring, alerting, and incident response strategies.
  • Conduct root cause analysis of failures.
  • Collaborate with cross-functional teams on reliability practices.
  • Drive improvements and innovations in systems and processes.

AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Collaboration, CI/CD, DevOps

Posted 2024-11-07

📍 Germany and within Europe

🧭 Full-Time

🔍 Technology / Employee Communication

🏢 Company: Flip App

  • Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
  • Deep knowledge of Kubernetes and container solutions.
  • Interest in observability tools such as Prometheus, VictoriaMetrics, Mimir, Loki, ELK.
  • Familiarity with SLO, error budget, and Apdex.
  • Good knowledge of software development languages like Go, Python, Kotlin.
  • Business fluent in English; German is a plus.
  • Experience with infrastructure as code tools (e.g., Pulumi, OpenTofu) and automation tools (e.g., Ansible, Chef).

  • Ensure the availability, performance, and scalability of the infrastructure.
  • Promote practices like CI/CD, observability, and developer experience.
  • Shape goals for scalable systems and observability.
  • Expand cloud infrastructure and Kubernetes cluster.
  • Ensure resilience and safety through zero-downtime rollouts.
  • Create observability through the further development of the LGTM stack.
  • Design, develop, and optimize infrastructure as code using Pulumi in Go.

AWS, Python, Software Development, GCP, Kotlin, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD

Posted 2024-11-07

📍 America

🧭 Contract

🔍 Digital paper solutions and learning ecosystem

🏢 Company: Goodnotes

  • Strong experience working in AWS-hosted environments.
  • Experience supporting production workloads and firefighting.
  • Knowledge of SRE best practices and common issues.
  • Proficient with system monitoring tools.
  • Understanding and experience with distributed databases.
  • Background in Linux and Networking fundamentals.
  • Experience in back-end development, including API usage and creation.
  • Knowledge of Security for networks and containers.
  • Understanding of container orchestration, especially Kubernetes.
  • Experience managing relational and non-relational databases, including backup and restore operations.
  • Familiarity with automation/configuration management tools, preferably CDK and/or Terraform.

  • Design, build, and maintain the Goodnotes infrastructure according to Dickerson’s Hierarchy of Reliability.
  • Refine and execute new and existing playbooks.
  • Educate teams on SRE best practices including design and capacity planning.
  • Act as a higher-level escalation point for applications.
  • Optimize latency and error rates and improve SLAs.
  • Enhance system monitoring, health reporting, and logging.
  • Implement security practices and maintain information security.
  • Participate in the on-call rotation during Americas time zone hours.

Linux

Posted 2024-11-07

🧭 Full-Time

πŸ” Blockchain and Financial Technology

🏒 Company: Core Scientific

  • 5+ years’ experience in SRE, DevOps, and/or Infrastructure Engineering.
  • Excellent communication and interpersonal skills.
  • Strong analytical and troubleshooting skills.
  • Experience with Infrastructure as Code, Configuration Management, & Orchestration tools such as Terraform, Helm, Kustomize, and Ansible.
  • Understanding of cloud environments, primarily AWS.
  • Experience with Kubernetes and virtualization technologies.
  • Proficiency in build and release management with tools like GitHub Actions.
  • Understanding of telemetry including metrics, logs, and traces.
  • Intermediate scripting skills in Bash, Python, and Make.
  • Basic knowledge of networking protocols.

  • Define, capture, and interpret product/system requirements.
  • Build, integrate, test, monitor, and deploy code across cloud and on-premises infrastructure.
  • Write plans, coordinate, and automate application deployment.
  • Document processes and share knowledge with the team.
  • Promote secure, immutable infrastructure through best practices.
  • Encourage effective communication within the team and across the organization.
  • Perform additional duties as assigned.
Posted 2024-11-07