Site Reliability Engineer (SRE)

Posted 2024-11-07


💎 Seniority level: Strong experience as an SRE, DevOps Engineer, or Cloud Engineer.

🔍 Industry: Tech services

🏢 Company: Techie Talent

🗣️ Languages: Advanced English

⏳ Experience: Strong experience as an SRE, DevOps Engineer, or Cloud Engineer.

🪄 Skills: Terraform

Requirements:

Strong experience as a Site Reliability Engineer, DevOps Engineer, or Cloud Engineer focusing on observability, automation, and cloud infrastructure.
Proven experience with Terraform for cloud infrastructure management.
Experience with Azure Monitor, Azure Application Insights, and Log Analytics.
Proficient in Kusto Query Language (KQL) for data analysis and monitoring.
Ability to respond to alerts, triage incidents, and ensure timely resolution.
Development and maintenance of operational runbooks and automated playbooks.
Proven ability to work closely with development, operations, and architecture teams.
Excellent communication skills and stakeholder management.
Experience in automating CI/CD pipelines for long-term deployment efficiency is valued.
Advanced English level.

Responsibilities:

Develop and maintain software solutions in a varied technology stack.
Ensure that products are functional, efficient, reliable, and scalable.
Respond to alerts and triage incidents, ensuring timely resolution.
Create and maintain operational runbooks and automated playbooks.


Related Jobs


🔥 Site Reliability Engineer (SRE)

Posted 2024-11-21

📍 Portugal

🔍 Vertical AI SaaS solutions

🏢 Company: Intapp

🔧 Requirements

Hands-on experience in building fault-tolerant and scalable systems.
Experience with different database technologies such as SQL Server, Postgres, NoSQL.
Expertise in Configuration Management and CI/CD tools such as Ansible and Jenkins, Azure DevOps.
Hands-on experience with Azure building and running production workloads.
Strong scripting abilities in Python, Perl, Go, or JVM-based languages.
Solid understanding of continuous integration, deployment and operations concepts.
Production experience managing Windows infrastructure running IIS workloads.
Passion for resolving reliability issues and strategies to mitigate future issues.
Automation mindset - if you can automate it, do it.

💡 Responsibilities

Work with Development and Product Management to design and deliver new functionality.
Perform deep dives into both systemic and latent reliability issues; partner with software engineers across the organization to produce and roll out fixes.
Drive standardization efforts across multiple disciplines and services in conjunction with SREs throughout the organization.
Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services.
Work in an agile operations framework, balancing sprint-based work with daily operations needs.
Participate in a 24x7 on-call rotation with 12-hour shifts.

🪄 Skills: Python, SQL, Agile, Jenkins, JVM, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps


🔥 Site Reliability Engineer (SRE)

Posted 2024-11-21

📍 Portugal

🔍 Vertical AI SaaS solutions

🏢 Company: Intapp

🔧 Requirements

Hands-on experience in building fault-tolerant and scalable systems.
Experience with database technologies such as SQL Server, Postgres, and NoSQL.
Expertise in Configuration Management and CI/CD tools like Ansible, Jenkins, and Azure DevOps.
Hands-on experience with Azure in building and running production workloads.
Strong scripting abilities in languages like Python, Perl, Go, or JVM-based languages.
Solid understanding of continuous integration, deployment, and operations concepts.
Production experience managing Windows infrastructure running IIS workloads.
Passion for resolving reliability issues and automating processes.

💡 Responsibilities

Work with Development and Product Management to design and deliver new functionality.
Perform deep dives into systemic and latent reliability issues while collaborating with software engineers.
Drive standardization efforts across multiple disciplines and services with SREs.
Identify and drive opportunities to improve automation for deployment and management of services.
Work in an agile operations framework, balancing sprint-based work with daily operations needs.
Participate in a 24x7 on-call rotation.

🪄 Skills: Python, SQL, Agile, Jenkins, JVM, Product Management, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps


🔥 Senior Site Reliability Engineer (SRE) - Disaster Recovery Specialist (m/f/x)

Posted 2024-11-21

🧭 Full-Time

🔍 Software / SaaS

🔧 Requirements

Degree in Computer Science, Information Technology, or a related field.
5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
Experience in a cloud-based SaaS environment.
Strong expertise in designing and implementing disaster recovery solutions using industry-leading technologies and methodologies.
Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
Experience with infrastructure as code (IaC) tools such as Terraform or CloudFormation.
Excellent communication skills with the ability to effectively collaborate with cross-functional teams and communicate technical concepts to non-technical stakeholders.

💡 Responsibilities

Design, implement, and maintain disaster recovery solutions for cloud-based SaaS environments.
Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
Conduct drills and exercises to test and validate the effectiveness of these plans.
Collaborate with engineering, operations, and security teams to identify and mitigate potential risks to system availability and data integrity.
Monitor system performance and health metrics; proactively identify areas for improvement.
Implement preventive measures to enhance system reliability and resilience.
Participate in incident response and post-incident reviews; analyze root causes of failures.
Implement corrective actions to prevent recurrence.


🔥 Senior Site Reliability Engineer (SRE) - Disaster Recovery Specialist (m/f/x)

Posted 2024-11-20

🧭 Full-Time

🔍 Software Development

🔧 Requirements

Degree in Computer Science, Information Technology, or a related field.
5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
Strong expertise in designing and implementing disaster recovery solutions using leading technologies.
Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
Experience with infrastructure as code (IaC) tools like Terraform or CloudFormation.
Excellent communication skills for collaboration with cross-functional teams and non-technical stakeholders.

💡 Responsibilities

Design, implement, and maintain disaster recovery solutions for a cloud-based SaaS environment.
Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
Conduct drills and exercises to validate the effectiveness of disaster recovery plans.
Collaborate with engineering, operations, and security teams to identify and mitigate risks.
Proactively monitor system performance and health metrics, and implement preventive measures.
Participate in incident response and post-incident reviews to analyze root causes and implement corrective actions.


🔥 Senior Site Reliability Engineer (SRE)

Posted 2024-11-12

🧭 Contract

🔧 Requirements

Minimum of 5-7 years of experience in Site Reliability Engineering or related fields.
Proven experience designing and implementing fault-tolerant, scalable systems.
Deep understanding of reliability methodologies like DFR, FMEA, and MTBF.
Proficiency with tools such as Datadog, PagerDuty, Marvin, and Backstage.
Strong coding skills in one or more programming languages relevant to SRE.
Exceptional analytical skills for complex issue investigation.
Willingness to learn new products and tools.
Excellent communication skills for a distributed team environment.

💡 Responsibilities

Identify and resolve complex bugs within the codebase.
Enhance system reliability, scalability, and performance through code maintenance.
Restart services and implement necessary code changes.
Investigate complex system issues and develop resolutions.
Design and build fault-tolerant, scalable systems for high availability.
Apply methodologies like DFR, FMEA, and MTBF.
Develop and maintain reliability standards and documentation.


🔥 Senior Site Reliability Engineer (SRE) - LATAM (Remote)

Posted 2024-11-10

📍 LATAM

🔍 AI developer tools

🔧 Requirements

Not stated.

💡 Responsibilities

Report to the Enterprise Engineering Manager.
Responsible for setting up and maintaining infrastructure standards.
Play a pivotal role in developing both external and internal tools.
Enable deployment of software to enterprise customers.
Establish robust technical excellence for a diversified customer base.
Manage variances in infrastructure types and implement suitable solutions.
Provide high-quality solutions to customers.

🪄 Skills: Leadership, Cloud Computing, Git, Kubernetes, Cross-functional Team Leadership, Communication Skills, Analytical Skills


🔥 Senior Site Reliability Engineer (SRE)

Posted 2024-11-07

📍 US

🧭 Full-Time

💸 198,000 - 220,000 USD per year

🔍 Blockchain, Cryptocurrency

🏢 Company: Uniswap Labs

🔧 Requirements

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
5+ years of experience in site reliability engineering, DevOps, or related fields.
Strong understanding of reliability engineering principles and tools.
Proficiency in monitoring tools like Prometheus, Grafana, Nagios.
Experience with cloud platforms (AWS, Azure, GCP) and container and orchestration technologies (Docker, Kubernetes).
Proficiency in scripting and automation tools such as Python, Bash, Ansible, or Terraform.

💡 Responsibilities

Design, implement, and maintain systems for reliability, availability, and performance of services.
Develop and manage monitoring, alerting, and incident response strategies.
Conduct root cause analysis of failures.
Collaborate with cross-functional teams on reliability practices.
Drive improvements and innovations in systems and processes.

🪄 Skills: AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Collaboration, CI/CD, DevOps


🔥 Site Reliability Engineer (SRE) (m/w/d)

Posted 2024-11-07

📍 Germany and within Europe

🧭 Full-Time

🔍 Technology / Employee Communication

🏢 Company: Flip App

🔧 Requirements

Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
Deep knowledge of Kubernetes and container solutions.
Interest in observability tools such as Prometheus, VictoriaMetrics, Mimir, Loki, ELK.
Familiarity with SLO, error budget, and Apdex.
Good knowledge of software development languages like Go, Python, Kotlin.
Business fluent in English; German is a plus.
Experience with infrastructure as code tools (e.g., Pulumi, OpenTofu) and automation tools (e.g., Ansible, Chef).

💡 Responsibilities

Ensure the availability, performance, and scalability of the infrastructure.
Promote practices like CI/CD, observability, and developer experience.
Shape goals for scalable systems and observability.
Expand cloud infrastructure and Kubernetes cluster.
Ensure resilience and safety through zero-downtime rollouts.
Create observability through the further development of the LGTM stack.
Design, develop, and optimize infrastructure as code using Pulumi in Go.

🪄 Skills: AWS, Python, Software Development, GCP, Kotlin, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD


🔥 Senior Site Reliability Engineer - SRE - 12 months rolling contract

Posted 2024-11-07

📍 America

🧭 Contract

🔍 Digital paper solutions and learning ecosystem

🏢 Company: Goodnotes

🔧 Requirements

Strong experience working in AWS-hosted environments.
Experience supporting production workloads and firefighting.
Knowledge of SRE best practices and common issues.
Proficient with system monitoring tools.
Understanding and experience with distributed databases.
Background in Linux and Networking fundamentals.
Experience in back-end development, including API usage and creation.
Knowledge of Security for networks and containers.
Understanding of container orchestration, especially Kubernetes.
Experience managing relational and non-relational databases, including backup and restore operations.
Familiarity with automation/configuration management tools, preferably CDK and/or Terraform.

💡 Responsibilities

Design, build, and maintain the Goodnotes infrastructure according to Dickerson’s Hierarchy of Reliability.
Refine and execute new and existing playbooks.
Educate teams on SRE best practices including design and capacity planning.
Act as a higher-level escalation point for applications.
Optimize latency and error rates and improve SLAs.
Enhance system monitoring, health reporting, and logging.
Implement security practices and maintain information security.
Participate in the on-call rotation during Americas time zone hours.

🪄 Skills: Linux


🔥 Senior Site Reliability Engineer (SRE)

Posted 2024-11-07

🧭 Full-Time

🔍 Blockchain and Financial Technology

🏢 Company: Core Scientific

🔧 Requirements

5+ years’ experience in SRE, DevOps, and/or Infrastructure Engineering.
Excellent communication and interpersonal skills.
Strong analytical and troubleshooting skills.
Experience with Infrastructure as Code, Configuration Management, & Orchestration tools such as Terraform, Helm, Kustomize, and Ansible.
Understanding of cloud environments, primarily AWS.
Experience with Kubernetes and virtualization technologies.
Proficiency in build and release management with tools like GitHub Actions.
Understanding of telemetry including metrics, logs, and traces.
Intermediate scripting skills in Bash, Python, and Make.
Basic knowledge of networking protocols.

💡 Responsibilities

Define, capture, and interpret product/system requirements.
Build, integrate, test, monitor, and deploy code across cloud and on-premises infrastructure.
Write plans, coordinate, and automate application deployment.
Document processes and share knowledge with the team.
Promote secure, immutable infrastructure through best practices.
Encourage effective communication within the team and across the organization.
Perform additional duties as assigned.

