Site Reliability Engineer (SRE)

Posted 2024-10-29

📍 Location: Europe

🔍 Industry: Technology

🏢 Company: Flip GmbH

🗣️ Languages: English

🪄 Skills: AWS, Docker, Python, Software Development, Agile, GCP, Kubernetes, SCRUM, Azure, CI/CD

Requirements:

Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
Deep knowledge of Kubernetes and container solutions.
Interest in observability tools and concepts such as SLOs and error budgets.
Good knowledge of software development (e.g., Go, Python, Kotlin).
Business-level fluency in English.

Responsibilities:

Help scale the cloud infrastructure and Kubernetes clusters.
Ensure zero downtime with effective rollout, redundancy, migration strategies, and rollback mechanisms.
Develop and optimize the LGTM stack and analyze SLOs.
Enhance operational safety and resilience of systems.
Design, develop and optimize production, development, and cloud infrastructure with Pulumi in Go.
Increase development efficiency and optimize code deployments through effective tools and processes.
Improve the CI/CD pipeline for faster feedback cycles and secure rollouts.

Related Jobs

🔥 Site Reliability Engineer (SRE)

Posted 2024-11-21

📍 Portugal

🔍 Vertical AI SaaS solutions

🏢 Company: Intapp

Hands-on experience in building fault-tolerant and scalable systems.
Experience with different database technologies such as SQL Server, Postgres, NoSQL.
Expertise in Configuration Management and CI/CD tools such as Ansible, Jenkins, and Azure DevOps.
Hands-on experience with Azure building and running production workloads.
Strong scripting abilities in Python, Perl, Go, or JVM-based languages.
Solid understanding of continuous integration, deployment and operations concepts.
Production experience managing Windows infrastructure running IIS workloads.
Passion for resolving reliability issues and strategies to mitigate future issues.
Automation mindset - if you can automate it, do it.

Work with Development and Product Management to design and deliver new functionality.
Perform deep dives into both systemic and latent reliability issues; partner with software engineers across the organization to produce and roll out fixes.
Drive standardization efforts across multiple disciplines and services in conjunction with SREs throughout the organization.
Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services.
Work in an agile operations framework, balancing sprint-based work with daily operations needs.
Participate in a 24x7 on-call rotation with 12-hour shifts.

Python, SQL, Agile, Jenkins, JVM, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps


🔥 Site Reliability Engineer (SRE)

Posted 2024-11-21

📍 Portugal

🔍 Vertical AI SaaS solutions

🏢 Company: Intapp

Hands-on experience in building fault-tolerant and scalable systems.
Experience with database technologies such as SQL Server, Postgres, and NoSQL.
Expertise in Configuration Management and CI/CD tools like Ansible, Jenkins, and Azure DevOps.
Hands-on experience with Azure in building and running production workloads.
Strong scripting abilities in Python, Perl, Go, or JVM-based languages.
Solid understanding of continuous integration, deployment, and operations concepts.
Production experience managing Windows infrastructure running IIS workloads.
Passion for resolving reliability issues and automating processes.

Work with Development and Product Management to design and deliver new functionality.
Perform deep dives into systemic and latent reliability issues while collaborating with software engineers.
Drive standardization efforts across multiple disciplines and services with SREs.
Identify and drive opportunities to improve automation for deployment and management of services.
Work in an agile operations framework, balancing sprint-based work with daily operations needs.
Participate in a 24x7 on-call rotation.

Python, SQL, Agile, Jenkins, JVM, Product Management, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps


🔥 Senior Site Reliability Engineer (SRE) - LATAM (Remote)

Posted 2024-11-10

📍 LATAM

🔍 AI developer tools

NOT STATED

Report to the Enterprise Engineering Manager.
Responsible for setting up and maintaining infrastructure standards.
Play a pivotal role in tool development externally and internally.
Enable deployment of software to enterprise customers.
Establish robust technical excellence for a diversified customer base.
Manage variances in infrastructure types and implement suitable solutions.
Provide high-quality solutions to customers.

Leadership, Cloud Computing, Git, Kubernetes, Cross-functional Team Leadership, Communication Skills, Analytical Skills


🔥 Site Reliability Engineer (SRE) (m/w/d)

Posted 2024-11-07

📍 Germany and within Europe

🧭 Full-Time

🔍 Technology / Employee Communication

🏢 Company: Flip App

Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
Deep knowledge of Kubernetes and container solutions.
Interest in observability tools such as Prometheus, VictoriaMetrics, Mimir, Loki, ELK.
Familiarity with SLOs, error budgets, and Apdex.
Good knowledge of software development languages like Go, Python, Kotlin.
Business-level fluency in English; German is a plus.
Experience with infrastructure as code tools (e.g., Pulumi, OpenTofu) and automation tools (e.g., Ansible, Chef).

Ensure the availability, performance, and scalability of the infrastructure.
Promote practices like CI/CD, observability, and developer experience.
Shape goals for scalable systems and observability.
Expand the cloud infrastructure and Kubernetes cluster.
Ensure resilience and safety through zero-downtime rollouts.
Create observability through the further development of the LGTM stack.
Design, develop, and optimize infrastructure as code using Pulumi in Go.

AWS, Python, Software Development, GCP, Kotlin, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD


🔥 Senior Site Reliability Engineer (SRE)

Posted 2024-11-07

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

Proficiency in programming languages such as Python, Go, and JavaScript.
5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
Strong understanding of Linux/Unix systems and networking.
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
Willingness to collaborate and share knowledge with colleagues.
Ability to take responsibility for work and demonstrate accountability.

Develop and maintain monitoring and alerting solutions.
Respond to incidents, troubleshoot issues, and perform root cause analysis.
Automate repetitive tasks and improve deployment processes.
Develop and maintain tools to support infrastructure and applications.
Analyze system performance and implement optimizations to improve efficiency and reduce latency.
Ensure systems are secure and compliant with relevant standards and regulations.
Maintain comprehensive documentation of systems and processes.
Share knowledge and best practices with team members.
Ensure the reliability, performance, and scalability of databases.
Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD


🔥 Staff Site Reliability Engineer (SRE) - Poland

Posted 2024-10-03

📍 Poland

🔍 IT and Security

🏢 Company: Cribl

👥 251-500

💰 $150.0m Series D on 2022-05-24

Real Time, Big Data, Information Technology, Software

Extensive experience with enterprise scale continuous delivery environments.
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
Experience with sustainable incident response in a blameless environment.
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible.
Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies.
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry.
Background in Linux Systems Engineering.
Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless.
Comfortable with a high level of autonomy and working with a distributed team.

Engage with teams and improve service delivery and reliability across their entire lifecycle.
Measure and monitor all production systems with an eye towards availability, latency and overall system health.
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence.
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability.
Help identify and drive down toil with creative innovation and automation.
On-call responsibilities.

AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, Linux, Terraform


🔥 OnCall Site Reliability Engineer (SRE)

Posted 2024-09-20

📍 Slovakia

🔍 IGaming

🏢 Company: GoReel

Understanding of cloud infrastructure and container orchestration.
Familiarity with monitoring and logging tools.
Strong problem-solving skills and attention to detail.
Ability to work effectively in a team environment.
Excellent communication skills.

Monitor and maintain the health of systems, ensuring high availability and performance.
Respond to incidents and troubleshoot issues in a timely manner.
Collaborate with development and operations teams to implement improvements and optimize system performance.
Create and maintain documentation for incident response and system maintenance procedures.
Participate in on-call rotations to provide 24/7 support.

AWS, PostgreSQL, Elasticsearch, Kafka, Kubernetes, Cassandra, Grafana, Communication Skills


🔥 Senior Site Reliability Engineer (SRE) - Poland

Posted 2024-08-29

📍 Poland

🧭 Full-Time

🔍 Data observability and IT Security

🏢 Company: Cribl

👥 251-500

💰 $150.0m Series D on 2022-05-24

Real Time, Big Data, Information Technology, Software

Extensive experience with enterprise scale continuous delivery environments.
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
Experience with sustainable incident response in a blameless environment.
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible.
Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies.
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry.
Background in Linux Systems Engineering.
Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless.
Comfortable with a high level of autonomy and working with a distributed team.

Engage with teams and improve service delivery and reliability across their entire lifecycle.
Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence.
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability.
Help identify and drive down toil with creative innovation and automation.
On-call responsibilities.

AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, Linux


🔥 Site Reliability Engineer (SRE) - EMEA

Posted 2024-07-11

📍 EMEA

🔍 Blockchain

Proven experience in an independent contributor role with cloud platform technologies (AWS, GCP, Azure, etc.).
Proficiency in scripting and programming languages such as Python, Golang, or TypeScript.
Experience with container technologies and microservices architecture (e.g., Docker, Kubernetes).
Hands-on experience with monitoring tools like Prometheus, Grafana, ELK stack.
Excellent problem-solving skills and ability to troubleshoot complex issues independently.
Strong understanding of Linux/Unix systems administration and networking concepts.
Strong communication and collaboration skills for effective work in cross-functional teams.

Collaborate with software engineering teams to design scalable, highly available, and resilient systems.
Develop automation tools and scripts for deployment, monitoring, and incident response.
Configure monitoring systems to proactively detect issues and define alerting procedures.
Respond to critical incidents, conduct root cause analysis, and implement preventive measures.
Analyze performance metrics to identify bottlenecks and propose optimizations.
Implement best practices for security and compliance through collaboration with security teams.
Document system configurations and share knowledge with team members.

AWS, Docker, Python, Kubernetes, Golang, Grafana, Prometheus

