Site Reliability Engineer (SRE)

Posted 2024-10-29

📍 Location: Europe

🔍 Industry: Technology

🏢 Company: Flip GmbH

🗣️ Languages: English

🪄 Skills: AWS, Docker, Python, Software Development, Agile, GCP, Kubernetes, SCRUM, Azure, CI/CD

Requirements:
  • Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
  • Deep knowledge of Kubernetes and container solutions.
  • Interest in observability tools and concepts such as SLOs and error budgets.
  • Good knowledge of software development (e.g., Go, Python, Kotlin).
  • Business-level fluency in English.
Responsibilities:
  • Help scale the cloud infrastructure and Kubernetes clusters.
  • Ensure zero downtime with effective rollout, redundancy, migration strategies, and rollback mechanisms.
  • Develop and optimize the LGTM stack and analyze SLOs.
  • Enhance operational safety and resilience of systems.
  • Design, develop and optimize production, development, and cloud infrastructure with Pulumi in Go.
  • Increase development efficiency and optimize code deployments through effective tools and processes.
  • Improve the CI/CD pipeline for faster feedback cycles and secure rollouts.

Related Jobs

📍 Portugal

🔍 Vertical AI SaaS solutions

🏢 Company: Intapp

  • Hands-on experience in building fault-tolerant and scalable systems.
  • Experience with different database technologies such as SQL Server, Postgres, NoSQL.
  • Expertise in configuration management and CI/CD tools such as Ansible, Jenkins, and Azure DevOps.
  • Hands-on experience building and running production workloads on Azure.
  • Strong scripting abilities in Python, Perl, Go, or JVM-based languages.
  • Solid understanding of continuous integration, deployment and operations concepts.
  • Production experience managing Windows infrastructure running IIS workloads.
  • Passion for resolving reliability issues and for strategies that mitigate future ones.
  • Automation mindset: if you can automate it, do it.

  • Work with Development and Product Management to design and deliver new functionality.
  • Perform deep dives into both systemic and latent reliability issues; partner with software engineers across the organization to produce and roll out fixes.
  • Drive standardization efforts across multiple disciplines and services in conjunction with SREs throughout the organization.
  • Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services.
  • Work in an agile operations framework, balancing sprint-based work with daily operations needs.
  • Participate in a 24x7 on-call rotation with 12-hour shifts.

Skills: Python, SQL, Agile, Jenkins, JVM, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps

Posted 2024-11-21

📍 Portugal

🔍 Vertical AI SaaS solutions

🏢 Company: Intapp

  • Hands-on experience in building fault-tolerant and scalable systems.
  • Experience with database technologies such as SQL Server, Postgres, and NoSQL.
  • Expertise in configuration management and CI/CD tools such as Ansible, Jenkins, and Azure DevOps.
  • Hands-on experience building and running production workloads on Azure.
  • Strong scripting abilities in Python, Perl, Go, or JVM-based languages.
  • Solid understanding of continuous integration, deployment, and operations concepts.
  • Production experience managing Windows infrastructure running IIS workloads.
  • Passion for resolving reliability issues and automating processes.

  • Work with Development and Product Management to design and deliver new functionality.
  • Perform deep dives into systemic and latent reliability issues while collaborating with software engineers.
  • Drive standardization efforts across multiple disciplines and services with SREs.
  • Identify and drive opportunities to improve automation for deployment and management of services.
  • Work in an agile operations framework, balancing sprint-based work with daily operations needs.
  • Participate in a 24x7 on-call rotation.

Skills: Python, SQL, Agile, Jenkins, JVM, Product Management, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps

Posted 2024-11-21

📍 LATAM

🔍 AI developer tools

NOT STATED

  • Report to the Enterprise Engineering Manager.
  • Responsible for setting up and maintaining infrastructure standards.
  • Play a pivotal role in developing both external and internal tools.
  • Enable deployment of software to enterprise customers.
  • Establish robust technical excellence for a diversified customer base.
  • Manage variances in infrastructure types and implement suitable solutions.
  • Provide high-quality solutions to customers.

Skills: Leadership, Cloud Computing, Git, Kubernetes, Cross-functional Team Leadership, Communication Skills, Analytical Skills

Posted 2024-11-10

📍 Germany and within Europe

🧭 Full-Time

🔍 Technology / Employee Communication

🏢 Company: Flip App

  • Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
  • Deep knowledge of Kubernetes and container solutions.
  • Interest in observability tools such as Prometheus, VictoriaMetrics, Mimir, Loki, and ELK.
  • Familiarity with SLOs, error budgets, and Apdex.
  • Good knowledge of software development languages such as Go, Python, and Kotlin.
  • Business-level fluency in English; German is a plus.
  • Experience with infrastructure as code tools (e.g., Pulumi, OpenTofu) and automation tools (e.g., Ansible, Chef).

  • Ensure the availability, performance, and scalability of the infrastructure.
  • Promote practices like CI/CD, observability, and developer experience.
  • Shape goals for scalable systems and observability.
  • Expand the cloud infrastructure and Kubernetes cluster.
  • Ensure resilience and safety through zero-downtime rollouts.
  • Create observability through the further development of the LGTM stack.
  • Design, develop, and optimize infrastructure as code using Pulumi in Go.

Skills: AWS, Python, Software Development, GCP, Kotlin, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD

Posted 2024-11-07

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

  • Proficiency in programming languages such as Python, Go, and JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.

  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

Skills: AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 2024-11-07

📍 Poland

🔍 IT and Security

🏢 Company: Cribl | 👥 251-500 | 💰 $150.0m Series D on 2022-05-24 | Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise-scale continuous delivery environments.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with sustainable incident response in a blameless environment.
  • Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible.
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies.
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry.
  • Background in Linux Systems Engineering.
  • Experience with incident-response tools such as PagerDuty, FireHydrant, and Blameless.
  • Comfortable with a high level of autonomy and working with a distributed team.

  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health.
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

Skills: AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, Linux, Terraform

Posted 2024-10-03

📍 Slovakia

🔍 IGaming

🏢 Company: GoReel

  • Understanding of cloud infrastructure and container orchestration.
  • Familiarity with monitoring and logging tools.
  • Strong problem-solving skills and attention to detail.
  • Ability to work effectively in a team environment.
  • Excellent communication skills.

  • Monitor and maintain the health of systems, ensuring high availability and performance.
  • Respond to incidents and troubleshoot issues in a timely manner.
  • Collaborate with development and operations teams to implement improvements and optimize system performance.
  • Create and maintain documentation for incident response and system maintenance procedures.
  • Participate in on-call rotations to provide 24/7 support.

Skills: AWS, PostgreSQL, Elasticsearch, Kafka, Kubernetes, Cassandra, Grafana, Communication Skills

Posted 2024-09-20

📍 Poland

🧭 Full-Time

🔍 Data observability and IT Security

🏢 Company: Cribl | 👥 251-500 | 💰 $150.0m Series D on 2022-05-24 | Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise-scale continuous delivery environments.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with sustainable incident response in a blameless environment.
  • Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible.
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies.
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry.
  • Background in Linux Systems Engineering.
  • Experience with incident-response tools such as PagerDuty, FireHydrant, and Blameless.
  • Comfortable with a high level of autonomy and working with a distributed team.

  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

Skills: AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, Linux

Posted 2024-08-29

📍 EMEA

🔍 Blockchain

  • Proven experience in an independent contributor role with cloud platform technologies (AWS, GCP, Azure, etc.).
  • Proficiency in scripting and programming languages such as Python, Golang, or TypeScript.
  • Experience with container technologies and microservices architecture (e.g., Docker, Kubernetes).
  • Hands-on experience with monitoring tools such as Prometheus, Grafana, and the ELK stack.
  • Excellent problem-solving skills and ability to troubleshoot complex issues independently.
  • Strong understanding of Linux/Unix systems administration and networking concepts.
  • Strong communication and collaboration skills for effective work in cross-functional teams.

  • Collaborate with software engineering teams to design scalable, highly available, and resilient systems.
  • Develop automation tools and scripts for deployment, monitoring, and incident response.
  • Configure monitoring systems to proactively detect issues and define alerting procedures.
  • Respond to critical incidents, conduct root cause analysis, and implement preventive measures.
  • Analyze performance metrics to identify bottlenecks and propose optimizations.
  • Implement best practices for security and compliance through collaboration with security teams.
  • Document system configurations and share knowledge with team members.

Skills: AWS, Docker, Python, Kubernetes, Golang, Grafana, Prometheus

Posted 2024-07-11