Site Reliability Engineer (SRE)

Posted 2024-11-21

📍 Location: Portugal

🔍 Industry: Vertical AI SaaS solutions

🏢 Company: Intapp

🪄 Skills: Python, SQL, Agile, Jenkins, JVM, Product Management, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps

Requirements:

Hands-on experience in building fault-tolerant and scalable systems.
Experience with database technologies such as SQL Server, Postgres, and NoSQL.
Expertise in Configuration Management and CI/CD tools like Ansible, Jenkins, and Azure DevOps.
Hands-on experience building and running production workloads on Azure.
Strong scripting abilities in languages like Python, Perl, Go, or JVM-based languages.
Solid understanding of continuous integration, deployment, and operations concepts.
Production experience managing Windows infrastructure running IIS workloads.
Passion for resolving reliability issues and automating processes.

Responsibilities:

Work with Development and Product Management to design and deliver new functionality.
Perform deep dives into systemic and latent reliability issues while collaborating with software engineers.
Drive standardization efforts across multiple disciplines and services with SREs.
Identify and drive opportunities to improve automation for deployment and management of services.
Work in an agile operations framework, balancing sprint-based work with daily operations needs.
Participate in a 24x7 on-call rotation.

Related Jobs

🔥 Site Reliability Engineer (SRE)

Posted 2024-11-21

📍 Portugal

🔍 Vertical AI SaaS solutions

🏢 Company: Intapp

🔧 Requirements

Hands-on experience in building fault-tolerant and scalable systems.
Experience with different database technologies such as SQL Server, Postgres, and NoSQL.
Expertise in Configuration Management and CI/CD tools such as Ansible, Jenkins, and Azure DevOps.
Hands-on experience building and running production workloads on Azure.
Strong scripting abilities in Python, Perl, Go, or JVM-based languages.
Solid understanding of continuous integration, deployment and operations concepts.
Production experience managing Windows infrastructure running IIS workloads.
Passion for resolving reliability issues and developing strategies to mitigate future ones.
Automation mindset: if you can automate it, do it.

💡 Responsibilities

Work with Development and Product Management to design and deliver new functionality.
Perform deep dives into both systemic and latent reliability issues; partner with software engineers across the organization to produce and roll out fixes.
Drive standardization efforts across multiple disciplines and services in conjunction with SREs throughout the organization.
Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services.
Work in an agile operations framework, balancing sprint-based work with daily operations needs.
Participate in a 24x7 on-call rotation with 12-hour shifts.

Python, SQL, Agile, Jenkins, JVM, Azure, Go, Postgres, NoSQL, Collaboration, CI/CD, DevOps

🔥 Site Reliability Engineer (SRE) (m/w/d)

Posted 2024-11-07

📍 Germany and within Europe

🧭 Full-Time

🔍 Technology / Employee Communication

🏢 Company: Flip App

🔧 Requirements

Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
Deep knowledge of Kubernetes and container solutions.
Interest in observability tools such as Prometheus, VictoriaMetrics, Mimir, Loki, ELK.
Familiarity with SLO, error budget, and Apdex.
Good knowledge of software development languages like Go, Python, Kotlin.
Business fluent in English; German is a plus.
Experience with infrastructure as code tools (e.g., Pulumi, OpenTofu) and automation tools (e.g., Ansible, Chef).

💡 Responsibilities

Ensure the availability, performance, and scalability of the infrastructure.
Promote practices like CI/CD, observability, and developer experience.
Shape goals for scalable systems and observability.
Expand cloud infrastructure and Kubernetes cluster.
Ensure resilience and safety through zero-downtime rollouts.
Create observability through the further development of the LGTM stack.
Design, develop, and optimize infrastructure as code using Pulumi in Go.

AWS, Python, Software Development, GCP, Kotlin, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD

🔥 Senior Site Reliability Engineer (SRE)

Posted 2024-11-07

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

🔧 Requirements

Proficiency in programming languages such as Python, Go, and JavaScript.
5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
Strong understanding of Linux/Unix systems and networking.
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
Willingness to collaborate and share knowledge with colleagues.
Ability to take responsibility for work and demonstrate accountability.

💡 Responsibilities

Develop and maintain monitoring and alerting solutions.
Respond to incidents, troubleshoot issues, and perform root cause analysis.
Automate repetitive tasks and improve deployment processes.
Develop and maintain tools to support infrastructure and applications.
Analyze system performance and implement optimizations to improve efficiency and reduce latency.
Ensure systems are secure and compliant with relevant standards and regulations.
Maintain comprehensive documentation of systems and processes.
Share knowledge and best practices with team members.
Ensure the reliability, performance, and scalability of databases.
Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

🔥 Site Reliability Engineer (SRE)

Posted 2024-10-29

📍 Europe

🧭 Full-Time

🔍 Technology

🏢 Company: Flip GmbH

🔧 Requirements

Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
Deep knowledge of Kubernetes and container solutions.
Interest in observability tools and concepts like SLO, error budget.
Good knowledge of software development (e.g., Go, Python, Kotlin).
Business fluent in English.

💡 Responsibilities

Help scale the cloud infrastructure and Kubernetes clusters.
Ensure zero downtime with effective rollout, redundancy, migration strategies, and rollback mechanisms.
Develop and optimize LGTM stack and analyze SLOs.
Enhance operational safety and resilience of systems.
Design, develop and optimize production, development, and cloud infrastructure with Pulumi in Go.
Increase development efficiency and optimize code deployments through effective tools and processes.
Improve the CI/CD pipeline for faster feedback cycles and secure rollouts.

AWS, Docker, Python, Software Development, Agile, GCP, Kubernetes, Scrum, Azure, CI/CD

🔥 Site Reliability Engineer (SRE) - EMEA

Posted 2024-07-11

📍 EMEA

🔍 Blockchain

🔧 Requirements

Proven experience in an independent contributor role with cloud platform technologies (AWS, GCP, Azure, etc.).
Proficiency in scripting and programming languages such as Python, Golang, or TypeScript.
Experience with container technologies and microservices architecture (e.g., Docker, Kubernetes).
Hands-on experience with monitoring tools like Prometheus, Grafana, ELK stack.
Excellent problem-solving skills and ability to troubleshoot complex issues independently.
Strong understanding of Linux/Unix systems administration and networking concepts.
Strong communication and collaboration skills for effective work in cross-functional teams.

💡 Responsibilities

Collaborate with software engineering teams to design scalable, highly available, and resilient systems.
Develop automation tools and scripts for deployment, monitoring, and incident response.
Configure monitoring systems to proactively detect issues and define alerting procedures.
Respond to critical incidents, conduct root cause analysis, and implement preventive measures.
Analyze performance metrics to identify bottlenecks and propose optimizations.
Implement best practices for security and compliance through collaboration with security teams.
Document system configurations and share knowledge with team members.

AWS, Docker, Python, Kubernetes, Golang, Grafana, Prometheus
