Senior Site Reliability Engineer

Posted over 1 year ago

🔍 Industry: Financial risk management

🗣️ Languages: English

🪄 Skills: Docker, Python, Kubernetes, C (Programming language)

Requirements:

A bachelor's degree in computer science, information systems, or the equivalent combination of education, experience, and training
Fluency in English, both written and spoken
4+ years of experience with AWS or Azure
Experience with automation, infrastructure-as-code, Terraform, Ansible, runbooks and troubleshooting guides
Experience with virtualization, container technologies and orchestration (Docker, Kubernetes)
Programming skills (Go, Python, or similar languages)
Experience with CI/CD pipelines
Experience with monitoring, troubleshooting and guiding on incidents
Self-driven and motivated, with a strong work ethic and a passion for problem-solving

Responsibilities:

Build and maintain tools for deployment, monitoring, operations, and analytics
Development with Go, Python, or similar languages
Document and guide engineers through playbooks and troubleshooting guides
Contribute to application self-healing in a cloud-based environment
Leverage, configure and troubleshoot cloud resources in AWS
Migrate and operate workloads in Kubernetes
Participate in incident response, root cause investigation, and resolution
Maintain and develop our infrastructure as code (IaC) to manage and operate end-to-end lifecycle operations (monitoring, alerting, security, cost optimization, configuration, backup, etc.) in production environments
Utilize your experience and problem-solving skills to help prevent and investigate production issues
Communicate with team members and stakeholders in a globally distributed and asynchronous environment
Investigate, describe, and drive improvements on current infrastructure, promoting evolution and sharing knowledge amongst the team

Related Jobs

🔥 Senior Site Reliability Engineer

Posted 1 day ago

📍 United Kingdom

🔍 Software Development

🏢 Company: StarRez | 👥 251-500 | 💰 Private about 3 years ago | Consulting SaaS Property Management Software

🔧 Requirements

1+ years experience working on a SaaS platform
Proven experience (2+ Years) in a Platform Engineering, Site Reliability Engineering or Software Engineering role.
Proficiency in at least one (or more) object-oriented programming language (C# preferable)
Production experience operating containerization technologies (Kubernetes).
Proficiency with one or more public cloud providers such as Azure, AWS or GCP
Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
Experience with monitoring, observability and logging tools such as DataDog, Prometheus, Grafana, or similar.
Proven track record of maintaining highly-available and performant production environments.
Ability to identify and implement effective mitigation strategies and operational playbooks.

💡 Responsibilities

Provide technical leadership and mentoring within the team through knowledge sharing sessions, pair programming, code reviews and solution design
Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
Implement and maintain monitoring/alerting/logging systems to identify and respond to incidents
Conduct/participate in Root Cause Analyses (RCAs) and blameless post-mortems
Participate in on-call rotations to ensure system reliability and rapid incident response.
Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth
Conduct performance tests to identify and remediate bottlenecks
Develop and maintain platform solutions, automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
Monitor, review and tune databases to ensure high availability and performance
Collaborate with product engineering teams to design/build fit-for-purpose and observable software
Contribute and collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs) as required

AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

🔥 Senior Site Reliability Engineer (SRE) - Poland

Posted 5 days ago

📍 Poland

🔍 Software Development

🔧 Requirements

Extensive experience with enterprise scale continuous delivery environments
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with sustainable incident response in a blameless environment
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
Background in Linux Systems Engineering
Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
On-call responsibilities

AWS, Docker, Node.js, Python, Bash, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Algorithms, Data Structures, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Ansible, Scripting, Software Engineering, Debugging

🔥 Senior Site Reliability Engineer

Posted 6 days ago

📍 United States

💸 99,300 - 124,100 USD per year

🔍 Software Development

🏢 Company: Natera | 👥 1001-5000 | 💰 $250,000,000 Post-IPO Equity over 1 year ago | 🫂 Last layoff almost 2 years ago | Women's Biotechnology Medical Genetics Health Diagnostics

🔧 Requirements

Strong all around experience in Amazon Web Services, AWS certification preferred.
Experience with CloudFormation and Lambda / Serverless as part of infrastructure.
Solid experience with EKS, Kubernetes CKA certification preferred.
Strong experience with Terraform.
3+ years of experience with programming languages such as Python, Java, or similar for scripting, automation, and building tools.
Good understanding of Docker and Linux / Unix administration.
Practical experience building CI/CD pipelines using GitLab or similar tools.
Practical experience managing applications deployed using Docker in Cloud.
Experience with container orchestration tools.
Strong communication skills. Be able to justify and stand for the proper solution.

💡 Responsibilities

Develop automation and CI/CD processes to enable teams to build, test, deploy, manage, configure, secure, scale and monitor their applications using the latest technologies such as Docker, Kubernetes, Terraform and others.
Manage R&D AWS Infrastructure and accounts.
Work closely with teams inside R&D to investigate areas of improvement and eliminate bottlenecks.
Build and deploy cloud-based infrastructure to support R&D.
Participate in architectural decisions to help improve the quality of our infrastructure and applications.
Work tightly with groups within and external to R&D for best overall systems design and operations.
Be a cloud expert for your team and R&D teams.

AWS, Docker, Python, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

🔥 Senior Site Reliability Engineer - Linux

Posted 11 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy | 👥 5001-10000 | 💰 $800,000,000 Post-IPO Equity about 3 years ago | 🫂 Last layoff over 1 year ago | Web Hosting Domain Registrar Web Development Online Portals

🔧 Requirements

A track record of delivering capabilities that build customer value and business impact.
Knowledge of principles for building performant and quality REST APIs.
Experience with testing code, care and feeding of both on-premises and cloud compute systems, Docker and other container-related technologies, Python or similar languages, and HashiCorp Vault or similar tooling.

💡 Responsibilities

Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

🔥 Senior Site Reliability Engineer II (Kafka)

Posted 11 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Braze | 👥 1001-5000 | 💰 Grant over 1 year ago | CRM Analytics Marketing Marketing Automation Software

🔧 Requirements

5+ years of experience as a Software, DevOps, or Site Reliability Engineer
3+ years of Data Streaming Reliability Engineering
Experience in monitoring, troubleshooting, and optimizing Kafka streaming applications, including diagnosing lag, partition imbalances, consumer group issues, and broker failures
Expertise in setting up alerting, dashboards, and runbooks for high-availability and fault-tolerant streaming pipelines
3+ years of Kafka performance tuning & automation
Strong background in scaling Kafka clusters, tuning producer/consumer configurations, and managing schema evolution.
Proficiency in infrastructure automation (Terraform, Ansible, Kubernetes) and CI/CD practices to streamline deployments and ensure resilient data streaming workflows.
You think about systems - interfaces, boundaries, edge cases, failure modes, behaviors, specific implementations
Have an urge to collaborate, document, and deliver quickly
Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
Have a desire to solve everyday challenges facing software engineers and automate their toil away
Have an excellent ability to manage multiple tasks and expectations at once
Know your way around Linux and Unix Shell.
Have strong programming skills - Ruby and/or Go preferred
Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies
Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies

💡 Responsibilities

Partner with Braze’s engineering teams on: Architecting products to effectively utilize infrastructure platforms in a scalable, reliable manner
Debugging reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
Make monitoring and alerting alert on symptoms, not on outages
Ensure that Braze meets our strict enterprise-grade SLAs with customers
Develop Braze’s internal platform infrastructure: Create Infrastructure as code using Chef, Terraform, and Kubernetes
Develop deployment pipelines for applications in multiple languages using Docker, Kubernetes, etc.
Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, capacity management, reducing operational pain, and improving the day-to-day workflow of Braze’s engineering teams
Manage incidents: Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers
Use your on-call shift to prevent incidents from ever happening
Retrospect everything that happens to turn lessons into system improvements/changes, automation, etc.

Docker, Kafka, Kubernetes, MongoDB, Ruby, Go, Redis, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, Ansible

🔥 Senior Site Reliability Engineer

Posted 20 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed 8 months ago | Data Management SaaS Application Performance Management

🔧 Requirements

NOT STATED

💡 Responsibilities

Design, build, and maintain scalable and secure cloud infrastructure as code
Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
Enable cost transparency and optimize infrastructure spending
Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
Build and maintain robust CI/CD pipelines that accelerate time from code to customer
Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
Lead and continuously improve our Incident Management process
Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
Develop and maintain incident response playbooks and post-mortem practices

AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 20 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity | 👥 51-200 | 💰 Corporate over 2 years ago | Software Development

🔧 Requirements

Proven experience with SRE/DevOps tools, processes, and culture.
Proficient in programming languages like Python, Go, and TypeScript.
5+ years of experience participating in an SRE on-call rotation.
Analytical mindset for designing, diagnosing, and optimizing infrastructure.
Skilled in managing scalable, highly available, cloud-based applications.
Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
Strong database management skills, particularly with PostgreSQL.
Experience with infrastructure as code, using tools like Terraform.
Proficient in building and maintaining CI/CD pipelines.
Familiarity with observability tools like Prometheus and similar stacks.
Calm and clear-headed in incident and outage situations, with a thoughtful communication style for high-pressure environments.
Open-minded yet discerning when it comes to exploring new technologies.

💡 Responsibilities

Plan and implement a global platform for delivering our software as a service.
Diagnose and troubleshoot complex distributed systems.
Ensure observability and analyze the behavior of our stack.
Orchestration, deployment, monitoring, automation.
Participate in our on-call rotation.

PostgreSQL, Python, Cloud Computing, ElasticSearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

🔥 Senior Site Reliability Engineer

Posted 25 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: AssuredCloud Data Services B2B Cloud Security Cyber Security

🔧 Requirements

Experience in a start-up environment
Design and maintain highly available database solutions, ideally PostgreSQL
Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
Strong engineering background
Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)

💡 Responsibilities

Provision infrastructure and tooling
Create automated tooling to maintain the platform
Build methods for monitoring and scaling services
Implement security compliance strategies
Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

🔥 Senior Site Reliability Engineer

Posted 28 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Vantage | 👥 1001-5000 | Cryptocurrency Financial Services FinTech Trading Platform

🔧 Requirements

5 years of experience as a Site Reliability Engineer or DevOps Engineer, working with software and infrastructure.
Experience in one or more of the following: Python, JavaScript, Ruby, Groovy, PHP, or Bash.
Experience in one of the cloud platforms: Azure, AWS, or GCP.

💡 Responsibilities

Collaborate with a diverse team of software engineers, engaging in iterative processes and effective task planning to drive our projects forward.
Take ownership of the end-to-end availability and performance of our services, proactively identifying potential issues, and implementing automation to prevent the recurrence of problems.
Participate in an on-call rotation, ensuring our systems remain stable and responsive even during off-hours.
Foster collaboration with other engineering teams, promoting the reuse of existing frameworks and gaining insights into their operation.
Lead the development, implementation, and achievement of service-level objectives that are instrumental in maintaining product reliability.
Collaborate with software engineering teams to design, implement, and maintain CI/CD pipelines, enabling rapid and reliable software releases.
Automate and optimize our infrastructure provisioning, configuration, and management processes using industry-standard tools and best practices.
Implement and manage containerization and orchestration technologies to enhance scalability and resource utilization.
Maintain and enhance version control systems and repositories for codebase management.
Steer and drive the SRE / DevOps roadmap, assuming full ownership while actively engaging in negotiation and strategic planning to ensure its successful execution.
Stay current with industry trends, emerging technologies, and best practices in SRE, DevOps, and automation.

AWS, Python, SQL, Bash, Cloud Computing, GCP, Kubernetes, Snowflake, Azure, CI/CD, RESTful APIs, DevOps, Terraform, Troubleshooting, Scripting, Debugging

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

🔧 Requirements

5+ years of AWS Experience.
3+ years Kubernetes Experience.
Ruby on Rails experience.
Expert at profiling and benchmarking source code.
Effective at code review, and identifying potential performance problems before they reach production.
Experience with Datadog or other APM tools.
Excellent written and verbal communication skills.

💡 Responsibilities

Manage cloud infrastructure using Infrastructure as Code.
Manage and scale a Ruby on Rails stack.
Implement monitoring tools to improve observability.
Perform code review of new features to ensure they meet performance requirements.
Debug production issues across all levels of the stack.
Plan for the growth of, optimize, and automate Fleetio’s Infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices
