Senior Site Reliability Engineer

Posted 3 months ago (inactive)

💎 Seniority level: Senior, Minimum of 5 years

📍 Location: Canada, Chile

🔍 Industry: Technology

🏢 Company: Launchpad Technologies

⏳ Experience: Minimum of 5 years

🪄 Skills: AWS, Docker, Leadership, Python, Bash, GCP, Kubernetes, Ruby, Azure, Communication Skills, Analytical Skills, DevOps, Terraform, Documentation, Compliance, Troubleshooting

Requirements:

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or similar roles.
Familiarity with monitoring tools and systems.
Proficient in scripting languages such as Python, Bash, or Ruby.
Experience with infrastructure automation tools such as Terraform, Ansible, or Chef.
Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
Strong knowledge of cloud platforms such as AWS, GCP, or Azure.
Excellent troubleshooting and analytical skills.
Strong communication skills and the ability to work effectively within a team.

Responsibilities:

Develop, maintain, and improve automated deployment, certification, and validation pipelines.
Define, implement, and monitor service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs).
Lead efforts to optimize, improve, and maintain the reliability and performance of the SaaS platform.
Manage third-party services and technologies used to support the SRE discipline.
Collaborate with senior management and the engineering team to lead SRE initiatives and provide updates.
Define and implement an observability framework to provide insights into system performance and behavior.
Implement proactive monitors and alerts to ensure system reliability and performance meet customer expectations.
Own operational incident management, providing support to related teams and individuals during incident resolution.
Identify and implement best practices for system reliability, security, scalability, and performance.
Participate in on-call rotations for system support, troubleshooting, and resolution.
Conduct post-mortem reviews of incidents, identify root cause, and implement remediation steps.
Develop and maintain documentation for systems, processes, and procedures.

Related Jobs

🔥 Senior Site Reliability Engineer - Linux

Posted 11 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy | 👥 5001-10000 | 💰 $800,000,000 Post-IPO Equity about 3 years ago | 🫂 Last layoff over 1 year ago | Web Hosting, Domain Registrar, Web Development, Online Portals

🔧 Requirements

A track record of delivering capabilities that build customer value and business impact.
Knowledge of principles for building performant and quality REST APIs.
Experience with testing code, the care and feeding of both on-premises and cloud compute systems, Docker and other container-related technologies, Python or similar languages, and HashiCorp Vault or similar tooling.

💡 Responsibilities

Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

🔥 Senior Site Reliability Engineer II (Kafka)

Posted 11 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Braze | 👥 1001-5000 | 💰 Grant over 1 year ago | CRM, Analytics, Marketing, Marketing Automation, Software

🔧 Requirements

5+ years of experience as a Software, DevOps, or Site Reliability Engineer
3+ years of Data Streaming Reliability Engineering
Experience in monitoring, troubleshooting, and optimizing Kafka streaming applications, including diagnosing lag, partition imbalances, consumer group issues, and broker failures
Expertise in setting up alerting, dashboards, and runbooks for high-availability and fault-tolerant streaming pipelines
3+ years of Kafka performance tuning & automation
Strong background in scaling Kafka clusters, tuning producer/consumer configurations, and managing schema evolution.
Proficiency in infrastructure automation (Terraform, Ansible, Kubernetes) and CI/CD practices to streamline deployments and ensure resilient data streaming workflows.
You think about systems - interfaces, boundaries, edge cases, failure modes, behaviors, specific implementations
Have an urge to collaborate, document, and deliver quickly
Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
Have a desire to solve everyday challenges facing software engineers and automate their toil away
Have an excellent ability to manage multiple tasks and expectations at once
Know your way around Linux and Unix Shell.
Have strong programming skills - Ruby and/or Go preferred
Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies
Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies

💡 Responsibilities

Partner with Braze’s engineering teams on: Architecting products to effectively utilize infrastructure platforms in a scalable, reliable manner
Debugging reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
Make monitoring and alerting fire on symptoms rather than on outages
Ensure that Braze meets our strict enterprise-grade SLAs with customers
Develop Braze’s internal platform infrastructure: Create Infrastructure as code using Chef, Terraform, and Kubernetes
Develop deployment pipelines for applications in multiple languages using Docker, Kubernetes, etc.
Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, capacity management, reducing operational pain, and improving the day-to-day workflow of Braze’s engineering teams
Manage incidents: Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers
Use your on-call shift to prevent incidents from ever happening
Retrospect everything that happens to turn lessons into system improvements/changes, automation, etc.

Docker, Kafka, Kubernetes, MongoDB, Ruby, Go, Redis, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, Ansible

🔥 Senior Site Reliability Engineer

Posted 25 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured | Cloud Data Services, B2B, Cloud Security, Cyber Security

🔧 Requirements

Experience in a start-up environment
Experience designing and maintaining highly available database solutions, ideally PostgreSQL
Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
Strong engineering background
Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)

💡 Responsibilities

Provision infrastructure and tooling
Create automated tooling to maintain the platform
Build methods for monitoring and scaling services
Implement security compliance strategies
Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

🔥 Senior Site Reliability Engineer

Posted 28 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Vantage | 👥 1001-5000 | Cryptocurrency, Financial Services, FinTech, Trading Platform

🔧 Requirements

5 years of experience as a Site Reliability Engineer or DevOps Engineer, working with software and infrastructure.
Experience in one or more of the following: Python, JavaScript, Ruby, Groovy, PHP, or Bash.
Experience in one of the cloud platforms: Azure, AWS, or GCP.

💡 Responsibilities

Collaborate with a diverse team of software engineers, engaging in iterative processes and effective task planning to drive our projects forward.
Take ownership of the end-to-end availability and performance of our services, proactively identifying potential issues, and implementing automation to prevent the recurrence of problems.
Participate in an on-call rotation, ensuring our systems remain stable and responsive even during off-hours.
Foster collaboration with other engineering teams, promoting the reuse of existing frameworks and gaining insights into their operation.
Lead the development, implementation, and achievement of service-level objectives that are instrumental in maintaining product reliability.
Collaborate with software engineering teams to design, implement, and maintain CI/CD pipelines, enabling rapid and reliable software releases.
Automate and optimize our infrastructure provisioning, configuration, and management processes using industry-standard tools and best practices.
Implement and manage containerization and orchestration technologies to enhance scalability and resource utilization.
Maintain and enhance version control systems and repositories for codebase management.
Steer and drive the SRE / DevOps roadmap, assuming full ownership while actively engaging in negotiation and strategic planning to ensure its successful execution.
Stay current with industry trends, emerging technologies, and best practices in SRE, DevOps, and automation.

AWS, Python, SQL, Bash, Cloud Computing, GCP, Kubernetes, Snowflake, Azure, CI/CD, RESTful APIs, DevOps, Terraform, Troubleshooting, Scripting, Debugging

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

🔧 Requirements

5+ years of AWS experience.
3+ years of Kubernetes experience.
Ruby on Rails experience.
Expert at profiling and benchmarking source code.
Effective at code review and at identifying potential performance problems before they reach production.
Experience with Datadog or other APM tools.
Excellent written and verbal communication skills.

💡 Responsibilities

Manage cloud infrastructure using Infrastructure as Code.
Manage and scale a Ruby on Rails stack.
Implement monitoring tools to improve observability.
Perform code review of new features to ensure they meet performance requirements.
Debug production issues across all levels of the stack.
Plan for the growth of Fleetio’s infrastructure, and optimize and automate it.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote employment solutions

🏢 Company: Remote - Referral Board

🔧 Requirements

Significant and demonstrated experience as a Senior Site Reliability Engineer.
Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
Knowledge of CI/CD tools, with a preference for GitLab CI.
Experience with a back-end programming language such as Elixir, Clojure, Java, Node.js, or Python.
Experience in a programming language used for developing SRE tooling, like Go or Python.
Experience running and configuring Linux systems in non-cloud environments.
Security knowledge from both defensive and offensive perspectives.
Excellent communication and interpersonal skills.

💡 Responsibilities

Managing and improving existing infrastructure.
Helping build the next generation of the platform using tools like Kubernetes, Terraform, and Docker.
Streamlining and automating deployment processes.
Working closely with the Security team to address potential threats and patches.
Supporting engineers and product teams to enhance scalability, stability, and reliability.

AWS, Python, Kubernetes, Go, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote Employment and Compliance Solutions

🏢 Company: Remote | 👥 1001-5000 | 💰 $300,000,000 Series C almost 3 years ago | 🫂 Last layoff over 2 years ago | Human Resources Services

🔧 Requirements

Significant and demonstrated experience as a Senior Site Reliability Engineer, which includes architecting, implementing, and maintaining a Platform for other teams.
Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
Knowledge of CI/CD tools (GitLab CI is preferred).
Experience with a back-end programming language (Elixir, Clojure, Java, Node.js, Python, etc.).
Experience with a programming language for SRE tooling (Go, Python).
Experience running and configuring Linux systems in a non-cloud environment.
Security knowledge from both defensive and offensive perspectives.
Excellent communication and interpersonal skills.

💡 Responsibilities

Managing and improving our existing infrastructure.
Helping build the next generation of our platform using tools like Kubernetes, Terraform, and Docker.
Streamlining and automating deployment processes.
Working closely with the Security team to address potential threats and patches.
Supporting engineers and product teams to improve overall scalability, stability, and reliability.

AWS, Python, Kubernetes, Go, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted about 2 months ago

📍 Argentina, Brazil

🧭 Full-Time

💸 65,000 - 90,000 USD per year

🔍 Cybersecurity

🏢 Company: SecurityScorecard | 👥 251-500 | 💰 $180,000,000 Series E almost 4 years ago | Security, Risk Management, Cyber Security, Software

🔧 Requirements

Proven experience as an SRE, DevOps Engineer, or similar role
Strong background in CI/CD tools (Jenkins, GitHub Actions, etc.)
Experience with cloud platforms (AWS, GCP, Azure) and with containers and orchestration (Docker, Kubernetes)
Proficiency with infrastructure as code tools (Terraform, Ansible)
Experience with automated testing frameworks (Selenium, JUnit)
Knowledge of scripting languages (Python, Bash)
Familiarity with monitoring and observability tools (Prometheus, Grafana)

💡 Responsibilities

Design, implement, and maintain CI/CD pipelines
Enhance infrastructure as code practices
Optimize deployment rollbacks and improve incident response
Develop automated testing strategies
Collaborate with developers for application reliability
Build monitoring and alerting solutions
Drive improvements in observability and metrics collection
Participate in on-call rotations

AWS, Docker, Python, Bash, JUnit, Kubernetes, Grafana, Prometheus, Selenium, CI/CD, Terraform

🔥 Senior Site Reliability Engineer

Posted about 2 months ago

📍 Worldwide

🧭 Contract

🔍 Software Development

🏢 Company: Teravision Technologies | 👥 251-500 | 💰 over 13 years ago | Android, iOS, Mobile Apps, Information Technology, Software

🔧 Requirements

Experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
Hands-on experience with AWS services such as S3, Route 53, and others.
Strong understanding of backend systems and infrastructure management.
Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
Prior experience in an on-call role.
Knowledge of monitoring and alerting tools to support on-call responsibilities.

💡 Responsibilities

Not stated

AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

🔥 Senior Site Reliability Engineer (SRE) - Disaster Recovery Specialist (m/f/x)

Posted 4 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🔧 Requirements