Senior Site Reliability Engineer

Posted 3 months ago (inactive)

💎 Seniority level: Senior, Minimum of 5 years

📍 Location: Canada, Chile

🔍 Industry: Technology

🏢 Company: Launchpad Technologies

⏳ Experience: Minimum of 5 years

🪄 Skills: AWS, Docker, Leadership, Python, Bash, GCP, Kubernetes, Ruby, Azure, Communication Skills, Analytical Skills, DevOps, Terraform, Documentation, Compliance, Troubleshooting

Requirements:
  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
  • Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or similar roles.
  • Familiarity with monitoring tools and systems.
  • Proficient in scripting languages such as Python, Bash, or Ruby.
  • Experience with infrastructure automation tools such as Terraform, Ansible, or Chef.
  • Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
  • Strong knowledge of cloud platforms such as AWS, GCP, or Azure.
  • Excellent troubleshooting and analytical skills.
  • Strong communication skills and the ability to work effectively within a team.
Responsibilities:
  • Develop, maintain, and improve automated deployment, certification, and validation pipelines.
  • Define, implement, and monitor service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs).
  • Lead efforts to optimize, improve, and maintain the reliability and performance of the SaaS platform.
  • Manage third-party services and technologies used to support the SRE discipline.
  • Collaborate with senior management and the engineering team to lead SRE initiatives and provide updates.
  • Define and implement an observability framework to provide insights into system performance and behavior.
  • Implement proactive monitors and alerts to ensure system reliability and performance meet customer expectations.
  • Own operational incident management, providing support to related teams and individuals during incident resolution.
  • Identify and implement best practices for system reliability, security, scalability, and performance.
  • Participate in on-call rotations for system support, troubleshooting, and resolution.
  • Conduct post-mortem reviews of incidents, identify root causes, and implement remediation steps.
  • Develop and maintain documentation for systems, processes, and procedures.

Related Jobs

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy

👥 5001-10000

💰 $800,000,000 Post-IPO Equity about 3 years ago

🫂 Last layoff over 1 year ago

Web Hosting, Domain Registrar, Web Development, Online Portals

  • A track record of delivering capabilities that build customer value and business impact.
  • Knowledge of principles for building performant and quality REST APIs.
  • Experience with testing code, care and feeding of both on-premises and cloud compute systems, Docker and other container-related technologies, Python or similar languages, and HashiCorp Vault or similar tooling.
  • Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
  • Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
  • Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
  • Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

Posted 11 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Braze

👥 1001-5000

💰 Grant over 1 year ago

CRM, Analytics, Marketing, Marketing Automation, Software

  • 5+ years of experience as a Software, DevOps, or Site Reliability Engineer
  • 3+ years of Data Streaming Reliability Engineering
  • Experience in monitoring, troubleshooting, and optimizing Kafka streaming applications, including diagnosing lag, partition imbalances, consumer group issues, and broker failures
  • Expertise in setting up alerting, dashboards, and runbooks for high-availability and fault-tolerant streaming pipelines
  • 3+ years of Kafka performance tuning & automation
  • Strong background in scaling Kafka clusters, tuning producer/consumer configurations, and managing schema evolution.
  • Proficiency in infrastructure automation (Terraform, Ansible, Kubernetes) and CI/CD practices to streamline deployments and ensure resilient data streaming workflows.
  • You think about systems - interfaces, boundaries, edge cases, failure modes, behaviors, specific implementations
  • Have an urge to collaborate, document, and deliver quickly
  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
  • Have a desire to solve everyday challenges facing software engineers and automate their toil away
  • Have an excellent ability to manage multiple tasks and expectations at once
  • Know your way around Linux and Unix Shell.
  • Have strong programming skills - Ruby and/or Go preferred
  • Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies
  • Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies
  • Partner with Braze’s engineering teams on: Architecting products to effectively utilize infrastructure platforms in a scalable, reliable manner
  • Debugging reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
  • Make monitoring and alerting trigger on symptoms, not on outages
  • Ensure that Braze meets our strict enterprise-grade SLAs with customers
  • Develop Braze’s internal platform infrastructure: Create Infrastructure as code using Chef, Terraform, and Kubernetes
  • Develop deployment pipelines for applications in multiple languages using Docker, Kubernetes, etc.
  • Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, capacity management, reducing operational pain, and improving the day-to-day workflow of Braze’s engineering teams
  • Manage incidents: Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers
  • Use your on-call shift to prevent incidents from ever happening
  • Retrospect everything that happens to turn lessons into system improvements/changes, automation, etc.

Docker, Kafka, Kubernetes, MongoDB, Ruby, Go, Redis, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, Ansible

Posted 11 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured

Cloud Data Services, B2B, Cloud Security, Cyber Security

  • Experience in a start-up environment
  • Design and maintain highly available database solutions, ideally PostgreSQL
  • Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
  • Strong engineering background
  • Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)
  • Provision infrastructure and tooling
  • Create automated tooling to maintain the platform
  • Build methods for monitoring and scaling services
  • Implement security compliance strategies
  • Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

Posted 25 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Vantage

👥 1001-5000

Cryptocurrency, Financial Services, FinTech, Trading Platform

  • 5 years of experience as a Site Reliability Engineer or DevOps Engineer, working with software and infrastructure.
  • Experience in one or more of the following: Python, JavaScript, Ruby, Groovy, PHP, or Bash.
  • Experience in one of the cloud platforms: Azure, AWS, or GCP.
  • Collaborate with a diverse team of software engineers, engaging in iterative processes and effective task planning to drive our projects forward.
  • Take ownership of the end-to-end availability and performance of our services, proactively identifying potential issues, and implementing automation to prevent the recurrence of problems.
  • Participate in an on-call rotation, ensuring our systems remain stable and responsive even during off-hours.
  • Foster collaboration with other engineering teams, promoting the reuse of existing frameworks and gaining insights into their operation.
  • Lead the development, implementation, and achievement of service-level objectives that are instrumental in maintaining product reliability.
  • Collaborate with software engineering teams to design, implement, and maintain CI/CD pipelines, enabling rapid and reliable software releases.
  • Automate and optimize our infrastructure provisioning, configuration, and management processes using industry-standard tools and best practices.
  • Implement and manage containerization and orchestration technologies to enhance scalability and resource utilization.
  • Maintain and enhance version control systems and repositories for codebase management.
  • Steer and drive the SRE / DevOps roadmap, assuming full ownership while actively engaging in negotiation and strategic planning to ensure its successful execution.
  • Stay current with industry trends, emerging technologies, and best practices in SRE, DevOps, and automation.

AWS, Python, SQL, Bash, Cloud Computing, GCP, Kubernetes, Snowflake, Azure, CI/CD, RESTful APIs, DevOps, Terraform, Troubleshooting, Scripting, Debugging

Posted 28 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

  • 5+ years of AWS Experience.
  • 3+ years Kubernetes Experience.
  • Ruby on Rails experience.
  • Expert at profiling and benchmarking source code.
  • Effective at code review, and identifying potential performance problems before they reach production.
  • Experience with Datadog or other APM tools.
  • Excellent written and verbal communication skills.
  • Manage cloud infrastructure using Infrastructure as Code.
  • Manage and scale a Ruby on Rails stack.
  • Implement monitoring tools to improve observability.
  • Perform code review of new features to ensure they meet performance requirements.
  • Debug production issues across all levels of the stack.
  • Plan for the growth of, optimize, and automate Fleetio’s Infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

Posted about 1 month ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote employment solutions

🏢 Company: Remote - Referral Board

  • Significant and demonstrated experience as a Senior Site Reliability Engineer.
  • Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
  • Knowledge of CI/CD tools, with a preference for GitLab CI.
  • Experience with a back-end programming language such as Elixir, Clojure, Java, Node.js, or Python.
  • Experience in a programming language used for developing SRE tooling, like Go or Python.
  • Experience running and configuring Linux systems in non-cloud environments.
  • Security knowledge from both defensive and offensive perspectives.
  • Excellent communication and interpersonal skills.
  • Managing and improving existing infrastructure.
  • Helping build the next generation of the platform using tools like Kubernetes, Terraform, and Docker.
  • Streamlining and automating deployment processes.
  • Working closely with the Security team to address potential threats and patches.
  • Supporting engineers and product teams to enhance scalability, stability, and reliability.

AWS, Python, Kubernetes, Go, Linux, Terraform

Posted about 1 month ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote Employment and Compliance Solutions

🏢 Company: Remote

👥 1001-5000

💰 $300,000,000 Series C almost 3 years ago

🫂 Last layoff over 2 years ago

Human Resources Services

  • Significant and demonstrated experience as a Senior Site Reliability Engineer, which includes architecting, implementing, and maintaining a Platform for other teams.
  • Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
  • Knowledge of CI/CD tools (GitLab CI is preferred).
  • Experience with a back-end programming language (Elixir, Clojure, Java, Node.js, Python, etc.).
  • Experience with a programming language for SRE tooling (Go, Python).
  • Experience running and configuring Linux systems in a non-cloud environment.
  • Security knowledge from both defensive and offensive perspectives.
  • Excellent communication and interpersonal skills.
  • Managing and improving our existing infrastructure.
  • Helping build the next generation of our platform using tools like Kubernetes, Terraform, and Docker.
  • Streamlining and automating deployment processes.
  • Working closely with the Security team to address potential threats and patches.
  • Supporting engineers and product teams to improve overall scalability, stability, and reliability.

AWS, Python, Kubernetes, Go, CI/CD, Linux, Terraform

Posted about 1 month ago

📍 Argentina, Brazil

🧭 Full-Time

💸 65,000 - 90,000 USD per year

🔍 Cybersecurity

🏢 Company: SecurityScorecard

👥 251-500

💰 $180,000,000 Series E almost 4 years ago

Security, Risk Management, Cyber Security, Software

  • Proven experience as an SRE, DevOps Engineer, or similar role
  • Strong background in CI/CD tools (Jenkins, GitHub Actions, etc.)
  • Experience with cloud platforms (AWS, GCP, Azure) and with containers and orchestration (Docker, Kubernetes)
  • Proficiency with infrastructure as code tools (Terraform, Ansible)
  • Experience with automated testing frameworks (Selenium, JUnit)
  • Knowledge of scripting languages (Python, Bash)
  • Familiarity with monitoring and observability tools (Prometheus, Grafana)
  • Design, implement, and maintain CI/CD pipelines
  • Enhance infrastructure as code practices
  • Optimize deployment rollbacks and improve incident response
  • Develop automated testing strategies
  • Collaborate with developers for application reliability
  • Build monitoring and alerting solutions
  • Drive improvements in observability and metrics collection
  • Participate in on-call rotations

AWS, Docker, Python, Bash, JUnit, Kubernetes, Grafana, Prometheus, Selenium, CI/CD, Terraform

Posted about 2 months ago

📍 Worldwide

🧭 Contract

🔍 Software Development

🏢 Company: Teravision Technologies

👥 251-500

💰 over 13 years ago

Android, iOS, Mobile Apps, Information Technology, Software

  • Experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
  • Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
  • Hands-on experience with AWS services such as S3, Route 53, and others.
  • Strong understanding of backend systems and infrastructure management.
  • Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
  • Prior experience in an on-call role.
  • Knowledge of monitoring and alerting tools to support on-call responsibilities.
NOT STATED

AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

Posted about 2 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

  • Degree in Computer Science or related field
  • 5+ years experience in site reliability engineering
  • Proficiency in AWS, Azure, or Google Cloud
  • Experience with IaC tools like Terraform or CloudFormation
  • Develop and document disaster recovery plans and procedures
  • Collaborate with teams to identify and mitigate risks
  • Monitor system performance and enhance reliability

AWS, Azure, Terraform

Posted 4 months ago