Senior Site Reliability Engineer

Posted over 1 year ago

🔍 Industry: Financial risk management

🗣️ Languages: English

🪄 Skills: Docker, Python, Kubernetes, C (Programming language)

Requirements:
  • A bachelor's degree in computer science, information systems, or an equivalent combination of education, experience, and training
  • Fluency in English, both written and spoken
  • 4+ years of experience with AWS or Azure
  • Experience with automation, infrastructure as code, Terraform, Ansible, runbooks, and troubleshooting guides
  • Experience with virtualization, container technologies, and orchestration (Docker, Kubernetes)
  • Programming skills (Go, Python, or similar languages)
  • Experience with CI/CD pipelines
  • Experience with monitoring, troubleshooting, and guiding incident response
  • Self-driven and motivated, with a strong work ethic and a passion for problem-solving
Responsibilities:
  • Build and maintain tools for deployment, monitoring, operations, and analytics
  • Develop with Go, Python, or similar languages
  • Document and guide engineers through playbooks and troubleshooting guides
  • Contribute to application self-healing in a cloud-based environment
  • Leverage, configure, and troubleshoot cloud resources in AWS
  • Migrate and operate workloads in Kubernetes
  • Participate in incident response, root cause investigation, and resolution
  • Maintain and develop our infrastructure as code (IaC) to manage and operate end-to-end lifecycle operations (monitoring, alerting, security, cost optimization, configuration, backup, etc.) in production environments
  • Utilize your experience and problem-solving skills to help prevent and investigate production issues
  • Communicate with team members and stakeholders in a globally distributed and asynchronous environment
  • Investigate, describe, and drive improvements on current infrastructure, promoting evolution and sharing knowledge amongst the team

Related Jobs

📍 United Kingdom

🔍 Software Development

🏢 Company: StarRez | 👥 251-500 | 💰 Private, about 3 years ago | Consulting, SaaS, Property Management, Software

  • 1+ years experience working on a SaaS platform
  • Proven experience (2+ years) in a Platform Engineering, Site Reliability Engineering, or Software Engineering role.
  • Proficiency in at least one object-oriented programming language (C# preferred)
  • Production experience operating containerization technologies (Kubernetes).
  • Proficiency with one or more public cloud providers such as Azure, AWS or GCP
  • Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
  • Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
  • Experience with monitoring, observability and logging tools such as DataDog, Prometheus, Grafana, or similar.
  • Proven track record of maintaining highly-available and performant production environments.
  • Ability to identify and implement effective mitigation strategies and operational playbooks.
  • Provide technical leadership and mentoring within the team through knowledge sharing sessions, pair programming, code reviews and solution design
  • Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
  • Implement and maintain monitoring/alerting/logging systems to identify and respond to incidents
  • Conduct/participate in Root Cause Analyses (RCAs) and blameless post-mortems
  • Participate in on-call rotations to ensure system reliability and rapid incident response.
  • Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth
  • Conduct performance tests to identify and remediate bottlenecks
  • Develop and maintain platform solutions, automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
  • Monitor, review and tune databases to ensure high availability and performance
  • Collaborate with product engineering teams to design/build fit-for-purpose and observable software
  • Contribute and collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs) as required

AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

Posted 1 day ago

📍 Poland

🔍 Software Development

  • Extensive experience with enterprise scale continuous delivery environments
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with sustainable incident response in a blameless environment
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
  • Background in Linux Systems Engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities

AWS, Docker, Node.js, Python, Bash, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Algorithms, Data Structures, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Ansible, Scripting, Software Engineering, Debugging

Posted 5 days ago

📍 United States

💸 99,300 - 124,100 USD per year

🔍 Software Development

🏢 Company: Natera | 👥 1001-5000 | 💰 $250,000,000 Post-IPO Equity, over 1 year ago | 🫂 Last layoff almost 2 years ago | Women's, Biotechnology, Medical, Genetics, Health Diagnostics

  • Strong all-around experience in Amazon Web Services; AWS certification preferred.
  • Experience with CloudFormation and Lambda / Serverless as part of infrastructure.
  • Solid experience with EKS, Kubernetes CKA certification preferred.
  • Strong experience with Terraform.
  • 3+ years of experience with programming languages such as Python, Java, or similar for scripting, automation, and building tools.
  • Good understanding of Docker and Linux / Unix administration.
  • Practical experience building CI/CD pipelines using GitLab or similar tools.
  • Practical experience managing applications deployed using Docker in Cloud.
  • Experience with container orchestration tools.
  • Strong communication skills, including the ability to justify and stand behind the proper solution.
  • Develop automation and CI/CD processes to enable teams to build, test, deploy, manage, configure, secure, scale and monitor their applications using the latest technologies such as Docker, Kubernetes, Terraform and others.
  • Manage R&D AWS Infrastructure and accounts.
  • Work closely with teams inside R&D to investigate areas of improvement and eliminate bottlenecks.
  • Build and deploy cloud-based infrastructure to support R&D.
  • Participate in architectural decisions to help improve the quality of our infrastructure and applications.
  • Work tightly with groups within and external to R&D for best overall systems design and operations.
  • Be a cloud expert for your team and R&D teams.

AWS, Docker, Python, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

Posted 6 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy | 👥 5001-10000 | 💰 $800,000,000 Post-IPO Equity, about 3 years ago | 🫂 Last layoff over 1 year ago | Web Hosting, Domain Registrar, Web Development, Online Portals

  • A track record of delivering capabilities that build customer value and business impact.
  • Knowledge of principles for building performant and quality REST APIs.
  • Experience with testing code; care and feeding of both on-premises and cloud compute systems; Docker and other container-related technologies; Python or similar languages; and HashiCorp Vault or similar tooling.
  • Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
  • Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
  • Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
  • Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

Posted 11 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Braze | 👥 1001-5000 | 💰 Grant, over 1 year ago | CRM, Analytics, Marketing, Marketing Automation, Software

  • 5+ years of experience as a Software, DevOps, or Site Reliability Engineer
  • 3+ years of Data Streaming Reliability Engineering
  • Experience in monitoring, troubleshooting, and optimizing Kafka streaming applications, including diagnosing lag, partition imbalances, consumer group issues, and broker failures
  • Expertise in setting up alerting, dashboards, and runbooks for high-availability and fault-tolerant streaming pipelines
  • 3+ years of Kafka performance tuning & automation
  • Strong background in scaling Kafka clusters, tuning producer/consumer configurations, and managing schema evolution.
  • Proficiency in infrastructure automation (Terraform, Ansible, Kubernetes) and CI/CD practices to streamline deployments and ensure resilient data streaming workflows.
  • You think about systems - interfaces, boundaries, edge cases, failure modes, behaviors, specific implementations
  • Have an urge to collaborate, document, and deliver quickly
  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
  • Have a desire to solve everyday challenges facing software engineers and automate their toil away
  • Have an excellent ability to manage multiple tasks and expectations at once
  • Know your way around Linux and the Unix shell
  • Have strong programming skills - Ruby and/or Go preferred
  • Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies
  • Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies
  • Partner with Braze’s engineering teams on: Architecting products to effectively utilize infrastructure platforms in a scalable, reliable manner
  • Debugging reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
  • Make monitoring and alerting alert on symptoms and not on outages
  • Ensure that Braze meets our strict enterprise-grade SLAs with customers
  • Develop Braze’s internal platform infrastructure: Create Infrastructure as code using Chef, Terraform, and Kubernetes
  • Develop deployment pipelines for applications in multiple languages using Docker, Kubernetes, etc.
  • Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, capacity management, reducing operational pain, and improving the day-to-day workflow of Braze’s engineering teams
  • Manage incidents: Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers
  • Use your on-call shift to prevent incidents from ever happening
  • Retrospect everything that happens to turn lessons into system improvements/changes, automation, etc.

Docker, Kafka, Kubernetes, MongoDB, Ruby, Go, Redis, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, Ansible

Posted 11 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed, 8 months ago | Data Management, SaaS, Application Performance Management

NOT STATED
  • Design, build, and maintain scalable and secure cloud infrastructure as code
  • Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
  • Enable cost transparency and optimize infrastructure spending
  • Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
  • Build and maintain robust CI/CD pipelines that accelerate time from code to customer
  • Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
  • Lead and continuously improve our Incident Management process
  • Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
  • Develop and maintain incident response playbooks and post-mortem practices

AWS, Docker, CI/CD, Linux, Terraform

Posted 20 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity | 👥 51-200 | 💰 Corporate, over 2 years ago | Software Development

  • Proven experience with SRE/DevOps tools, processes, and culture.
  • Proficient in programming languages like Python, Go, and TypeScript.
  • 5+ years of experience participating in an SRE on-call rotation.
  • Analytical mindset for designing, diagnosing, and optimizing infrastructure.
  • Skilled in managing scalable, highly available, cloud-based applications.
  • Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
  • Strong database management skills, particularly with PostgreSQL.
  • Experience with infrastructure as code, using tools like Terraform.
  • Proficient in building and maintaining CI/CD pipelines.
  • Familiarity with observability tools like Prometheus and similar stacks.
  • Calm and clear-headed in incident and outage situations, with a thoughtful communication style for high-pressure environments.
  • Open-minded yet discerning when it comes to exploring new technologies.
  • Plan and implement a global platform for delivering our software as a service.
  • Diagnose and troubleshoot complex distributed systems.
  • Ensure observability and analyze the behavior of our stack.
  • Orchestration, deployment, monitoring, automation.
  • Participate in our on-call rotation.

PostgreSQL, Python, Cloud Computing, ElasticSearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

Posted 20 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured | Cloud Data Services, B2B, Cloud Security, Cyber Security

  • Experience in a start-up environment
  • Design and maintain highly available database solutions, ideally PostgreSQL
  • Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
  • Strong engineering background
  • Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)
  • Provision infrastructure and tooling
  • Create automated tooling to maintain the platform
  • Build methods for monitoring and scaling services
  • Implement security compliance strategies
  • Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

Posted 25 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Vantage | 👥 1001-5000 | Cryptocurrency, Financial Services, FinTech, Trading Platform

  • 5 years of experience as a Site Reliability Engineer or DevOps Engineer, working with software and infrastructure.
  • Experience in one or more of the following: Python, JavaScript, Ruby, Groovy, PHP, or Bash.
  • Experience in one of the cloud platforms: Azure, AWS, or GCP.
  • Collaborate with a diverse team of software engineers, engaging in iterative processes and effective task planning to drive our projects forward.
  • Take ownership of the end-to-end availability and performance of our services, proactively identifying potential issues, and implementing automation to prevent the recurrence of problems.
  • Participate in an on-call rotation, ensuring our systems remain stable and responsive even during off-hours.
  • Foster collaboration with other engineering teams, promoting the reuse of existing frameworks and gaining insights into their operation.
  • Lead the development, implementation, and achievement of service-level objectives that are instrumental in maintaining product reliability.
  • Collaborate with software engineering teams to design, implement, and maintain CI/CD pipelines, enabling rapid and reliable software releases.
  • Automate and optimize our infrastructure provisioning, configuration, and management processes using industry-standard tools and best practices.
  • Implement and manage containerization and orchestration technologies to enhance scalability and resource utilization.
  • Maintain and enhance version control systems and repositories for codebase management.
  • Steer and drive the SRE / DevOps roadmap, assuming full ownership while actively engaging in negotiation and strategic planning to ensure its successful execution.
  • Stay current with industry trends, emerging technologies, and best practices in SRE, DevOps, and automation.

AWS, Python, SQL, Bash, Cloud Computing, GCP, Kubernetes, Snowflake, Azure, CI/CD, RESTful APIs, DevOps, Terraform, Troubleshooting, Scripting, Debugging

Posted 28 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

  • 5+ years of AWS Experience.
  • 3+ years Kubernetes Experience.
  • Ruby on Rails experience.
  • Expert at profiling and benchmarking source code.
  • Effective at code review, and identifying potential performance problems before they reach production.
  • Experience with Datadog or other APM tools.
  • Excellent written and verbal communication skills.
  • Manage cloud infrastructure using Infrastructure as Code.
  • Manage and scale a Ruby on Rails stack.
  • Implement monitoring tools to improve observability.
  • Perform code review of new features to ensure they meet performance requirements.
  • Debug production issues across all levels of the stack.
  • Plan for the growth of, optimize, and automate Fleetio’s Infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

Posted about 1 month ago