Senior Site Reliability Engineer

Posted 21 days ago

💎 Seniority level: Senior, 5+ years

📍 Location: Continental US or Canada, EST, CST

🔍 Industry: Financial services

🏢 Company: Reach Financial | 👥 51-100 | Financial Services, Banking, Payments

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, Docker, Python, Grafana, Prometheus, CI/CD

Requirements:
  • 5+ years of experience in Site Reliability Engineering, Product Engineering, or a similar role.
  • Experience with monitoring and observability tools such as Datadog, OpenTelemetry, Prometheus, Grafana, or similar.
  • Strong coding skills in at least one language (Python, JavaScript, TypeScript, Apex, or similar).
  • Proficiency with CI/CD tools such as GitHub Actions, or similar.
  • Experience with containerization (Docker) and orchestration tools like AWS ECS.
  • Experience working with serverless architectures and event-driven systems.
  • A collaborative mindset with excellent communication skills.
Responsibilities:
  • Design and implement monitoring, alerting, and observability systems to ensure high system uptime and fast incident identification and resolution.
  • Define, implement, and monitor SLIs, SLOs, and error budgets in collaboration with engineering teams to ensure optimal service reliability.
  • Collaborate with development teams to design and optimize application and system performance, helping improve scalability and fault tolerance.
  • Lead incident response efforts, perform root cause analyses, and foster blameless postmortems to prevent recurrence.
  • Reduce toil by automating repetitive tasks to improve team efficiency and reduce manual intervention.
  • Manage and scale cloud infrastructure (Salesforce or AWS preferred) for critical systems.
  • Partner with Security teams to ensure compliance and best practices are integrated across all systems and processes.

Related Jobs

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

  • 5+ years of AWS Experience.
  • 3+ years Kubernetes Experience.
  • Ruby on Rails experience.
  • Expert at profiling and benchmarking source code.
  • Effective at code review, and identifying potential performance problems before they reach production.
  • Experience with Datadog or other APM tools.
  • Excellent written and verbal communication skills.
  • Manage cloud infrastructure using Infrastructure as Code.
  • Manage and scale a Ruby on Rails stack.
  • Implement monitoring tools to improve observability.
  • Perform code review of new features to ensure they meet performance requirements.
  • Debug production issues across all levels of the stack.
  • Plan for the growth of, optimize, and automate Fleetio’s Infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

Posted about 10 hours ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117,600 - 252,000 USD per year

🔍 Software Development

🏢 Company: GitLab | 👥 1001-5000 | 💰 $268,000,000 Series E over 5 years ago | 🫂 Last layoff almost 2 years ago | Developer Tools, DevOps, Open Source, SaaS, Cloud Security

  • Advanced database platform management experience, preferably using PostgreSQL and ClickHouse at scale.
  • Advanced Cloud Infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators and Kubernetes.
  • Solid experience with at least one programming language: Go, Ruby or Python.
  • Advanced experience with Linux.
  • Extensive on-call experience as an SRE supporting mission critical systems.
  • Solid incident management experience across all phases.
  • Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.
  • Design, build, and maintain ClickHouse and PostgreSQL clusters.
  • Provision cloud infrastructure using configuration management and IaC tools.
  • Implement high-availability ClickHouse solutions.
  • Optimize PostgreSQL clusters for core applications.
  • Build monitoring and alerting tools to ensure resource optimization.
  • Respond to platform alerts and user emergencies.
  • Enhance infrastructure security and partner with compliance assessors.
  • Collaborate with engineering teams for service rollouts and architectural improvements.

PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

Posted 5 days ago

📍 Colombia, USA

🧭 Contractor

🔍 Software outsourcing

🏢 Company: Teravision Technologies | 👥 251-500 | 💰 about 13 years ago | Android, iOS, Mobile Apps, Information Technology, Software

  • Proven experience managing the Kubernetes infrastructure.
  • Familiarity with CI/CD pipelines, particularly TeamCity and tools like SonarQube.
  • Hands-on experience with AWS services such as S3, Route 53, etc.
  • Strong understanding of backend systems and infrastructure management.
  • Excellent English communication skills and a Bachelor’s Degree in Computer Science or equivalent work experience.
  • Proven experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
  • Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
  • Prior experience in an on-call role and knowledge of monitoring and alerting tools to support on-call responsibilities.

AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

Posted 13 days ago

📍 United States, Estonia

🔍 B2B tech

🏢 Company: Pactum | 👥 51-100 | 💰 Grant 7 months ago | Artificial Intelligence (AI)

  • Experienced in managing Cloud Infrastructure (GCP) via Infrastructure as Code (Terraform).
  • Excellent problem-solving skills and experience debugging complex systems and network issues.
  • Experienced in using and setting up observability tools like OpenTelemetry and Grafana.
  • Proficient in programming languages such as Node.js, Bash, Kotlin, and Python; open to learning more and writing production code.
  • Excellent English communication skills.
  • Work on cloud-based infrastructure ensuring high availability.
  • Maintain infrastructure deployment and CI/CD processes.
  • Improve developer experience for local product development.
  • Secure access to infrastructure and services.
  • Continuously improve observability stack.
  • Support negotiation infrastructure for ML and AI technologies.
  • Manage PostgreSQL and respond to escalated database issues.
  • Implement SRE concepts including SLI/SLO and production readiness.

Node.js, PostgreSQL, Python, Bash, GCP, Kotlin, CI/CD, Terraform

Posted about 1 month ago

📍 Canada, Chile

🔍 Technology

🏢 Company: Launchpad Technologies

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
  • Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or similar roles.
  • Familiarity with monitoring tools and systems.
  • Proficient in scripting languages such as Python, Bash, or Ruby.
  • Experience with infrastructure automation tools such as Terraform, Ansible, or Chef.
  • Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
  • Strong knowledge of cloud platforms such as AWS, GCP, or Azure.
  • Excellent troubleshooting and analytical skills.
  • Strong communication skills and the ability to work effectively within a team.
  • Develop, maintain, and improve automated deployment, certification, and validation pipelines.
  • Define, implement, and monitor service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs).
  • Lead efforts to optimize, improve, and maintain the reliability and performance of the SaaS platform.
  • Manage third-party services and technologies used to support the SRE discipline.
  • Collaborate with senior management and the engineering team to lead SRE initiatives and provide updates.
  • Define and implement an observability framework to provide insights into system performance and behavior.
  • Implement proactive monitors and alerts to ensure system reliability and performance meet customer expectations.
  • Own operational incident management, providing support to related teams and individuals during incident resolution.
  • Identify and implement best practices for system reliability, security, scalability, and performance.
  • Participate in on-call rotations for system support, troubleshooting, and resolution.
  • Conduct post-mortem reviews of incidents, identify root cause, and implement remediation steps.
  • Develop and maintain documentation for systems, processes, and procedures.

AWS, Docker, Leadership, Python, Bash, GCP, Kubernetes, Ruby, Azure, Communication Skills, Analytical Skills, DevOps, Terraform, Documentation, Compliance, Troubleshooting

Posted 2 months ago

📍 US and Canada

🧭 Full-Time

💸 150,000 - 200,000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health | 👥 51-100 | 💰 Seed about 2 years ago | Medical, Wellness, Health Care

  • Bachelor's degree or Diploma in computer science, engineering, mathematics, or related field.
  • At least one year of experience as a Python developer transitioning to an SRE role.
  • Five years of experience in software development as a DevOps engineer and/or SRE.
  • Two years of experience in an SRE role with Kubernetes, preferably GKE.
  • Experience using ArgoCD for rollouts and deployments.
  • One year of experience with a service mesh such as Istio in a GKE environment.
  • Proficiency in scripting languages like Python and automation tools like Terraform.
  • Solid understanding of security best practices for pipelines and cloud environments.
  • Familiarity with compliance standards like SOC 2, HIPAA.
  • Strong expertise in CI/CD pipeline management.
  • Design and implement automated application deployment processes.
  • Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
  • Manage development, testing, staging, pre-production and production environments.
  • Automate repetitive deployment tasks to improve productivity.
  • Select, develop, and monitor CI/CD systems.
  • Oversee software automation across GCP.
  • Containerize services to optimize resources and deployment speed.
  • Manage and optimize cloud infrastructure for cost and performance.
  • Ensure compliance with security standards and maintain disaster recovery plans.
  • Collaborate with cross-functional teams to improve software delivery.

Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

Posted 2 months ago

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

  • At least 5 years of software engineering experience.
  • Strong understanding of data structures and algorithms related to performance and reliability.
  • Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
  • Strong skills around observability, debugging, and performance tuning.
  • Ability to debug complex systems and willingness to understand and improve any layer of the stack.
  • Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (Datadog, Graphite, Grafana, Prometheus).
  • Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
  • Strong communication skills and ability to explain technical concepts clearly.
  • Demonstrated critical thinking under pressure.
  • Build automation and improve systems to eliminate toil and operations work.
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Collaborate with the core infrastructure team to performance tune and optimize cloud deployments.
  • Collaborate with product teams to reduce service disruptions and automate incident response.
  • Proactively find and analyze reliability problems and design software for improvements.
  • Facilitate incident response, conduct root cause analysis, and blameless retrospectives.
  • Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform

Posted 3 months ago

📍 United States

💸 130,000 - 170,000 USD per year

🔍 Data-Powered Marketing Cloud

🏢 Company: Zeta Global | 👥 1001-5000 | 💰 $105,263,174 Post-IPO Equity 5 months ago | Information Services, Advertising, Analytics, Marketing

  • 7+ years of experience as an SRE.
  • 3+ years of software development experience, emphasizing automation.
  • Hands-on experience with Infrastructure as Code (IaC) tools.
  • Experience with distributed systems and microservices architecture.
  • Production experience with distributed tracing.
  • Proficiency in Python and Bash scripting.
  • Solid understanding of SLIs, SLOs, and error budgets.
  • Experience with CI/CD platforms like GitOps or Jenkins.
  • Expertise in incident management and root cause analysis.
  • Knowledge of modern deployment strategies like Canary and Blue-Green.
  • Familiarity with resiliency patterns such as circuit breakers and load balancing.
  • Experience with SQL and NoSQL databases in distributed systems.
  • Proficiency in statistical analysis related to metrics.
  • Experience with high-performance and low-latency systems.
  • Experience with cloud cost optimization strategies.
  • Familiarity with distributed messaging systems like Kafka.
  • Strong understanding of security and compliance standards in SRE.
  • Implement and manage service level objectives (SLOs), service level indicators (SLIs), and error budgets.
  • Lead and promote postmortems, driving robust root cause analysis for continuous system improvement.
  • Analyze historical data to identify areas for improvement.
  • Implement full observability using tools like OpenTelemetry, Honeycomb, New Relic, or Datadog.
  • Reduce toil through runbook automation and record key MTTx metrics.
  • Lead design sessions focusing on capacity planning and automation.
  • Collaborate with product teams to enhance reliability and engage in strategic initiatives.

Python, Software Development, SQL, Bash, Jenkins, Kafka, NoSQL, CI/CD, DevOps, Microservices, Compliance

Posted 3 months ago

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

  • Proficiency in programming languages such as Python, Go, or JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 3 months ago

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

  • Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
  • Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
  • Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
  • Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
  • Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
  • Enthusiasm for mentoring and sponsoring less-experienced engineers.
  • Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
  • Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
  • Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
  • Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
  • Participate in and continuously improve on-call and incident response processes.

AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React

Posted 5 months ago