Senior Site Reliability Engineer

Posted 1 day ago

📍 Location: United States, Europe

🔍 Industry: Biotechnology

🏢 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed 8 months ago | Data Management, SaaS, Application Performance Management

🗣️ Languages: English

🪄 Skills: AWS, Docker, CI/CD, Linux, Terraform

Requirements:
  • Experience in cloud infrastructure
  • Strong incident management skills
  • Technical skills in software reliability
Responsibilities:
  • Design, build, and maintain scalable cloud infrastructure
  • Develop and enforce SLIs and SLOs
  • Create CI/CD pipelines
  • Lead Incident Management process

Related Jobs

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured | Cloud Data Services, B2B, Cloud Security, Cyber Security

  • Experience in a start-up environment
  • Design and maintain highly available database solutions, ideally PostgreSQL
  • Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
  • Strong engineering background
  • Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)
  • Provision infrastructure and tooling
  • Create automated tooling to maintain the platform
  • Build methods for monitoring and scaling services
  • Implement security compliance strategies
  • Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

Posted 6 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

  • 5+ years of AWS Experience.
  • 3+ years Kubernetes Experience.
  • Ruby on Rails experience.
  • Expert at profiling and benchmarking source code.
  • Effective at code review, and identifying potential performance problems before they reach production.
  • Experience with Datadog or other APM tools.
  • Excellent written and verbal communication skills.
  • Manage cloud infrastructure using Infrastructure as Code.
  • Manage and scale a Ruby on Rails stack.
  • Implement monitoring tools to improve observability.
  • Perform code review of new features to ensure they meet performance requirements.
  • Debug production issues across all levels of the stack.
  • Plan for the growth of, optimize, and automate Fleetio's infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

Posted 17 days ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117,600 - 252,000 USD per year

🔍 Software Development

🏢 Company: GitLab | 👥 1001-5000 | 💰 $268,000,000 Series E over 5 years ago | 🫂 Last layoff about 2 years ago | Developer Tools, DevOps, Open Source, SaaS, Cloud Security

  • Advanced database platform management experience, preferably using PostgreSQL and ClickHouse at scale.
  • Advanced Cloud Infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators and Kubernetes.
  • Solid experience with at least one programming language: Go, Ruby or Python.
  • Advanced experience with Linux.
  • Extensive on-call experience as an SRE supporting mission critical systems.
  • Solid incident management experience across all phases.
  • Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.
  • Design, build, and maintain ClickHouse and PostgreSQL clusters.
  • Provision cloud infrastructure using configuration management and IaC tools.
  • Implement high-availability ClickHouse solutions.
  • Optimize PostgreSQL clusters for core applications.
  • Build monitoring and alerting tools to ensure resource optimization.
  • Respond to platform alerts and user emergencies.
  • Enhance infrastructure security and partner with compliance assessors.
  • Collaborate with engineering teams for service rollouts and architectural improvements.

PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

Posted 21 days ago

📍 United Kingdom

🧭 Contract

🔍 SaaS platform accelerating digital transformation in the restaurant industry

Requirements: not stated
Responsibilities: not stated

AWS, Docker, Python, SQL, CI/CD, DevOps, Microservices

Posted 27 days ago

📍 Colombia, USA

🧭 Contract

🔍 Software Development

🏢 Company: Teravision Technologies | 👥 251-500 | 💰 about 13 years ago | Android, iOS, Mobile Apps, Information Technology, Software

  • Experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
  • Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
  • Hands-on experience with AWS services such as S3, Route 53, and others.
  • Strong understanding of backend systems and infrastructure management.
  • Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
  • Prior experience in an on-call role.
  • Knowledge of monitoring and alerting tools to support on-call responsibilities.
  • Bachelor's degree in Computer Science or equivalent work experience.
Responsibilities: not stated

AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

Posted 29 days ago

📍 United States

🧭 Full-Time

💸 127,000 - 249,000 USD per year

🔍 Database and Cloud Services

🏢 Company: MongoDB | 👥 1001-5000 | 💰 Post-IPO Equity almost 7 years ago | Database, Open Source, Cloud Computing, SaaS, Software

  • Experience running a mission-critical service at scale.
  • Understanding of information security issues.
  • Prior experience with critical production systems in a Linux environment.
  • Proficiency in at least one modern programming language, beyond basic scripting.
  • Solid understanding of web and network protocols and standards (HTTP, TLS, DNS, etc.).
  • Bachelor's degree in Computer Science or equivalent experience.
  • Experience writing automation tools and eagerness to automate.
  • Design and build the infrastructure for a global cloud service that comprises hundreds of thousands of MongoDB clusters.
  • Implement and troubleshoot automation and monitoring of global services spanning several cloud providers.
  • Optimize infrastructure performance from application level to firmware.
  • Participate in a weekly on-call rotation.
  • Improve infrastructure capabilities, focusing on cost, simplicity, and maintainability.

Linux

Posted 3 months ago

📍 Spain

🧭 Full-Time

🔍 Mobility services

🏢 Company: Cabify | 👥 1001-5000 | 💰 $16,473,668 Debt Financing about 1 year ago | Internet, Logistics, Ride Sharing, Transportation, Mobile

  • Strong knowledge of Unix, networking stack, OSI model, containers, and monitoring.
  • Programming skills in at least one language; capability to learn others.
  • Natural tendency to automate tasks.
  • Effective and asynchronous communication skills.
  • Care for the company, team, and self.
  • Embrace diversity and humility.
  • Action-oriented and iterative problem solving.
  • Preference for simplicity over complexity.
  • Ability to identify and address bottlenecks.
  • Proficiency in English communication.
  • Evolving our infrastructure platform building self-service components.
  • Working closely with Product and Infrastructure teams to develop infrastructure components.
  • Designing and implementing tooling for service availability, scalability, observability, and latency improvements.
  • Increasing reliability awareness with teams and reviewing implementations.
  • Defining SLIs, SLOs and SLAs as part of services' lifecycle.
  • Sharing an on-call schedule for owned platform services.
  • Solving problems in a highly available platform and building automations to prevent incidents.
  • Participating in the recruiting process to grow the engineering team.

AWS, AWS EKS, Kubernetes, Microservices, Networking

Posted 3 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

  • Degree in Computer Science or related field
  • 5+ years experience in site reliability engineering
  • Proficiency in AWS, Azure, or Google Cloud
  • Experience with IaC tools like Terraform or CloudFormation
  • Develop and document disaster recovery plans and procedures
  • Collaborate with teams to identify and mitigate risks
  • Monitor system performance and enhance reliability

AWS, Azure, Terraform

Posted 3 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Security and fraud detection

🏢 Company: DataVisor

  • 5+ years of experience with production environment running Linux.
  • 3+ years of experience with cloud solutions such as AWS, Azure, or Aliyun.
  • Familiarity with big data technologies such as Spark and/or Flink.
  • Passion for automating tasks through coding and scripting.
  • Experience with algorithms, data structures, complexity analysis, and software design.
  • Proficient coding skills in Python, Java, and Bash.
  • Design, implement, and maintain release automation pipelines to streamline the deployment process.
  • Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
  • Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling data processing pipelines.
  • Perform maintenance and troubleshooting for databases, preferably Yugabyte, ClickHouse, and MySQL.
  • Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
  • Participate in on-call rotation for system reliability, focusing on automation to minimize manual intervention.
  • Collaborate with engineering teams to enhance system performance and manage capacity planning.

Linux

Posted 3 months ago

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

  • 5+ years of software engineering experience.
  • Strong understanding of data structures and algorithms related to performance and reliability.
  • Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
  • Strong skills around observability, debugging, and performance tuning.
  • Ability to debug complex systems and willingness to understand and improve any layer of the stack.
  • Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (DataDog, Graphite, Grafana, Prometheus).
  • Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
  • Strong communication skills and ability to explain technical concepts clearly.
  • Demonstrated critical thinking under pressure.
  • Build automation and improve systems to eliminate toil and operations work.
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Collaborate with the core infrastructure team to performance tune and optimize cloud deployments.
  • Collaborate with product teams to reduce service disruptions and automate incident response.
  • Proactively find and analyze reliability problems and design software for improvements.
  • Facilitate incident response, conduct root cause analysis, and blameless retrospectives.
  • Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform

Posted 4 months ago