Senior Site Reliability Engineer

Posted 8 days ago

💎 Seniority level: Senior, 5+ years

📍 Location: USA

🔍 Industry: Software Development

🏢 Company: Dandy · 👥 501-1000 · Food and Beverage, Food Processing

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: GraphQL, Node.js, PostgreSQL, Cloud Computing, GCP, Kubernetes, TypeScript, Nest.js, CI/CD, DevOps, Terraform, Software Engineering

Requirements:
  • 5+ years of software engineering experience, preferably in a high growth startup environment
  • An expert in Google Cloud Platform and Google Kubernetes Engine
  • Experience with infrastructure as code platforms (Terraform, Pulumi)
  • Experience creating and maintaining fully automated CI/CD build processes for multiple environments
  • Experience designing the architecture and automation of infrastructure within a cloud environment
Responsibilities:
  • Develop and maintain infrastructure, systems, and tooling to support Dandy’s products in a secure, well-tested, and performant way.
  • Reinvent an analog experience and disrupt a legacy industry through novel and scalable system design.
  • Collaborate with Product Engineers and other stakeholders within Engineering, Product and Data to maintain a high bar for quality in a fast-paced, iterative environment.
  • Advocate for improvements to infrastructure quality, security, and performance.
  • Craft code that meets our internal standards for style, maintainability, and best practices.
  • Recognize impediments to our efficiency as a team ("technical debt"), propose and implement solutions.

Related Jobs

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Fetch

  • 1+ year(s) of experience in a software development-oriented role (e.g. Software Engineer, DevOps Engineer, Site Reliability Engineer)
  • Experience with one or more high-level programming languages (e.g. Java, Python, Go, C/C++)
  • Experience with cloud infrastructure (AWS strongly preferred)
  • Experience with containerization technologies (Docker, Kubernetes preferred)
  • Experience building CI/CD pipelines
  • Experience with Unix/Linux operating system internals and networking
  • Experience with analyzing and troubleshooting systems
  • Experience monitoring and supporting microservice architectures
  • Bachelor's or higher degree in Computer Science, related technical field, or equivalent practical experience
  • Engage in and improve the whole lifecycle of services - from inception and design, through deployment, operation, and refinement
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and readiness reviews
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Practice sustainable incident response and blameless postmortems by participating in the on-call rotation
  • Build and support AWS multi-account and multi-region infrastructure using a mix of managed services (e.g. S3, Lambda, RDS, etc.) and containerized infrastructure (e.g. EKS, ECS)
  • Grow the SRE team by mentoring engineers and participating in the hiring process

AWS, Docker, Python, Software Development, SQL, Amazon RDS, AWS EKS, Bash, Cloud Computing, ElasticSearch, Git, Java, Kubernetes, API testing, Go, Java Spring, CI/CD, RESTful APIs, Linux, Terraform, Microservices, Troubleshooting, Ansible, Scripting, Debugging

Posted 2 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Biotechnology

🏢 Company: Invert · 👥 11-50 · 💰 $20,149,993 Seed 8 months ago · Data Management, SaaS, Application Performance Management

  • Experience in cloud infrastructure
  • Strong incident management skills
  • Technical skills in software reliability
  • Design, build, and maintain scalable cloud infrastructure
  • Develop and enforce SLIs and SLOs
  • Create CI/CD pipelines
  • Lead Incident Management process

AWS, Docker, CI/CD, Linux, Terraform

Posted 3 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured · Cloud Data Services, B2B, Cloud Security, Cyber Security

  • Experience in a start-up environment
  • Design and maintain highly available database solutions, ideally PostgreSQL
  • Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
  • Strong engineering background
  • Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)
  • Provision infrastructure and tooling
  • Create automated tooling to maintain the platform
  • Build methods for monitoring and scaling services
  • Implement security compliance strategies
  • Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

Posted 7 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

  • 5+ years of AWS experience.
  • 3+ years of Kubernetes experience.
  • Ruby on Rails experience.
  • Expert at profiling and benchmarking source code.
  • Effective at code review and at identifying potential performance problems before they reach production.
  • Experience with Datadog or other APM tools.
  • Excellent written and verbal communication skills.
  • Manage cloud infrastructure using Infrastructure as Code.
  • Manage and scale a Ruby on Rails stack.
  • Implement monitoring tools to improve observability.
  • Perform code review of new features to ensure they meet performance requirements.
  • Debug production issues across all levels of the stack.
  • Plan for the growth of, optimize, and automate Fleetio’s infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

Posted 18 days ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117,600 - 252,000 USD per year

🔍 Software Development

🏢 Company: GitLab · 👥 1001-5000 · 💰 $268,000,000 Series E over 5 years ago · 🫂 Last layoff about 2 years ago · Developer Tools, DevOps, Open Source, SaaS, Cloud Security

  • Advanced database platform management experience, preferably using Postgres and ClickHouse at scale.
  • Advanced Cloud Infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators and Kubernetes.
  • Solid experience with at least one programming language: Go, Ruby or Python.
  • Advanced experience with Linux.
  • Extensive on-call experience as an SRE supporting mission-critical systems.
  • Solid incident management experience across all phases.
  • Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.
  • Design, build, and maintain ClickHouse and PostgreSQL clusters.
  • Provision cloud infrastructure using configuration management and IaC tools.
  • Implement high-availability ClickHouse solutions.
  • Optimize PostgreSQL clusters for core applications.
  • Build monitoring and alerting tools to ensure resource optimization.
  • Respond to platform alerts and user emergencies.
  • Enhance infrastructure security and partner with compliance assessors.
  • Collaborate with engineering teams for service rollouts and architectural improvements.

PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

Posted 23 days ago

📍 Colombia, USA

🧭 Contract

🔍 Software Development

🏢 Company: Teravision Technologies · 👥 251-500 · 💰 about 13 years ago · Android, iOS, Mobile Apps, Information Technology, Software

  • Experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
  • Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
  • Hands-on experience with AWS services such as S3, Route 53, and others.
  • Strong understanding of backend systems and infrastructure management.
  • Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
  • Prior experience in an on-call role.
  • Knowledge of monitoring and alerting tools to support on-call responsibilities.
  • Bachelor’s Degree in Computer Science or equivalent work experience.
NOT STATED

AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

Posted about 1 month ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

  • Degree in Computer Science or related field
  • 5+ years experience in site reliability engineering
  • Proficiency in AWS, Azure, or Google Cloud
  • Experience with IaC tools like Terraform or CloudFormation
  • Develop and document disaster recovery plans and procedures
  • Collaborate with teams to identify and mitigate risks
  • Monitor system performance and enhance reliability

AWS, Azure, Terraform

Posted 3 months ago

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

  • 5+ years of software engineering experience.
  • Strong understanding of data structures and algorithms related to performance and reliability.
  • Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
  • Strong skills around observability, debugging, and performance tuning.
  • Ability to debug complex systems and willingness to understand and improve any layer of the stack.
  • Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (Datadog, Graphite, Grafana, Prometheus).
  • Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
  • Strong communication skills and ability to explain technical concepts clearly.
  • Demonstrated critical thinking under pressure.
  • Build automation and improve systems to eliminate toil and operations work.
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Collaborate with the core infrastructure team to performance tune and optimize cloud deployments.
  • Collaborate with product teams to reduce service disruptions and automate incident response.
  • Proactively find and analyze reliability problems and design software for improvements.
  • Facilitate incident response, conduct root cause analysis, and run blameless retrospectives.
  • Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform

Posted 4 months ago

📍 United States

💸 130,000 - 170,000 USD per year

🔍 Data-Powered Marketing Cloud

🏢 Company: Zeta Global · 👥 1001-5000 · 💰 $105,263,174 Post-IPO Equity 6 months ago · Information Services, Advertising, Analytics, Marketing

  • 7+ years of experience as an SRE.
  • 3+ years of software development experience, emphasizing automation.
  • Hands-on experience with Infrastructure as Code (IaC) tools.
  • Experience with distributed systems and microservices architecture.
  • Production experience with distributed tracing.
  • Proficiency in Python and Bash scripting.
  • Solid understanding of SLIs, SLOs, and error budgets.
  • Experience with CI/CD platforms like GitOps or Jenkins.
  • Expertise in incident management and root cause analysis.
  • Knowledge of modern deployment strategies like Canary and Blue-Green.
  • Familiarity with resiliency patterns such as circuit breakers and load balancing.
  • Experience with SQL and NoSQL databases in distributed systems.
  • Proficiency in statistical analysis related to metrics.
  • Experience with high-performance and low-latency systems.
  • Experience with cloud cost optimization strategies.
  • Familiarity with distributed messaging systems like Kafka.
  • Strong understanding of security and compliance standards in SRE.
  • Implement and manage service level objectives (SLOs), service level indicators (SLIs), and error budgets.
  • Lead and promote postmortems, driving robust root cause analysis for continuous system improvement.
  • Analyze historical data to identify areas for improvement.
  • Implement full observability using tools like OpenTelemetry, Honeycomb, New Relic, or Datadog.
  • Reduce toil through runbook automation and record key MTTx metrics.
  • Lead design sessions focusing on capacity planning and automation.
  • Collaborate with product teams to enhance reliability and engage in strategic initiatives.

Python, Software Development, SQL, Bash, Jenkins, Kafka, NoSQL, CI/CD, DevOps, Microservices, Compliance

Posted 4 months ago

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

  • Proficiency in programming languages such as Python, Go, JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, ElasticSearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 4 months ago