Senior Site Reliability Engineer

Posted about 2 months ago

💎 Seniority level: Senior, 5+ years

📍 Location: US

💸 Salary: 176,000 - 205,300 USD per year

🔍 Industry: Logistics

🏢 Company: Flexe

⏳ Experience: 5+ years

🪄 Skills: AWS, Software Development, Cloud Computing, GCP, Kubernetes, Azure, CI/CD, DevOps, Terraform

Requirements:
  • 5+ years of commercial Software Development / Site Reliability / DevOps experience.
  • Bachelor's degree in Computer Science/Engineering or equivalent experience.
  • Experience with software engineering in both scripted and compiled languages.
  • Proven track record empowering other engineers and maintaining infrastructure as code.
  • Experience with at least one Cloud Platform provider including Google Cloud Platform, Amazon Web Services, or Microsoft Azure.
  • Experience with monitoring, log management, and CI/CD tools for distributed systems.
  • Experience with or a desire to learn technologies such as Kubernetes, Istio, and Terraform.
Responsibilities:
  • Evolving technology that is at the heart of our business.
  • Architecting and implementing systems that will impact our customers directly.
  • Working directly with customers and partners to design, implement, and deliver large scale technical projects.
  • Establishing goals for individual growth and mentoring other engineers.
  • Contributing to a healthy and productive culture by designing resilient systems.

Related Jobs

📍 United States

💸 99,300 - 124,100 USD per year

🔍 Software Development

🏢 Company: Natera · 👥 1001-5000 · 💰 $250,000,000 Post-IPO Equity over 1 year ago · 🫂 Last layoff almost 2 years ago · Women's, Biotechnology, Medical, Genetics, Health Diagnostics

  • Strong all around experience in Amazon Web Services, AWS certification preferred.
  • Experience with CloudFormation and Lambda / Serverless as part of infrastructure.
  • Solid experience with EKS, Kubernetes CKA certification preferred.
  • Strong experience with Terraform.
  • 3+ years of experience with programming languages such as Python, Java, or similar for scripting, automation, and building tools.
  • Good understanding of Docker and Linux / Unix administration.
  • Practical experience building CI/CD pipelines using GitLab or similar tools.
  • Practical experience managing applications deployed using Docker in Cloud.
  • Experience with container orchestration tools.
  • Strong communication skills, including the ability to justify and advocate for the right solution.
  • Develop automation and CI/CD processes to enable teams to build, test, deploy, manage, configure, secure, scale and monitor their applications using the latest technologies such as Docker, Kubernetes, Terraform and others.
  • Manage R&D AWS Infrastructure and accounts.
  • Work closely with teams inside R&D to investigate areas of improvement and eliminate bottlenecks.
  • Build and deploy cloud-based infrastructure to support R&D.
  • Participate in architectural decisions to help improve the quality of our infrastructure and applications.
  • Work tightly with groups within and external to R&D for best overall systems design and operations.
  • Be a cloud expert for your team and R&D teams.

AWS, Docker, Python, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

Posted 4 days ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Fetch

  • 1+ year(s) of experience in a software development-oriented role (e.g. Software Engineer, DevOps Engineer, Site Reliability Engineer).
  • Experience with one or more high-level programming languages (e.g. Java, Python, Go, C/C++).
  • Experience with cloud infrastructure (AWS strongly preferred).
  • Experience with containerization technologies (Docker, Kubernetes preferred).
  • Experience building CI/CD pipelines.
  • Experience with Unix/Linux operating system internals and networking.
  • Experience with analyzing and troubleshooting systems.
  • Experience monitoring and supporting microservice architectures.
  • Bachelor's or higher degree in Computer Science, related technical field, or equivalent practical experience.
  • Engage in and improve the whole lifecycle of services - from inception and design, through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and readiness reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems by participating in the on-call rotation.
  • Build and support AWS multi-account and multi-region infrastructure using a mix of managed services (e.g. S3, Lambda, RDS, etc.) and containerized infrastructure (e.g. EKS, ECS).
  • Grow the SRE team by mentoring engineers and participating in the hiring process.

AWS, Docker, Python, Software Development, SQL, Amazon RDS, AWS EKS, Bash, Cloud Computing, ElasticSearch, Git, Java, Kubernetes, API testing, Go, Java Spring, CI/CD, RESTful APIs, Linux, Terraform, Microservices, Troubleshooting, Ansible, Scripting, Debugging

Posted 17 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert · 👥 11-50 · 💰 $20,149,993 Seed 8 months ago · Data Management, SaaS, Application Performance Management

  • Experience in cloud infrastructure management
  • Knowledge of CI/CD processes
  • Experience with incident management
  • Design, build, and maintain scalable and secure cloud infrastructure as code
  • Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
  • Enable cost transparency and optimize infrastructure spending
  • Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
  • Build and maintain robust CI/CD pipelines that accelerate time from code to customer
  • Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
  • Lead and continuously improve our Incident Management process
  • Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
  • Develop and maintain incident response playbooks and post-mortem practices

AWS, Docker, CI/CD, Linux, Terraform

Posted 18 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured · Cloud Data Services, B2B, Cloud Security, Cyber Security

  • Experience in a start-up environment
  • Design and maintain highly available database solutions, ideally PostgreSQL
  • Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
  • Strong engineering background
  • Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)
  • Provision infrastructure and tooling
  • Create automated tooling to maintain the platform
  • Build methods for monitoring and scaling services
  • Implement security compliance strategies
  • Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

Posted 22 days ago

📍 Americas

🧭 Full-Time

💸 160,000 - 180,000 USD per year

🔍 Software Development

🏢 Company: Customer.io · 👥 251-500 · 💰 Series A about 3 years ago · Digital Media, SaaS, Product Search, Software

  • 7+ years of professional experience as a Site Reliability Engineer, with proven experience leading large complex projects affecting production SaaS environments.
  • Professional experience with relational database systems, managing the servers and tuning performance, particularly MySQL.
  • Proven experience managing scale, reliability, and performance challenges for distributed applications on cloud infrastructure (Google Cloud Platform is advantageous), using both managed and self-hosted solutions.
  • Proven ability to build cloud infrastructure using Terraform and develop operational tooling in various languages including Golang and Bash.
  • Deep knowledge of UNIX environments and modern collaborative development practices.
  • Excellent communication skills, both verbal and written, with a collaborative mindset to make informed, empathetic decisions.
  • Ability to work autonomously in your timezone, advancing tasks and projects with minimal guidance.
  • Demonstrated ability to influence product direction and contribute technical insights that help drive business value.
  • A strong focus on proactively identifying and resolving issues in production environments.
  • A self-starter who thrives in both synchronous and asynchronous work environments.
  • Architect and maintain critical infrastructure to enable Customer.io to scale and handle real-time processing of billions of messages.
  • Strategically plan and implement infrastructure growth to meet evolving demands and repeatability.
  • Streamline and automate processes for efficiency and reliability, removing manual toil.
  • Participate in on-call rotations to swiftly address availability incidents and support technical engineers with customer-related issues.
  • Develop observability to ensure comprehensive monitoring and effective alerting of infrastructure and applications.
  • Troubleshoot and resolve production issues across various services and stack levels.
  • Contribute to a collaborative and supportive team environment, fostering individual, professional, and team growth.
  • Engage in continuous learning and knowledge sharing through code reviews, pair programming, and team collaborations to refine best practices.

Backend Development, SQL, Bash, Cloud Computing, GCP, Kubernetes, MySQL, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, SaaS

Posted 29 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

  • 5+ years of AWS Experience.
  • 3+ years Kubernetes Experience.
  • Ruby on Rails experience.
  • Expert at profiling and benchmarking source code.
  • Effective at code review and at identifying potential performance problems before they reach production.
  • Experience with Datadog or other APM tools.
  • Excellent written and verbal communication skills.
  • Manage cloud infrastructure using Infrastructure as Code.
  • Manage and scale a Ruby on Rails stack.
  • Implement monitoring tools to improve observability.
  • Perform code review of new features to ensure they meet performance requirements.
  • Debug production issues across all levels of the stack.
  • Plan for the growth of, optimize, and automate Fleetio's Infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

Posted about 1 month ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117,600 - 252,000 USD per year

🔍 Software Development

🏢 Company: GitLab · 👥 1001-5000 · 💰 $268,000,000 Series E over 5 years ago · 🫂 Last layoff about 2 years ago · Developer Tools, DevOps, Open Source, SaaS, Cloud Security

  • Advanced database platform management experience, preferably using Postgres and ClickHouse at scale.
  • Advanced Cloud Infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators and Kubernetes.
  • Solid experience with at least one programming language: Go, Ruby or Python.
  • Advanced experience with Linux.
  • Extensive on-call experience as an SRE supporting mission critical systems.
  • Solid incident management experience across all phases.
  • Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.
  • Design, build, and maintain ClickHouse and PostgreSQL clusters.
  • Provision cloud infrastructure using configuration management and IaC tools.
  • Implement high-availability ClickHouse solutions.
  • Optimize PostgreSQL clusters for core applications.
  • Build monitoring and alerting tools to ensure resource optimization.
  • Respond to platform alerts and user emergencies.
  • Enhance infrastructure security and partner with compliance assessors.
  • Collaborate with engineering teams for service rollouts and architectural improvements.

PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

Posted about 1 month ago

📍 Worldwide

🧭 Contract

🔍 Software Development

🏢 Company: Teravision Technologies · 👥 251-500 · 💰 over 13 years ago · Android, iOS, Mobile Apps, Information Technology, Software

  • Experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
  • Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
  • Hands-on experience with AWS services such as S3, Route 53, and others.
  • Strong understanding of backend systems and infrastructure management.
  • Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
  • Prior experience in an on-call role.
  • Knowledge of monitoring and alerting tools to support on-call responsibilities.
NOT STATED

AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

Posted about 2 months ago

📍 United States

🔍 Cybersecurity

  • Must be a self-starter with a passion for cloud technology.
  • Strong problem-solving abilities are essential.
  • Experience in major public clouds and automation is required.
  • As a Senior Site Reliability Engineer within the Cloud Services group, you will be responsible for operating cutting-edge offerings from Cloud Service Providers.
  • You will directly support leading cloud software companies to enhance the reliability and scalability of their SaaS products.
  • This role entails problem-solving and ensuring seamless service to large enterprises and government agencies.

AWS, Docker, Python, Cloud Computing, Kubernetes, DevOps, Terraform

Posted 2 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

  • Degree in Computer Science or related field
  • 5+ years experience in site reliability engineering
  • Proficiency in AWS, Azure, or Google Cloud
  • Experience with IaC tools like Terraform or CloudFormation
  • Develop and document disaster recovery plans and procedures
  • Collaborate with teams to identify and mitigate risks
  • Monitor system performance and enhance reliability

AWS, Azure, Terraform

Posted 4 months ago