Senior Site Reliability Engineer

Posted about 2 months ago

💎 Seniority level: Senior

📍 Location: Worldwide

🔍 Industry: Software Development

🏢 Company: Teravision Technologies | 👥 251-500 | 💰 over 13 years ago | Android, iOS, Mobile Apps, Information Technology, Software

🗣️ Languages: English

🪄 Skills: AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

Requirements:
  • Experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
  • Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
  • Hands-on experience with AWS services such as S3, Route 53, and others.
  • Strong understanding of backend systems and infrastructure management.
  • Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
  • Prior experience in an on-call role.
  • Knowledge of monitoring and alerting tools to support on-call responsibilities.
Responsibilities:
Not stated

Related Jobs

📍 Poland

🔍 Software Development

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with sustainable incident response in a blameless environment
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
  • Background in Linux Systems Engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless
Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities

AWS, Docker, Node.js, Python, Bash, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Algorithms, Data Structures, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Ansible, Scripting, Software Engineering, Debugging

Posted 3 days ago

📍 United States

💸 99,300 - 124,100 USD per year

🔍 Software Development

🏢 Company: Natera | 👥 1001-5000 | 💰 $250,000,000 Post-IPO Equity over 1 year ago | 🫂 Last layoff almost 2 years ago | Women's, Biotechnology, Medical, Genetics, Health Diagnostics

Requirements:
  • Strong all-around experience in Amazon Web Services, AWS certification preferred.
  • Experience with CloudFormation and Lambda / Serverless as part of infrastructure.
  • Solid experience with EKS, Kubernetes CKA certification preferred.
  • Strong experience with Terraform.
  • 3+ years of experience with programming languages such as Python, Java, or similar for scripting, automation, and building tools.
  • Good understanding of Docker and Linux / Unix administration.
  • Practical experience building CI/CD pipelines using GitLab or similar tools.
  • Practical experience managing applications deployed using Docker in Cloud.
  • Experience with container orchestration tools.
  • Strong communication skills, including the ability to justify and stand behind the proper solution.
Responsibilities:
  • Develop automation and CI/CD processes to enable teams to build, test, deploy, manage, configure, secure, scale and monitor their applications using the latest technologies such as Docker, Kubernetes, Terraform and others.
  • Manage R&D AWS Infrastructure and accounts.
  • Work closely with teams inside R&D to investigate areas of improvement and eliminate bottlenecks.
  • Build and deploy cloud-based infrastructure to support R&D.
  • Participate in architectural decisions to help improve the quality of our infrastructure and applications.
  • Work tightly with groups within and external to R&D for best overall systems design and operations.
  • Be a cloud expert for your team and R&D teams.

AWS, Docker, Python, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

Posted 4 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy | 👥 5001-10000 | 💰 $800,000,000 Post-IPO Equity about 3 years ago | 🫂 Last layoff over 1 year ago | Web Hosting, Domain Registrar, Web Development, Online Portals

Requirements:
  • A track record of delivering capabilities that build customer value and business impact.
  • Knowledge of principles for building performant and quality REST APIs.
  • Experience with testing code, care and feeding of both on-premises and cloud compute systems, Docker and other container-related technologies, Python or similar languages, and HashiCorp Vault or similar tooling.
Responsibilities:
  • Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
  • Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
  • Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
  • Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

Posted 8 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Braze | 👥 1001-5000 | 💰 Grant over 1 year ago | CRM, Analytics, Marketing, Marketing Automation, Software

Requirements:
  • 5+ years of experience as a Software, DevOps, or Site Reliability Engineer
  • 3+ years of Data Streaming Reliability Engineering
  • Experience in monitoring, troubleshooting, and optimizing Kafka streaming applications, including diagnosing lag, partition imbalances, consumer group issues, and broker failures
  • Expertise in setting up alerting, dashboards, and runbooks for high-availability and fault-tolerant streaming pipelines
  • 3+ years of Kafka performance tuning & automation
  • Strong background in scaling Kafka clusters, tuning producer/consumer configurations, and managing schema evolution.
  • Proficiency in infrastructure automation (Terraform, Ansible, Kubernetes) and CI/CD practices to streamline deployments and ensure resilient data streaming workflows.
  • You think about systems - interfaces, boundaries, edge cases, failure modes, behaviors, specific implementations
  • Have an urge to collaborate, document, and deliver quickly
  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
  • Have a desire to solve everyday challenges facing software engineers and automate their toil away
  • Have an excellent ability to manage multiple tasks and expectations at once
  • Know your way around Linux and Unix Shell.
  • Have strong programming skills - Ruby and/or Go preferred
  • Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies
  • Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies
Responsibilities:
  • Partner with Braze’s engineering teams on: Architecting products to effectively utilize infrastructure platforms in a scalable, reliable manner
  • Debugging reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
  • Make monitoring and alerting alert on symptoms, not on outages
  • Ensure that Braze meets our strict enterprise-grade SLAs with customers
  • Develop Braze’s internal platform infrastructure: Create Infrastructure as Code using Chef, Terraform, and Kubernetes
  • Develop deployment pipelines for applications in multiple languages using Docker, Kubernetes, etc.
  • Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, capacity management, reducing operational pain, and improving the day-to-day workflow of Braze’s engineering teams
  • Manage incidents: Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers
  • Use your on-call shift to prevent incidents from ever happening
  • Retrospect everything that happens to turn lessons into system improvements/changes, automation, etc.

Docker, Kafka, Kubernetes, MongoDB, Ruby, Go, Redis, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, Ansible

Posted 9 days ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Fetch

Requirements:
  • 1+ year(s) of experience in a software development-oriented role (e.g. Software Engineer, DevOps Engineer, Site Reliability Engineer).
  • Experience with one or more high-level programming languages (e.g. Java, Python, Go, C/C++).
  • Experience with cloud infrastructure (AWS strongly preferred).
  • Experience with containerization technologies (Docker, Kubernetes preferred).
  • Experience building CI/CD pipelines.
  • Experience with Unix/Linux operating system internals and networking.
  • Experience with analyzing and troubleshooting systems.
  • Experience monitoring and supporting microservice architectures.
  • Bachelor's or higher degree in Computer Science, related technical field, or equivalent practical experience.
Responsibilities:
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and readiness reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems by participating in the on-call rotation.
  • Build and support AWS multi-account and multi-region infrastructure using a mix of managed services (e.g. S3, Lambda, RDS, etc.) and containerized infrastructure (e.g. EKS, ECS).
  • Grow the SRE team by mentoring engineers and participating in the hiring process.

AWS, Docker, Python, Software Development, SQL, Amazon RDS, AWS EKS, Bash, Cloud Computing, ElasticSearch, Git, Java, Kubernetes, API testing, Go, Java Spring, CI/CD, RESTful APIs, Linux, Terraform, Microservices, Troubleshooting, Ansible, Scripting, Debugging

Posted 17 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed 8 months ago | Data Management, SaaS, Application Performance Management

Requirements:
  • Experience in cloud infrastructure management
  • Knowledge of CI/CD processes
  • Experience with incident management
Responsibilities:
  • Design, build, and maintain scalable and secure cloud infrastructure as code
  • Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
  • Enable cost transparency and optimize infrastructure spending
  • Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
  • Build and maintain robust CI/CD pipelines that accelerate time from code to customer
  • Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
  • Lead and continuously improve our Incident Management process
  • Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
  • Develop and maintain incident response playbooks and post-mortem practices

AWS, Docker, CI/CD, Linux, Terraform

Posted 18 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity | 👥 51-200 | 💰 Corporate over 2 years ago | Software Development

Requirements:
  • Proven experience with SRE/DevOps tools, processes, and culture.
  • Proficient in programming languages like Python, Go, and TypeScript.
  • 5+ years of experience participating in an SRE on-call rotation.
  • Analytical mindset for designing, diagnosing, and optimizing infrastructure.
  • Skilled in managing scalable, highly available, cloud-based applications.
  • Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
  • Strong database management skills, particularly with PostgreSQL.
  • Experience with infrastructure as code, using tools like Terraform.
  • Proficient in building and maintaining CI/CD pipelines.
  • Familiarity with observability tools like Prometheus and similar stacks.
  • Calm and clear-headed in incident and outage situations, with a thoughtful communication style for high-pressure environments.
  • Open-minded yet discerning when it comes to exploring new technologies.
Responsibilities:
  • Plan and implement a global platform for delivering our software as a service.
  • Diagnose and troubleshoot complex distributed systems.
  • Ensure observability and analyze the behavior of our stack.
  • Handle orchestration, deployment, monitoring, and automation.
  • Participate in our on-call rotation.

PostgreSQL, Python, Cloud Computing, ElasticSearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

Posted 18 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏒 Company: AssuredCloud Data ServicesB2BCloud SecurityCyber Security

Requirements:
  • Experience in a start-up environment
  • Design and maintain highly available database solutions, ideally PostgreSQL
  • Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
  • Strong engineering background
  • Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)
Responsibilities:
  • Provision infrastructure and tooling
  • Create automated tooling to maintain the platform
  • Build methods for monitoring and scaling services
  • Implement security compliance strategies
  • Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

Posted 22 days ago

📍 Americas

🧭 Full-Time

💸 160,000 - 180,000 USD per year

🔍 Software Development

🏢 Company: Customer.io | 👥 251-500 | 💰 Series A about 3 years ago | Digital Media, SaaS, Product Search, Software

Requirements:
  • 7+ years of professional experience as a Site Reliability Engineer, with proven experience leading large, complex projects affecting production SaaS environments.
  • Professional experience with relational database systems, managing the servers and tuning performance, particularly MySQL.
  • Proven experience handling scale, reliability, and performance challenges when managing distributed applications on cloud infrastructure (Google Cloud Platform is advantageous), across both managed and self-hosted solutions.
  • Proven ability to build cloud infrastructure using Terraform and develop operational tooling in various languages including Golang and Bash.
  • Deep knowledge of UNIX environments and modern collaborative development practices.
  • Excellent communication skills, both verbal and written, with a collaborative mindset to make informed, empathetic decisions.
  • Ability to work autonomously in your timezone, advancing tasks and projects with minimal guidance.
  • Demonstrated ability to influence product direction and contribute technical insights that help drive business value.
  • A strong focus on proactively identifying and resolving issues in production environments.
  • A self-starter who thrives in both synchronous and asynchronous work environments.
Responsibilities:
  • Architect and maintain critical infrastructure to enable Customer.io to scale and handle real-time processing of billions of messages.
  • Strategically plan and implement infrastructure growth to meet evolving demands and repeatability.
  • Streamline and automate processes for efficiency and reliability, removing manual toil.
  • Participate in on-call rotations to swiftly address availability incidents and support technical engineers with customer-related issues.
  • Develop observability to ensure comprehensive monitoring and effective alerting of infrastructure and applications.
  • Troubleshoot and resolve production issues across various services and stack levels.
  • Contribute to a collaborative and supportive team environment, fostering individual, professional, and team growth.
  • Engage in continuous learning and knowledge sharing through code reviews, pair programming, and team collaborations to refine best practices.

Backend Development, SQL, Bash, Cloud Computing, GCP, Kubernetes, MySQL, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, SaaS

Posted 29 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

Requirements:
  • 5+ years of AWS experience.
  • 3+ years of Kubernetes experience.
  • Ruby on Rails experience.
  • Expert at profiling and benchmarking source code.
  • Effective at code review, and identifying potential performance problems before they reach production.
  • Experience with Datadog or other APM tools.
  • Excellent written and verbal communication skills.
Responsibilities:
  • Manage cloud infrastructure using Infrastructure as Code.
  • Manage and scale a Ruby on Rails stack.
  • Implement monitoring tools to improve observability.
  • Perform code review of new features to ensure they meet performance requirements.
  • Debug production issues across all levels of the stack.
  • Plan for the growth of, optimize, and automate Fleetio’s infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

Posted about 1 month ago