Senior Site Reliability Engineer

Posted about 2 months ago

💎 Seniority level: Senior

📍 Location: Worldwide

🔍 Industry: Software Development

🏢 Company: Teravision Technologies · 👥 251-500 · 💰 over 13 years ago · Android, iOS, Mobile Apps, Information Technology, Software

🗣️ Languages: English

🪄 Skills: AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

Requirements:

Experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
Hands-on experience with AWS services such as S3, Route 53, and others.
Strong understanding of backend systems and infrastructure management.
Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
Prior experience in an on-call role.
Knowledge of monitoring and alerting tools to support on-call responsibilities.

Responsibilities:

Not stated

Related Jobs

🔥 Senior Site Reliability Engineer (SRE) - Poland

Posted 3 days ago

📍 Poland

🔍 Software Development

🔧 Requirements

Extensive experience with enterprise scale continuous delivery environments
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with sustainable incident response in a blameless environment
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
Background in Linux Systems Engineering
Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
On-call responsibilities

🪄 Skills: AWS, Docker, Node.js, Python, Bash, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Algorithms, Data Structures, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Ansible, Scripting, Software Engineering, Debugging

🔥 Senior Site Reliability Engineer

Posted 4 days ago

📍 United States

💸 99,300 - 124,100 USD per year

🔍 Software Development

🏢 Company: Natera · 👥 1001-5000 · 💰 $250,000,000 Post-IPO Equity over 1 year ago · 🫂 Last layoff almost 2 years ago · Women's, Biotechnology, Medical, Genetics, Health Diagnostics

🔧 Requirements

Strong all around experience in Amazon Web Services, AWS certification preferred.
Experience with CloudFormation and Lambda / Serverless as part of infrastructure.
Solid experience with EKS, Kubernetes CKA certification preferred.
Strong experience with Terraform.
3+ years of experience with programming languages such as Python, Java, or similar for scripting, automation, and building tools.
Good understanding of Docker and Linux / Unix administration.
Practical experience building CI/CD pipelines using GitLab or similar tools.
Practical experience managing applications deployed using Docker in Cloud.
Experience with container orchestration tools.
Strong communication skills, including the ability to justify and stand up for the right solution.

💡 Responsibilities

Develop automation and CI/CD processes to enable teams to build, test, deploy, manage, configure, secure, scale and monitor their applications using the latest technologies such as Docker, Kubernetes, Terraform and others.
Manage R&D AWS Infrastructure and accounts.
Work closely with teams inside R&D to investigate areas of improvement and eliminate bottlenecks.
Build and deploy cloud-based infrastructure to support R&D.
Participate in architectural decisions to help improve the quality of our infrastructure and applications.
Work tightly with groups within and external to R&D for best overall systems design and operations.
Be a cloud expert for your team and R&D teams.

🪄 Skills: AWS, Docker, Python, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

🔥 Senior Site Reliability Engineer - Linux

Posted 8 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy · 👥 5001-10000 · 💰 $800,000,000 Post-IPO Equity about 3 years ago · 🫂 Last layoff over 1 year ago · Web Hosting, Domain Registrar, Web Development, Online Portals

🔧 Requirements

A track record of delivering capabilities that build customer value and business impact.
Knowledge of principles for building performant and quality REST APIs.
Experience with testing code, the care and feeding of both on-premises and cloud compute systems, Docker and other container-related technologies, Python or similar languages, and HashiCorp Vault or similar tooling.

💡 Responsibilities

Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

🪄 Skills: Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

🔥 Senior Site Reliability Engineer II (Kafka)

Posted 9 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Braze · 👥 1001-5000 · 💰 Grant over 1 year ago · CRM, Analytics, Marketing, Marketing Automation, Software

🔧 Requirements

5+ years of experience as a Software, DevOps, or Site Reliability Engineer
3+ years of Data Streaming Reliability Engineering
Experience in monitoring, troubleshooting, and optimizing Kafka streaming applications, including diagnosing lag, partition imbalances, consumer group issues, and broker failures
Expertise in setting up alerting, dashboards, and runbooks for high-availability and fault-tolerant streaming pipelines
3+ years of Kafka performance tuning & automation
Strong background in scaling Kafka clusters, tuning producer/consumer configurations, and managing schema evolution.
Proficiency in infrastructure automation (Terraform, Ansible, Kubernetes) and CI/CD practices to streamline deployments and ensure resilient data streaming workflows.
You think about systems - interfaces, boundaries, edge cases, failure modes, behaviors, specific implementations
Have an urge to collaborate, document, and deliver quickly
Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
Have a desire to solve everyday challenges facing software engineers and automate their toil away
Have an excellent ability to manage multiple tasks and expectations at once
Know your way around Linux and Unix Shell.
Have strong programming skills - Ruby and/or Go preferred
Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies
Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies

💡 Responsibilities

Partner with Braze’s engineering teams on: Architecting products to effectively utilize infrastructure platforms in a scalable, reliable manner
Debugging reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
Make monitoring and alerting fire on symptoms rather than on outages
Ensure that Braze meets our strict enterprise-grade SLAs with customers
Develop Braze’s internal platform infrastructure: Create Infrastructure as code using Chef, Terraform, and Kubernetes
Develop deployment pipelines for applications in multiple languages using Docker, Kubernetes, etc.
Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, capacity management, reducing operational pain, and improving the day-to-day workflow of Braze’s engineering teams
Manage incidents: Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers
Use your on-call shift to prevent incidents from ever happening
Retrospect everything that happens to turn lessons into system improvements/changes, automation, etc.

🪄 Skills: Docker, Kafka, Kubernetes, MongoDB, Ruby, Go, Redis, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, Ansible

🔥 Senior Site Reliability Engineer

Posted 17 days ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Fetch

🔧 Requirements

1+ year(s) of experience in a software development-oriented role (e.g. Software Engineer, DevOps Engineer, Site Reliability Engineer).
Experience with one or more high-level programming languages (e.g. Java, Python, Go, C/C++).
Experience with cloud infrastructure (AWS strongly preferred).
Experience with containerization technologies (Docker, Kubernetes preferred).
Experience building CI/CD pipelines.
Experience with Unix/Linux operating system internals and networking.
Experience with analyzing and troubleshooting systems.
Experience monitoring and supporting microservice architectures.
Bachelor's or higher degree in Computer Science, related technical field, or equivalent practical experience.

💡 Responsibilities

Engage in and improve the whole lifecycle of services - from inception and design, through deployment, operation, and refinement.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and readiness reviews.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems by participating in the on-call rotation.
Build and support AWS multi-account and multi-region infrastructure using a mix of managed services (e.g. S3, Lambda, RDS, etc.) and containerized infrastructure (e.g. EKS, ECS).
Grow the SRE team by mentoring engineers and participating in the hiring process.

🪄 Skills: AWS, Docker, Python, Software Development, SQL, Amazon RDS, AWS EKS, Bash, Cloud Computing, ElasticSearch, Git, Java, Kubernetes, API testing, Go, Java Spring, CI/CD, RESTful APIs, Linux, Terraform, Microservices, Troubleshooting, Ansible, Scripting, Debugging

🔥 Senior Site Reliability Engineer

Posted 18 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert · 👥 11-50 · 💰 $20,149,993 Seed 8 months ago · Data Management, SaaS, Application Performance Management

🔧 Requirements

Experience in cloud infrastructure management
Knowledge of CI/CD processes
Experience with incident management

💡 Responsibilities

Design, build, and maintain scalable and secure cloud infrastructure as code
Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
Enable cost transparency and optimize infrastructure spending
Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
Build and maintain robust CI/CD pipelines that accelerate time from code to customer
Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
Lead and continuously improve our Incident Management process
Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
Develop and maintain incident response playbooks and post-mortem practices

🪄 Skills: AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 18 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity · 👥 51-200 · 💰 Corporate over 2 years ago · Software Development

🔧 Requirements

Proven experience with SRE/DevOps tools, processes, and culture.
Proficient in programming languages like Python, Go, and TypeScript.
5+ years of experience participating in an SRE on-call rotation.
Analytical mindset for designing, diagnosing, and optimizing infrastructure.
Skilled in managing scalable, highly available, cloud-based applications.
Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
Strong database management skills, particularly with PostgreSQL.
Experience with infrastructure as code, using tools like Terraform.
Proficient in building and maintaining CI/CD pipelines.
Familiarity with observability tools like Prometheus and similar stacks.
Calm and clear-headed in incident and outage situations, with a thoughtful communication style for high-pressure environments.
Open-minded yet discerning when it comes to exploring new technologies.

💡 Responsibilities

Plan and implement a global platform for delivering our software as a service.
Diagnose and troubleshoot complex distributed systems.
Ensure observability and analyze the behavior of our stack.
Orchestration, deployment, monitoring, automation.
Participate in our on-call rotation.

🪄 Skills: PostgreSQL, Python, Cloud Computing, ElasticSearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

🔥 Senior Site Reliability Engineer

Posted 22 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured · Cloud Data Services, B2B, Cloud Security, Cyber Security

🔧 Requirements

Experience in a start-up environment
Design and maintain highly available database solutions, ideally PostgreSQL
Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
Strong engineering background
Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)

💡 Responsibilities

Provision infrastructure and tooling
Create automated tooling to maintain the platform
Build methods for monitoring and scaling services
Implement security compliance strategies
Lead and mentor engineering team

🪄 Skills: AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

🔥 Senior Site Reliability Engineer - Americas

Posted 29 days ago

📍 Americas

🧭 Full-Time

💸 160,000 - 180,000 USD per year

🔍 Software Development

🏢 Company: Customer.io · 👥 251-500 · 💰 Series A about 3 years ago · Digital Media, SaaS, Product Search, Software

🔧 Requirements

7+ years of professional experience as a Site Reliability Engineer, with proven experience leading large complex projects affecting production SaaS environments.
Professional experience with relational database systems, managing the servers and tuning performance, particularly MySQL.
Proven experience managing the scale, reliability, and performance challenges of distributed applications on cloud infrastructure (Google Cloud Platform is advantageous), across both managed and self-hosted solutions.
Proven ability to build cloud infrastructure using Terraform and develop operational tooling in various languages including Golang and Bash.
Deep knowledge of UNIX environments and modern collaborative development practices.
Excellent communication skills, both verbal and written, with a collaborative mindset to make informed, empathetic decisions.
Ability to work autonomously in your timezone, advancing tasks and projects with minimal guidance.
Demonstrated ability to influence product direction and contribute technical insights that help drive business value.
A strong focus on proactively identifying and resolving issues in production environments.
A self-starter who thrives in both synchronous and asynchronous work environments.

💡 Responsibilities

Architect and maintain critical infrastructure to enable Customer.io to scale and handle real-time processing of billions of messages.
Strategically plan and implement infrastructure growth to meet evolving demands and repeatability.
Streamline and automate processes for efficiency and reliability, removing manual toil.
Participate in on-call rotations to swiftly address availability incidents and support technical engineers with customer-related issues.
Develop observability to ensure comprehensive monitoring and effective alerting of infrastructure and applications.
Troubleshoot and resolve production issues across various services and stack levels.
Contribute to a collaborative and supportive team environment, fostering individual, professional, and team growth.
Engage in continuous learning and knowledge sharing through code reviews, pair programming, and team collaborations to refine best practices.

🪄 Skills: Backend Development, SQL, Bash, Cloud Computing, GCP, Kubernetes, MySQL, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, SaaS

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

🔧 Requirements

5+ years of AWS Experience.
3+ years Kubernetes Experience.
Ruby on Rails experience.
Expert at profiling and benchmarking source code.
Effective at code review, and identifying potential performance problems before they reach production.
Experience with Datadog or other APM tools.
Excellent written and verbal communication skills.

💡 Responsibilities

Manage cloud infrastructure using Infrastructure as Code.
Manage and scale a Ruby on Rails stack.
Implement monitoring tools to improve observability.
Perform code review of new features to ensure they meet performance requirements.
Debug production issues across all levels of the stack.
Plan for the growth of, optimize, and automate Fleetio’s Infrastructure.

🪄 Skills: AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices
