Site Reliability Engineer

Posted about 10 hours ago

💎 Seniority level: Senior, 5+ years

🔍 Industry: Web3

🏢 Company: Syndica 👥 1-10 💰 $8,000,000 Seed over 3 years ago · Blockchain Infrastructure Web3 Web Development

🗣️ Languages: English

⏳ Experience: 5+ years

Requirements:

5+ years of experience in a DevOps or SRE role

Proficiency in scripting languages (Python, Shell)

Experience with Kubernetes

Experience deploying large-scale systems reliably

Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.)

Experience with at least one modern programming language (Go, Rust, TypeScript, etc.)

Experience with at least one major cloud platform (AWS, Azure, or GCP)

Responsibilities:

Administer overall site availability, security, latency, and system health.

Provision, install, configure, operate, and maintain services, system software, and related infrastructure effectively.

Develop comprehensive monitoring solutions that provide full visibility into the different system components, using tools like Kubernetes, Prometheus, Grafana, ELK, Datadog, New Relic, etc.

Enable the development team to release code quickly and reliably by ensuring full observability of systems and automated detection of performance and integration issues.

Formulate technical performance measures and implement them using queries, logs, code instrumentation and other analytics tools.

Design dashboards and visualizations that effectively convey technical measures.

Troubleshoot issues at multiple layers of the deployment, from hardware and operating environment to network and application; conduct root cause analysis and make recommendations based on your findings.

Work with development teams to ensure best practices for scalability, reliability, and security are designed and implemented from the start.

Forecast changes in demand and capacity to establish appropriate scalability plans and drive decisions on the right-sizing of servers, storage and other resources.

Design and perform high-throughput stress testing to determine system capacity limits and identify points of failure.

Troubleshoot critical customer issues related to Syndica’s RPC, APIs, and App Deployments.

Related Jobs

🔥 Lead Site Reliability Engineer

Posted about 22 hours ago

📍 Brazil

🧭 Full-Time

🔍 Software Development

🔧 Requirements

Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
Deep knowledge of Kubernetes and its ecosystem.
Solid knowledge of observability systems.
Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
Ability to write software for production environments.
Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
CNCF Kubernetes certifications (e.g. CKA, CKS, or CKAD).
AWS certifications.

💡 Responsibilities

Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
Improve observability, reliability, and cost awareness.
Support engineering teams in using the products and tools.
Build and maintain a modern CI/CD set of tools and services.
Keep all the Kubernetes clusters highly available and reliable.
Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures)
Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues. Bring well-being to the forefront of work, and create a supportive environment where everyone feels comfortable taking care of themselves, taking time off, and finding work-life balance.

AWS, Cloud Computing, Git, Kubernetes, CI/CD, DevOps, Software Engineering

🔥 Lead Site Reliability Engineer

Posted about 22 hours ago

📍 Brazil

🧭 Full-Time

🔍 Software Development

🔧 Requirements

Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
Deep knowledge of Kubernetes and its ecosystem.
Solid knowledge of observability systems.
Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
Ability to write software for production environments.
Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
Collaborative, learning-driven mindset.
CNCF Kubernetes certifications (e.g. CKA, CKS, or CKAD).
AWS certifications.

💡 Responsibilities

Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
Improve observability, reliability, and cost awareness.
Support engineering teams in using the products and tools.
Build and maintain a modern CI/CD set of tools and services.
Keep all the Kubernetes clusters highly available and reliable.
Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures)
Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues. Bring well-being to the forefront of work, and create a supportive environment where everyone feels comfortable taking care of themselves, taking time off, and finding work-life balance.

AWS, Docker, Cloud Computing, Git, Kafka, Kubernetes, Go, Grafana, Prometheus, REST API, CI/CD, Terraform, Microservices, Ansible, Software Engineering

🔥 Senior Site Reliability Engineer (SRE) for Release Engineering (remote-only)

Posted 1 day ago

🔍 Software Development

🏢 Company: CloudLinux

🔧 Requirements

Strong background in development.
Proven experience as a leading SRE or in a similar role, with a strong focus on Linux environments.
Proficiency in modern agile SDLC practices and principles, orchestration, and CI/CD tooling, e.g. Python, Java, Terraform, Ansible, CloudFormation, Puppet, Chef, or similar.
Knowledge of the Grafana ecosystem or similar, building dashboards, alert rules, PromQL, as well as frontend observability.
Excellent technical knowledge of IT Infrastructure, including network and application load balancers, switches, routers, and IP addressing.
Strong analytical and problem-solving skills with a focus on root cause analysis and mitigation.
Excellent communication and teamwork skills with the ability to collaborate effectively across engineering teams.

💡 Responsibilities

Design, implement, and manage scalable, resilient, and secure company-wide repository infrastructure for CloudLinux products as a first assignment.
Automate software operations for re-usability and consistency across private and public clouds, taking into consideration the complexities of distributed systems.
Monitor system performance and troubleshoot issues proactively to ensure optimal uptime and reliability.
Automate deployment processes using Infrastructure as Code (IaC) principles.
Share your experience, know-how, and best practices with other team members in design sessions, system architecture discussions, mentorship, and "doing work together".

🔥 Senior Site Reliability Engineer

Posted 1 day ago

📍 United Kingdom

🔍 Software Development

🏢 Company: StarRez 👥 251-500 💰 Private about 3 years ago · Consulting SaaS Property Management Software

🔧 Requirements

1+ years of experience working on a SaaS platform.
Proven experience (2+ years) in a Platform Engineering, Site Reliability Engineering, or Software Engineering role.
Proficiency in at least one object-oriented programming language (C# preferred).
Production experience operating containerization technologies (Kubernetes).
Proficiency with one or more public cloud providers such as Azure, AWS, or GCP.
Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
Experience with monitoring, observability, and logging tools such as Datadog, Prometheus, Grafana, or similar.
Proven track record of maintaining highly-available and performant production environments.
Ability to identify and implement effective mitigation strategies and operational playbooks.

💡 Responsibilities

Provide technical leadership and mentoring within the team through knowledge sharing sessions, pair programming, code reviews and solution design
Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
Implement and maintain monitoring/alerting/logging systems to identify and respond to incidents
Conduct/participate in Root Cause Analyses (RCAs) and blameless post-mortems
Participate in on-call rotations to ensure system reliability and rapid incident response.
Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth
Conduct performance tests to identify and remediate bottlenecks
Develop and maintain platform solutions, automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
Monitor, review and tune databases to ensure high availability and performance
Collaborate with product engineering teams to design/build fit-for-purpose and observable software
Contribute and collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs) as required

AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

🔥 Principal Site Reliability Engineer (Remote) KRWFH 1584

Posted 2 days ago

🧭 Full-Time

🔍 Software Development

🏢 Company: Global InfoTek, Inc.

🔧 Requirements

Bachelor's degree in Computer Science, Mathematics, or an equivalent technical field; or equivalent industry experience.
Three-plus (3+) years of experience developing production software leveraging modern languages (including: Java, Python, Go, NodeJS, etc.)
One-plus (1+) years of experience developing containerized services deployed in production on orchestration platforms such as Kubernetes, Mesos, Swarm, etc.
Three-plus (3+) years of experience with agile and lean software development philosophies.
One-plus (1+) years of experience working with relational and/or non-relational databases e.g. PostgreSQL, MySQL, MongoDB, Elasticsearch etc.
Two-plus (2+) years of demonstrated experience with modern version control systems such as Git, Subversion, Mercurial, etc.
Five-plus (5+) years of experience building and maintaining Kubernetes clusters across hybrid-cloud infrastructure.
Eight-plus (8+) years of experience working in Operations, DevOps, or Site Reliability Engineering.
Five-plus (5+) years of configuration/package management experience using tools like Terraform, Helm, etc.
Five-plus (5+) years of experience with cloud service monitoring tools like Prometheus, Grafana, Fluentd, Elastic Stack, Sumo Logic, etc.
Exceptionally proficient (knowledge and work experience) in Linux system administration
Ability to assist with GitLab CI pipelines (build/promote artifacts and security scans)
Experience creating automation using APIs from Azure or Google Cloud

💡 Responsibilities

Build and maintain infrastructure as code on large scale multi-site deployments
Evaluate and assess new ways to scale platform capabilities
Automate workflows to help push the limit of the infrastructure and enable continuous delivery of capabilities onto a hybrid infrastructure
Troubleshoot issues until root causes are understood on high traffic production systems
Participate in design and code review processes
Interact with product owners to coordinate infrastructure changes
Be responsible for identifying bottlenecks and improving performance of the platform

🔥 Sr Site Reliability Engineer (SRE)

Posted 2 days ago

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl 👥 251-500 💰 $150,000,000 Series D almost 3 years ago · Real Time Big Data Information Technology Software

🔧 Requirements

Extensive experience with enterprise scale continuous delivery environments
5+ years of experience with a DevOps or SRE job title
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
Background in Linux Systems Engineering
Experience with incident response tools, for instance PagerDuty, FireHydrant, Blameless, etc.

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

🔥 Site Reliability Engineer, Customer Security - (Remote - Canada)

Posted 3 days ago

📍 Canada

🔍 Software Development

🏢 Company: Jobgether 👥 11-50 💰 $1,493,585 Seed about 2 years ago · Internet

🔧 Requirements

3+ years of experience in SRE, DevOps, Cloud engineering, or software development in a full-stack environment.
Strong expertise with AWS services (EC2, S3, Lambda, etc.) and cloud infrastructure best practices.
Hands-on experience with containerization and orchestration tools like Kubernetes and ECS.
Proficient in infrastructure-as-code (IaC) tools like Terraform, CloudFormation, or CDK.
Strong knowledge of CI/CD pipelines and the ability to improve deployment speed and security.
Excellent problem-solving skills with experience in debugging infrastructure or application issues.
Bachelor's or Master’s degree in Computer Science or related field, or equivalent experience.

💡 Responsibilities

Develop and manage secure, scalable, and reliable cloud infrastructure to ensure optimal performance and availability.
Automate cloud infrastructure using tools like Terraform, CloudFormation, and CDK to streamline deployments.
Optimize cloud resources and contribute to system observability strategies to reduce downtime and improve system resilience.
Collaborate with cross-functional teams to design and implement new platform components leveraging infrastructure or SaaS services.
Participate in a low-volume on-call rotation to ensure system uptime and availability.
Continuously monitor and improve systems with a focus on performance, cost-efficiency, and security.

AWS, Cloud Computing, Kubernetes, CI/CD, DevOps, Terraform

🔥 Site Reliability Engineer - Data Platform

Posted 5 days ago

🧭 Full-Time

🔍 Software Development

🏢 Company: Kraken 👥 1001-5000 💰 Secondary Market about 1 year ago 🫂 Last layoff 4 months ago · Ethereum Blockchain Bitcoin FinTech Trading Platform

🔧 Requirements

5+ years working as a Site Reliability Engineer, Infrastructure Engineer, or similar roles, with a focus on data infrastructure and security.
Experience with real-time data processing technologies, such as Kafka and Debezium
Working experience managing hybrid systems, particularly AWS (HashiCorp is a nice-to-have).
Experience with Infrastructure as Code tools such as Terraform, Terragrunt, and Atlantis.
Experience with containerization and orchestration tools, particularly Kubernetes and Docker
Solid understanding of bash/shell scripting and proficiency in at least one programming language (preferably Python or Rust).
Familiarity with CI/CD deployment pipelines and related tools.
Strong problem-solving skills and the ability to troubleshoot complex systems.

💡 Responsibilities

Design the data governance mechanisms that ensure our lakehouse is easy to interact with, secure and in compliance with all applicable regulations.
Implement the infrastructure we use to ingest our data, store it, catalog it with the right metadata and capture its lineage.
Provide a state-of-the-art suite of BI tools for multiple teams within the company.
Guarantee the availability, high performance, scalability and cost efficiency of our data platform.
Implement self-service data infrastructure solutions that support the needs of 10+ business units and over 100 engineers and data analysts.
Utilize Infrastructure as Code (IaC) principles to design, provision, and manage both on-premises and cloud (AWS) infrastructure components using tools such as Terraform.
Develop and maintain automation scripts using bash/shell scripting to automate operational tasks and deployments.
Enhance and manage CI/CD pipelines to facilitate consistent software deployments across the data infrastructure.
Implement robust data monitoring and alerting solutions to proactively detect anomalies and performance issues.
Manage and implement role-based access control (RBAC) and permissions for a multitude of user groups and machine workflows across different environments
Manage and maintain real-time streaming data architecture using technologies like Kafka and Debezium Change Data Capture (CDC).
Ensure the timely and accurate processing of streaming data, enabling data analysts and engineers to gain insights from up-to-date information.
Utilize Kubernetes to manage containerized applications within the data infrastructure, ensuring efficient deployment, scaling, and orchestration.
Implement effective incident response procedures and participate in on-call rotations.
Collaborate with data analysts, engineers, and cross-functional teams to understand requirements and implement appropriate solutions.
Document architecture, processes, and best practices to enable knowledge sharing and support continuous improvement.
Support AI/ML teams with their infrastructure requests.

🔥 Senior Site Reliability Engineer - NZ

Posted 6 days ago

📍 New Zealand

🔍 Software Development

🏢 Company: Datacom 👥 5001-10000 💰 $5,900,000 Series B over 21 years ago · Developer Tools Information Services Bookkeeping and Payroll Information Technology Cyber Security Software

🔧 Requirements

5+ years in Site Reliability Engineering, DevOps, or a related field, preferably within a SaaS environment or fintech/HR space.
Deep understanding of cloud platforms (Azure preferred)
Proficiency in scripting languages and a strong grasp of automation tools
Hands-on experience with CI/CD pipelines and monitoring solutions to ensure system health and performance.
Excellent analytical skills, a proactive mindset, and the ability to communicate clearly with technical and non-technical teams alike.
A collaborative spirit with leadership qualities, eager to mentor peers and drive innovation in a fast-paced, evolving environment.

💡 Responsibilities

Design, implement, and maintain a robust, scalable infrastructure using cloud-native technologies and infrastructure-as-code practices.
Develop and optimise monitoring, logging, and alerting systems to proactively detect issues and ensure high availability.
Lead incident response, conduct thorough root cause analyses, and drive post-mortem reviews to prevent future disruptions.
Work closely with development, security, and operations teams to align reliability goals with feature development and business objectives.
Optimise capacity planning and system performance to support a growing user base, ensuring a seamless experience even under peak loads.
Champion continuous improvement initiatives, automation best practices, and a culture of operational excellence across the organisation.

🔥 Site Reliability Engineer

Posted 7 days ago

📍 United States, UK, Philippines, Poland, South Africa

🧭 Permanent

🔍 FinTech

🏢 Company: Zepz 👥 1001-5000 💰 $267,000,000 Series F 5 months ago 🫂 Last layoff over 1 year ago · Mobile Payments Financial Services Payments FinTech

🔧 Requirements

At least 5 years in an SRE, DevOps, or engineering role, with a keen interest in solving problems using automation.
Understanding of SRE and DevOps methodologies.
Experience with Grafana, Loki, and Prometheus.
Experience supporting or developing applications written in Java, Python, or Node.js.
Understanding of how to analyze and troubleshoot large-scale distributed systems.
Our Cloud Native platform is hosted on AWS.

💡 Responsibilities

Use code to solve problems.
Apply best practices and standards for observability, monitoring, alerting, capacity planning, availability, performance/latency, change, and troubleshooting across all our tech services.
Work closely with feature teams to ensure that services are correctly monitored, change is delivered in a safe and secure way, resilience is built into our product, and our standards and best practices are adopted.
Lead or be involved in the troubleshooting of complex incidents and problems.
Maintain visibility of the end-to-end service to our customers and ensure their journey is stable and consistent across all the microservices and third-party dependencies, using the observability tool you will have implemented with the Engineering teams.
Help the team meet its strategic goals: maintain the highest level of observability, maximize developer velocity while keeping our product reliable, and ensure that we can deliver the highest quality experience to our customers.
Growing together. You’ll review others' work and happily seek feedback on yours to ensure we build a better codebase and sharpen each other's skills.

AWS, Node.js, Python, SQL, Agile, Bash, Cloud Computing, Git, Java, Kafka, Kubernetes, ActiveMQ, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, JSON, Ansible, Scripting

Related Articles

Why is remote work such a nice opportunity?

Posted 13 days ago

Why is remote work so nice? Let's try to see!

Remote Job Certifications and Courses to Boost Your Career

Posted 7 months ago

Insights into the evolving landscape of remote work in 2024 reveal the importance of certifications and continuous learning. This article breaks down emerging trends, sought-after certifications, and provides practical solutions for enhancing your employability and expertise. What skills will be essential for remote job seekers, and how can you navigate this dynamic market to secure your dream role?

How to Balance Work and Life While Working Remotely

Posted 7 months ago

Explore the challenges and strategies of maintaining work-life balance while working remotely. Learn about unique aspects of remote work, associated challenges, historical context, and effective strategies to separate work and personal life.

Weekly Digest: Remote Jobs News and Trends (August 11 - August 18, 2024)

Posted 7 months ago

Google is gearing up to expand its remote job listings, promising more opportunities across various departments and regions. Find out how this move can benefit job seekers and impact the market.

How to Onboard Remote Employees Successfully

Posted 7 months ago

Learn about the importance of pre-onboarding preparation for remote employees, including checklist creation, documentation, tools and equipment setup, communication plans, and feedback strategies. Discover how proactive pre-onboarding can enhance job performance, increase retention rates, and foster a sense of belonging from day one.
