Site Reliability Engineer

Posted about 10 hours ago

💎 Seniority level: Senior, 5+ years

🔍 Industry: Web3

🏢 Company: Syndica | 👥 1-10 | 💰 $8,000,000 Seed over 3 years ago | Blockchain, Infrastructure, Web3, Web Development

🗣️ Languages: English

⏳ Experience: 5+ years

Requirements:
  • 5+ years of experience in a DevOps or SRE role
  • Proficiency in scripting languages (Python, Shell)
  • Experience with Kubernetes
  • Experience deploying large-scale systems reliably
  • Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.)
  • Experience with at least one modern programming language (Go, Rust, TypeScript, etc.)
  • Experience with at least one major cloud platform (AWS, Azure, or GCP)
Responsibilities:
  • Administer overall site availability, security, latency, and system health.
  • Effective provisioning, installation/configuration, operation, and maintenance of services and system software and related infrastructure.
  • Develop comprehensive monitoring solutions to provide full visibility to the different system components using tools like Kubernetes, Prometheus, Grafana, ELK, Datadog, New Relic, etc.
  • Enable the development team to release code quickly and reliably by ensuring full observability of systems and automated detection of performance and integration issues.
  • Formulate technical performance measures and implement them using queries, logs, code instrumentation and other analytics tools.
  • Design dashboards and visualizations that effectively convey technical measures
  • Troubleshoot issues at multiple layers of the deployment, from hardware to operating environment, network, and application; conduct root cause analysis and make recommendations from your findings.
  • Work with development teams to ensure best practices for scalability, reliability, and security are designed and implemented from the start.
  • Forecast changes in demand and capacity to establish appropriate scalability plans and drive decisions on the right-sizing of servers, storage and other resources.
  • Design and perform high-throughput stress testing to determine system capacity limits and identify points of failure.
  • Troubleshoot critical customer issues related to Syndica's RPC, APIs, and App Deployments.

Related Jobs

📍 Brazil

🧭 Full-Time

🔍 Software Development

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem
  • Solid knowledge of observability systems
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures)
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and your colleagues. Bring wellbeing to the forefront of work, and create a supportive environment where everyone feels comfortable taking care of themselves, taking time off, and finding work-life balance.

AWS, Cloud Computing, Git, Kubernetes, CI/CD, DevOps, Software Engineering

Posted about 22 hours ago

📍 Brazil

🧭 Full-Time

🔍 Software Development

  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem
  • Solid knowledge of observability systems
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures)
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and your colleagues. Bring wellbeing to the forefront of work, and create a supportive environment where everyone feels comfortable taking care of themselves, taking time off, and finding work-life balance.

AWS, Docker, Cloud Computing, Git, Kafka, Kubernetes, Go, Grafana, Prometheus, REST API, CI/CD, Terraform, Microservices, Ansible, Software Engineering

Posted about 22 hours ago

🔍 Software Development

🏢 Company: CloudLinux

  • Strong background in development
  • Proven experience as a leading SRE or in a similar role, with a strong focus on Linux environments.
  • Proficiency in modern agile SDLC practices and principles, orchestration, and CI/CD tooling, e.g. Python, Java, Terraform, Ansible, CloudFormation, Puppet, Chef, or similar.
  • Knowledge of the Grafana ecosystem or similar, building dashboards, alert rules, PromQL, as well as frontend observability.
  • Excellent technical knowledge of IT Infrastructure, including network and application load balancers, switches, routers, and IP addressing.
  • Strong analytical and problem-solving skills with a focus on root cause analysis and mitigation.
  • Excellent communication and teamwork skills with the ability to collaborate effectively across engineering teams.
  • Design, implement, and manage scalable, resilient, and secure company-wide repository infrastructure for CloudLinux products as a first assignment.
  • Automate software operations for re-usability and consistency across private and public clouds, taking into consideration the complexities of distributed systems.
  • Monitor system performance and troubleshoot issues proactively to ensure optimal uptime and reliability.
  • Automate deployment processes using Infrastructure as Code (IaC) principles.
  • Share your experience, know-how, and best practices with other team members in design sessions, system architecture discussions, mentorship, and "doing work together".
Posted 1 day ago

📍 United Kingdom

🔍 Software Development

🏢 Company: StarRez | 👥 251-500 | 💰 Private about 3 years ago | Consulting, SaaS, Property Management, Software

  • 1+ years experience working on a SaaS platform
  • Proven experience (2+ Years) in a Platform Engineering, Site Reliability Engineering or Software Engineering role.
  • Proficiency in at least one object-oriented programming language (C# preferred)
  • Production experience operating containerization technologies (Kubernetes).
  • Proficiency with one or more public cloud providers such as Azure, AWS or GCP
  • Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
  • Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
  • Experience with monitoring, observability and logging tools such as DataDog, Prometheus, Grafana, or similar.
  • Proven track record of maintaining highly-available and performant production environments.
  • Ability to identify and implement effective mitigation strategies and operational playbooks.
  • Provide technical leadership and mentoring within the team through knowledge sharing sessions, pair programming, code reviews and solution design
  • Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
  • Implement and maintain monitoring/alerting/logging systems to identify and respond to incidents
  • Conduct/participate in Root Cause Analyses (RCAs) and blameless post-mortems
  • Participate in on-call rotations to ensure system reliability and rapid incident response.
  • Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth
  • Conduct performance tests to identify and remediate bottlenecks
  • Develop and maintain platform solutions, automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
  • Monitor, review and tune databases to ensure high availability and performance
  • Collaborate with product engineering teams to design/build fit-for-purpose and observable software
  • Contribute and collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs) as required

AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

Posted 1 day ago

🧭 Full-Time

🔍 Software Development

🏢 Company: Global InfoTek, Inc.

  • Bachelor's degree in Computer Science, Mathematics, or an equivalent technical field; or equivalent industry experience.
  • Three-plus (3+) years of experience developing production software leveraging modern languages (including: Java, Python, Go, NodeJS, etc.)
  • One-plus (1+) years of experience developing containerized services deployed in production on orchestration platforms such as Kubernetes, Mesos, Swarm, etc.
  • Three-plus (3+) years of experience with agile and lean software development philosophies.
  • One-plus (1+) years of experience working with relational and/or non-relational databases e.g. PostgreSQL, MySQL, MongoDB, Elasticsearch etc.
  • Two-plus (2+) years of demonstrated experience with modern version control systems such as Git, Subversion, Mercurial, etc.
  • Five-plus (5+) years of experience building and maintaining Kubernetes clusters across hybrid-cloud infrastructure
  • Eight-plus (8+) years of experience working in Operations, DevOps, or Site Reliability Engineering
  • Five-plus (5+) years of configuration/package management experience using tools like Terraform, Helm, etc.
  • Five-plus (5+) years of experience with cloud service monitoring tools like Prometheus, Grafana, FluentD, ElasticStack, SumoLogic, etc.
  • Exceptionally proficient (knowledge and work experience) in Linux system administration
  • Ability to assist with GitLab CI pipelines (build/promote artifacts and security scans)
  • Experience creating automation using APIs from Azure or Google Cloud
  • Build and maintain infrastructure as code on large scale multi-site deployments
  • Evaluate and assess new ways to scale platform capabilities
  • Automate workflows to help push the limit of the infrastructure and enable continuous delivery of capabilities onto a hybrid infrastructure
  • Troubleshoot issues until root causes are understood on high traffic production systems
  • Participate in design and code review processes
  • Interact with product owners to coordinate infrastructure changes
  • Be responsible for identifying bottlenecks and improving performance of the platform
Posted 2 days ago

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D almost 3 years ago | Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise scale continuous delivery environments
  • 5+ years of experience with a DevOps or SRE job title
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
  • Background in Linux Systems Engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, Blameless, etc.
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

Posted 2 days ago

📍 Canada

🔍 Software Development

🏢 Company: Jobgether | 👥 11-50 | 💰 $1,493,585 Seed about 2 years ago | Internet

  • 3+ years of experience in SRE, DevOps, Cloud engineering, or software development in a full-stack environment.
  • Strong expertise with AWS services (EC2, S3, Lambda, etc.) and cloud infrastructure best practices.
  • Hands-on experience with containerization and orchestration tools like Kubernetes and ECS.
  • Proficient in infrastructure-as-code (IaC) tools like Terraform, CloudFormation, or CDK.
  • Strong knowledge of CI/CD pipelines and the ability to improve deployment speed and security.
  • Excellent problem-solving skills with experience in debugging infrastructure or application issues.
  • Bachelor's or Master's degree in Computer Science or related field, or equivalent experience.
  • Develop and manage secure, scalable, and reliable cloud infrastructure to ensure optimal performance and availability.
  • Automate cloud infrastructure using tools like Terraform, CloudFormation, and CDK to streamline deployments.
  • Optimize cloud resources and contribute to system observability strategies to reduce downtime and improve system resilience.
  • Collaborate with cross-functional teams to design and implement new platform components leveraging infrastructure or SaaS services.
  • Participate in a low-volume on-call rotation to ensure system uptime and availability.
  • Continuously monitor and improve systems with a focus on performance, cost-efficiency, and security.

AWS, Cloud Computing, Kubernetes, CI/CD, DevOps, Terraform

Posted 3 days ago

🧭 Full-Time

🔍 Software Development

🏢 Company: Kraken | 👥 1001-5000 | 💰 Secondary Market about 1 year ago | 🫂 Last layoff 4 months ago | Ethereum, Blockchain, Bitcoin, FinTech, Trading Platform

  • 5+ years working as a Site Reliability Engineer, Infrastructure Engineer, or similar roles, with a focus on data infrastructure and security.
  • Experience with real-time data processing technologies, such as Kafka and Debezium
  • Working experience managing hybrid systems, particularly AWS (HashiCorp is a nice-to-have).
  • Infrastructure as Code tools such as Terraform, Terragrunt and Atlantis
  • Experience with containerization and orchestration tools, particularly Kubernetes and Docker
  • Solid understanding of bash/shell scripting and proficiency in at least one programming language (preferably Python or Rust).
  • Familiarity with CI/CD deployment pipelines and related tools.
  • Strong problem-solving skills and the ability to troubleshoot complex systems.
  • Design the data governance mechanisms that ensure our lakehouse is easy to interact with, secure and in compliance with all applicable regulations.
  • Implement the infrastructure we use to ingest our data, store it, catalog it with the right metadata and capture its lineage.
  • Provide a state-of-the-art suite of BI tools for multiple teams within the company.
  • Guarantee the availability, high performance, scalability and cost efficiency of our data platform.
  • Implement data infrastructure solutions (self service) that support the needs of 10+ business units and over 100 engineers and data analysts
  • Utilize Infrastructure as Code (IaC) principles to design, provision, and manage both on-premises and cloud (AWS) infrastructure components using tools such as Terraform
  • Develop and maintain automation scripts using bash/shell scripting to automate operational tasks and deployments.
  • Enhance and manage CI/CD pipelines to facilitate consistent software deployments across the data infrastructure.
  • Implement robust data monitoring and alerting solutions to proactively detect anomalies and performance issues.
  • Manage and implement role-based access control (RBAC) and permissions for a multitude of user groups and machine workflows across different environments
  • Manage and maintain real-time streaming data architecture using technologies like Kafka and Debezium Change Data Capture (CDC).
  • Ensure the timely and accurate processing of streaming data, enabling data analysts and engineers to gain insights from up-to-date information.
  • Utilize Kubernetes to manage containerized applications within the data infrastructure, ensuring efficient deployment, scaling, and orchestration.
  • Implement effective incident response procedures and participate in on-call rotations.
  • Collaborate with data analysts, engineers, and cross-functional teams to understand requirements and implement appropriate solutions.
  • Document architecture, processes, and best practices to enable knowledge sharing and support continuous improvement.
  • Support AI/ML teams with their infra requests
Posted 5 days ago

📍 New Zealand

🔍 Software Development

🏢 Company: Datacom | 👥 5001-10000 | 💰 $5,900,000 Series B over 21 years ago | Developer Tools, Information Services, Bookkeeping and Payroll, Information Technology, Cyber Security, Software

  • 5+ years in Site Reliability Engineering, DevOps, or a related field, preferably within a SaaS environment or fintech/HR space.
  • Deep understanding of cloud platforms (Azure preferred)
  • Proficiency in scripting languages and a strong grasp of automation tools
  • Hands-on experience with CI/CD pipelines and monitoring solutions to ensure system health and performance.
  • Excellent analytical skills, a proactive mindset, and the ability to communicate clearly with technical and non-technical teams alike.
  • A collaborative spirit with leadership qualities, eager to mentor peers and drive innovation in a fast-paced, evolving environment.
  • Design, implement, and maintain a robust, scalable infrastructure using cloud-native technologies and infrastructure-as-code practices.
  • Develop and optimise monitoring, logging, and alerting systems to proactively detect issues and ensure high availability.
  • Lead incident response, conduct thorough root cause analyses, and drive post-mortem reviews to prevent future disruptions.
  • Work closely with development, security, and operations teams to align reliability goals with feature development and business objectives.
  • Optimise capacity planning and system performance to support a growing user base, ensuring a seamless experience even under peak loads.
  • Champion continuous improvement initiatives, automation best practices, and a culture of operational excellence across the organisation.
Posted 6 days ago

📍 United States, UK, Philippines, Poland, South Africa

🧭 Permanent

🔍 FinTech

🏢 Company: Zepz | 👥 1001-5000 | 💰 $267,000,000 Series F 5 months ago | 🫂 Last layoff over 1 year ago | Mobile Payments, Financial Services, Payments, FinTech

  • At least 5 years in an SRE, DevOps, or engineering role, with a keen interest in solving problems using automation.
  • Understand SRE and DevOps methodologies.
  • Experience with Grafana, Loki and Prometheus.
  • Experience supporting or developing applications written in Java, Python or node.js.
  • You should have an understanding of how to analyze, and troubleshoot large-scale distributed systems.
  • Our Cloud Native platform is hosted on AWS.
  • Use code to solve problems.
  • Apply best practices and standards for observability, monitoring, alerting, capacity planning, availability, performance/latency, change, and troubleshooting across all our Tech services.
  • Work closely with feature teams to ensure that services are correctly monitored, change is delivered in a safe and secure way, resilience is built into our product, and our standards and best practices are adopted.
  • Lead or be involved in the troubleshooting of complex incidents and problems.
  • Have visibility of the end-to-end service to our customers and ensure their journey is stable and consistent across all the microservices and third-party dependencies, using the observability tool you will have implemented with the Engineering teams.
  • Help the team meet its strategic goals: maintain the highest level of observability, maximize developer velocity while keeping our product reliable, and ensure that we can deliver the highest-quality experience to our customers.
  • Growing together. You'll review others' work and happily seek feedback on yours to ensure we build a better codebase and sharpen each other's skills.

AWS, Node.js, Python, SQL, Agile, Bash, Cloud Computing, Git, Java, Kafka, Kubernetes, ActiveMQ, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, JSON, Ansible, Scripting

Posted 7 days ago
