Site Reliability Engineer (Remote)

Posted 11 days ago

💎 Seniority level: Senior, 6+ years

🔍 Industry: FinTech

🏢 Company: leadtech

🗣️ Languages: English

⏳ Experience: 6+ years

🪄 Skills: AWS, Docker, PostgreSQL, Python, Agile, ElasticSearch, Jenkins, MySQL, Oracle, Groovy, Redis, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Compliance, Networking

Requirements:
  • Minimum 6 years of experience in systems engineering, preferably in web systems with a leadership focus on operational matters.
  • Experience in the payments, banking, or FinTech industry, with a solid understanding of industry-specific challenges and requirements.
  • Knowledge of PCI DSS (Payment Card Industry Data Security Standard) compliance and best practices for securing payment systems.
  • Strong experience with Infrastructure as Code (IaC) tools, particularly Terraform.
  • Advanced expertise in Jenkins for pipeline creation and maintenance.
  • Extensive hands-on experience with AWS services: Lambda, CloudFront, API Gateway, S3.
  • Proficiency in scripting languages: Shell, Groovy, Python, and PowerShell.
  • Strong background in administering UNIX/Linux/BSD systems.
  • Solid understanding of Cloud Computing Services like EC2, Docker, and ECS.
  • Solid understanding of networking concepts: VPC, subnets, security groups, routing, load balancing, and firewalls in cloud environments.
  • Proven experience implementing and managing network security in AWS, including WAF, NACLs, and VPN configurations.
  • Strong knowledge of IAM roles and policies, as well as security best practices in cloud-native architectures.
  • Familiarity with security monitoring tools and processes, such as vulnerability scanning, intrusion detection systems (IDS), and threat mitigation strategies.
  • Hands-on experience in designing systems with a zero-trust architecture mindset.
Responsibilities:
  • Define and drive end-to-end Agile DevOps software development lifecycle with automation at its core.
  • Establish and enforce software architecture patterns with a focus on high availability (HA), scalability, security, monitoring, and configuration of AWS services.
  • Provide architecture guidance and design patterns for PaaS and distributed systems development following the Twelve-Factor App methodology.
  • Coach and train development teams to improve automation, service modularity, and code reusability, and to implement best practices in testing, deployment, and change management.
  • Optimize non-production environments (development, testing, integration, and pre-production) to improve agility, efficiency, and team productivity.
  • Contribute to product definition by sharing technical vision and enabling progress towards Continuous Delivery.
  • Troubleshoot and resolve complex issues across development, testing, and production environments.

Related Jobs

📍 Germany, Spain, Portugal

🏢 Company: Jobgether · 👥 11-50 · 💰 $1,493,585 Seed, about 2 years ago · Internet

  • 5+ years of experience in a Site Reliability Engineer or similar role.
  • 3+ years of experience with AWS services and container orchestration tools.
  • 2+ years of Kubernetes experience.
  • Strong knowledge of observability tools and principles (monitoring, logging, tracing).
  • Hands-on experience with Terraform for infrastructure as code.
  • Proficiency in at least one programming language (e.g., Python, Go, Java).
  • Experience in incident management, postmortem analysis, and risk mitigation.
  • Familiarity with messaging systems like SNS, SQS, and experience with CI/CD tools.
  • Develop and maintain systems that are reliable, scalable, and efficient.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure optimal system performance.
  • Conduct blameless post-incident reviews, identify root causes, and implement preventive actions.
  • Automate operational tasks, incident responses, and contribute to system performance optimizations.
  • Work with engineering teams to ensure systems are designed for reliability, scalability, and maintainability.
  • Continuously evaluate and improve system performance, capacity, and cost efficiency.
  • Participate in the on-call rotation, providing troubleshooting and resolution support for critical issues.

🪄 Skills: AWS, Python, Java, Kubernetes, Go, CI/CD, RESTful APIs, Linux, Terraform, Scripting

Posted about 12 hours ago

📍 United States

🧭 Full-Time

💸 130,000 - 165,000 USD per year

🔍 Software Development

🏢 Company: KnowBe4 · 👥 1001-5000 · 💰 $300,000,000 Post-IPO Equity, almost 2 years ago · Computer, Security, Cyber Security, Network Security, Software

  • BS/MS/Ph.D. or equivalent, plus 5 years of experience.
  • Proficient in authoring scripts in one or more programming languages (e.g., Python, Ruby, JavaScript).
  • Experience designing and operating high-scale patterns in AWS
  • Experience designing and building repeatable workflows for continuous integration and continuous deployment (CI/CD); GitLab is preferred
  • Excellent communication skills
  • Able to manage your time effectively across competing projects
  • Ability to quickly understand and debug complex distributed systems
  • Work with other Site Reliability Engineers to build highly scalable and resilient applications and infrastructure in AWS
  • Maintain and improve extensible infrastructure-as-code using Terraform
  • Learn, maintain, and improve our existing deployment strategies
  • Deliver effective observability, monitoring, and alerting patterns for KnowBe4’s applications and infrastructure
  • Act as an escalation point for identifying and resolving the root cause of production incidents
  • Assist in designing globally distributed systems and processes for the organization
  • Identify deficiencies in our current applications and infrastructure and correct them when found
  • Define new approaches and tailored solutions to complex technical problems
  • Act as a project leader with other Site Reliability Engineers and ensure progress is communicated effectively to project stakeholders

🪄 Skills: AWS, Docker, Python, SQL, AWS EKS, Cloud Computing, DynamoDB, Kubernetes, Algorithms, Data Structures, REST API, Rust, CI/CD, Problem Solving, Linux, DevOps, Terraform, Microservices, Excellent communication skills, Scripting, Debugging

Posted 13 days ago

📍 United States

🧭 Full-Time

💸 185,000 - 200,000 USD per year

🔍 Software Development

  • Linux System Administration
  • Experience supporting production environments running Ruby on Rails applications.
  • Proficient with cloud platforms such as AWS, GCP, or Azure.
  • Experience with EC2, RDS, VPCs, and security groups is essential.
  • Ansible or equivalent experience for managing large fleets of EC2 or similar servers.
  • Expert in using Terraform for infrastructure as code.
  • Strong experience with Kubernetes and Docker, including deployment, scaling, and management of containerized applications.
  • Extensive experience with monitoring and observability tools like Datadog, Prometheus, Grafana, ELK stack, or Splunk.
  • Ability to work with other Engineering team members on troubleshooting, support, and projects for both production and lower-level environments.
  • Deep understanding of DevOps principles, practices, and tools to drive continuous improvement in the software development lifecycle.
  • Support our EC2 infrastructure to ensure it’s properly configured, reliable, and monitored, while also helping us modernize it towards more automation and containerization.
  • Build and maintain our Ansible (and legacy Puppet) configuration management, while helping us increase our automation and reduce toil.
  • Deploy, manage, and optimize Kubernetes clusters and containerized applications using Docker.
  • Implement best practices for container orchestration and management.
  • Develop and maintain comprehensive monitoring and observability solutions using Datadog.
  • Create, enhance, and maintain continuous integration and continuous deployment pipelines using GitLab CI.
  • Implement security best practices and ensure compliance with industry standards.
  • Work closely with development teams to ensure reliability and scalability of new features and services.
  • Provide technical support and guidance on infrastructure-related issues.
  • Participate in an on-call rotation to address production issues and collaborate in incident response efforts.

🪄 Skills: AWS, Docker, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Linux, DevOps, Terraform, Microservices, Ansible

Posted 15 days ago

📍 Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Jobgether · 👥 11-50 · 💰 $1,493,585 Seed, about 2 years ago · Internet

  • 4+ years of experience in Site Reliability Engineering or a similar role with a strong focus on cloud infrastructure.
  • Expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt.
  • Deep knowledge of AWS cloud services and best practices for scalable and secure architectures.
  • Hands-on experience with Confluent Cloud and Kafka for distributed data streaming.
  • Strong experience with Redis for caching and RDS for data storage.
  • Proficiency with OpenSearch/ElasticSearch/ChaosSearch for search and analytics.
  • Advanced knowledge of monitoring tools like Prometheus, Grafana, Alertmanager, and OpsGenie.
  • Experience with LaunchDarkly for feature flag management.
  • Extensive experience managing Kubernetes clusters, including Helm for package management, ArgoCD for deployments, and Istio for service mesh configurations.
  • Familiarity with Kustomize for Kubernetes resource configuration.
  • Strong problem-solving skills and ability to troubleshoot complex systems in production environments.
  • Excellent communication and collaboration skills within agile teams.
  • Design, build, and maintain highly scalable cloud infrastructure using Terraform and Terragrunt for automated resource provisioning.
  • Manage and optimize AWS cloud environments, ensuring security, cost efficiency, and high availability.
  • Oversee data streaming platforms using Confluent Cloud and Kafka, ensuring reliable data pipelines.
  • Deploy and manage Redis instances for caching and real-time data processing.
  • Implement and maintain monitoring and alerting solutions using Prometheus, Grafana, Alertmanager, and OpsGenie.
  • Enable feature flag management and controlled rollouts with LaunchDarkly.
  • Manage Kubernetes clusters, utilizing Helm, ArgoCD, Istio, and Kustomize for continuous deployment and infrastructure-as-code practices.
  • Collaborate with development teams to integrate new services into the infrastructure seamlessly.
  • Troubleshoot complex system issues to maintain high availability and performance.
  • Continuously improve automation tools, processes, and methodologies to enhance system scalability.

🪄 Skills: AWS, ElasticSearch, Kafka, Kubernetes, Grafana, Prometheus, Redis, CI/CD, Terraform

Posted 21 days ago

📍 United Kingdom

🔍 Cybersecurity

🏢 Company: KnowBe4 · 👥 1001-5000 · 💰 $300,000,000 Post-IPO Equity, almost 2 years ago · Computer, Security, Cyber Security, Network Security, Software

  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • 5+ years equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
  • Expertise in designing and maintaining automated CI/CD workflows.
  • Strong knowledge of AWS and Azure services.
  • Proficiency in Infrastructure-as-Code tools like Terraform or Ansible.
  • Experience with observability platforms like Prometheus, Grafana, or Datadog.
  • Proficiency in scripting languages like Python or Bash.
  • Ability to lead incident response efforts and conduct root cause analysis.
  • Strong interpersonal skills for collaboration across teams.
  • Manage and maintain environments to ensure high availability and security.
  • Design and implement CI/CD pipelines to automate software delivery.
  • Monitor and troubleshoot system performance issues using observability tools.
  • Collaborate with development teams to align infrastructure efforts with project needs.
  • Build and maintain infrastructure as code solutions using tools like Terraform.
  • Manage AWS/Azure services, including ECS/Container Apps and storage solutions.
  • Participate in incident response and conduct post-incident reviews.
  • Automate manual tasks to improve operational efficiency.

🪄 Skills: AWS, Python, Bash, Azure, Grafana, Prometheus, CI/CD, Terraform

Posted 3 months ago

📍 United Kingdom

🧭 Full-Time

🔍 Cybersecurity

🏢 Company: KnowBe4 · 👥 1001-5000 · 💰 $300,000,000 Post-IPO Equity, almost 2 years ago · Computer, Security, Cyber Security, Network Security, Software

  • Bachelor’s degree in Computer Science, Information Technology, or related field.
  • 3+ years equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
  • Expertise in designing and maintaining automated pipelines for continuous delivery.
  • Strong knowledge of AWS/Azure services.
  • Proficiency in Terraform, Ansible, or similar tools.
  • Experience with observability platforms.
  • Proficiency in Python, Bash, or other scripting languages.
  • Ability to lead incident response efforts and conduct root cause analysis.
  • Strong interpersonal skills to work effectively across teams and stakeholders.
  • Manage and maintain environments to ensure high availability and security.
  • Design and implement CI/CD pipelines to automate software delivery.
  • Monitor and troubleshoot system performance issues using observability tools.
  • Collaborate with development teams to align infrastructure with project needs.
  • Build and maintain infrastructure as code solutions.
  • Manage AWS/Azure services.
  • Participate in incident response and conduct root cause analysis.
  • Automate manual tasks to improve operational efficiency.

🪄 Skills: AWS, Python, Bash, Azure, Grafana, Prometheus, CI/CD, Terraform

Posted 3 months ago

📍 Scotland

🧭 Full-Time

🔍 Technology

🏢 Company: Ivanti · 👥 1001-5000 · 💰 Private, about 4 years ago · IT Infrastructure, IT Management, Software

  • BS or higher in Computer Science/Software Engineering or equivalent.
  • Enthusiastic self-starter with 2+ years industry experience.
  • Strong experience in designing, building, and managing fault-tolerant Linux and Windows-based web platforms in the Cloud (e.g., AWS, Azure, GCP).
  • Strong Kubernetes experience.
  • Strong Linux and Windows troubleshooting skills.
  • Strong experience with advanced networking, security, and programming.
  • Able to code in Java, Python, Ruby, advanced Shell Script, or other OOP languages.
  • Strong user of CI/CD tools like Azure DevOps, Chef, Ansible, Jenkins, GitHub Actions.
  • Experience with collaborative source control tools like Git or Subversion.
  • Ability to configure HAProxy, Apache, Nginx, IIS.
  • Ability to configure monitoring tools like Datadog, New Relic, Splunk.
  • Familiar with databases like SQL Server, PostgreSQL, Redis, Kafka, MongoDB, Elasticsearch.
  • Strong understanding of DevOps practices and OOP principles.
  • Experience contributing to application/development projects.
  • Deploying, managing, and securing Ivanti’s customer-facing Software-as-a-Service (SaaS) web products in AWS and Azure.
  • Troubleshooting all infrastructure and application issues.
  • Working with geographically dispersed teams to solve problems.
  • Automating common and repetitive tasks.
  • Writing documentation and training material.
  • Training other colleagues.
  • Standing up new services in existing data centers.
  • Participating in an on-call rotation for 24/7 coverage.

🪄 Skills: AWS, Leadership, Python, GCP, Java, Jenkins, Kubernetes, Ruby, Azure, Collaboration, CI/CD, Linux, Documentation

Posted 3 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

  • Understand that DevOps goes beyond automating infrastructure, and focus on improving the working environment and workflows of the department and beyond.
  • Build and support infrastructure for our diverse environment, including customer-facing applications, large-scale data processing, and APIs.

🪄 Skills: AWS, Docker, Kubernetes, CI/CD, RESTful APIs, Linux, Terraform, Microservices

Posted 4 months ago