Senior Site Reliability Engineer

Posted over 1 year ago

🔍 Industry: Financial risk management

🗣️ Languages: English

🪄 Skills: Docker, Python, Kubernetes, C (Programming language)

Requirements:
  • A bachelor's degree in computer science, information systems, or the equivalent combination of education, experience, and training
  • Fluency in English, both written and spoken
  • 4+ years of experience with AWS or Azure
  • Experience with automation, infrastructure as code, Terraform, Ansible, runbooks, and troubleshooting guides
  • Experience with virtualization, container technologies, and orchestration (Docker, Kubernetes)
  • Programming skills (Go, Python, or similar languages)
  • Experience with CI/CD pipelines
  • Experience with monitoring, troubleshooting, and providing guidance during incidents
  • Self-driven and motivated, with a strong work ethic and a passion for problem-solving
Responsibilities:
  • Build and maintain tools for deployment, monitoring, operations, and analytics
  • Develop in Go, Python, or similar languages
  • Document and guide engineers through playbooks and troubleshooting guides
  • Contribute to application self-healing in a cloud-based environment
  • Leverage, configure, and troubleshoot cloud resources in AWS
  • Migrate and operate workloads in Kubernetes
  • Participate in incident response, root cause investigation, and resolution
  • Maintain and develop our infrastructure as code (IaC) to manage and operate end-to-end lifecycle operations (monitoring, alerting, security, cost optimization, configuration, backup, etc.) in production environments
  • Utilize your experience and problem-solving skills to help prevent and investigate production issues
  • Communicate with team members and stakeholders in a globally distributed and asynchronous environment
  • Investigate, describe, and drive improvements to the current infrastructure, promoting evolution and sharing knowledge among the team

Related Jobs

📍 USA, CAN, MEX

🔍 Transportation technology

🏒 Company: Fleetio

Requirements:
  • 5+ years of AWS experience.
  • 3+ years of Kubernetes experience.
  • Ruby on Rails experience.
  • Expert at profiling and benchmarking source code.
  • Effective at code review, and identifying potential performance problems before they reach production.
  • Experience with Datadog or other APM tools.
  • Excellent written and verbal communication skills.
Responsibilities:
  • Manage cloud infrastructure using Infrastructure as Code.
  • Manage and scale a Ruby on Rails stack.
  • Implement monitoring tools to improve observability.
  • Perform code review of new features to ensure they meet performance requirements.
  • Debug production issues across all levels of the stack.
  • Plan for the growth of, optimize, and automate Fleetio’s Infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

Posted 2 days ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote employment solutions

🏒 Company: Remote - Referral Board

Requirements:
  • Significant and demonstrated experience as a Senior Site Reliability Engineer.
  • Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
  • Knowledge of CI/CD tools, with a preference for GitLab CI.
  • Experience with a back-end programming language such as Elixir, Clojure, Java, Node.js, or Python.
  • Experience in a programming language used for developing SRE tooling, like Go or Python.
  • Experience running and configuring Linux systems in non-cloud environments.
  • Security knowledge from both defensive and offensive perspectives.
  • Excellent communication and interpersonal skills.
Responsibilities:
  • Managing and improving existing infrastructure.
  • Helping build the next generation of the platform using tools like Kubernetes, Terraform, and Docker.
  • Streamlining and automating deployment processes.
  • Working closely with the Security team to address potential threats and patches.
  • Supporting engineers and product teams to enhance scalability, stability, and reliability.

AWS, Python, Kubernetes, Go, Linux, Terraform

Posted 5 days ago

📍 LATAM

🧭 Full-Time

💸 51,850 - 116,650 USD per year

🔍 Remote Employment and Compliance Solutions

🏒 Company: Remote

👥 1001-5000

💰 $300,000,000 Series C almost 3 years ago

🫂 Last layoff over 2 years ago

Human Resources Services

Requirements:
  • Significant and demonstrated experience as a Senior Site Reliability Engineer, which includes architecting, implementing, and maintaining a Platform for other teams.
  • Solid knowledge and experience in Kubernetes, AWS (or similar Cloud Provider), and Terraform.
  • Knowledge of CI/CD tools (GitLab CI is preferred).
  • Experience with a back-end programming language (Elixir, Clojure, Java, Node.js, Python, etc.).
  • Experience with a programming language for SRE tooling (Go, Python).
  • Experience running and configuring Linux systems in a non-cloud environment.
  • Security knowledge from both defensive and offensive perspectives.
  • Excellent communication and interpersonal skills.
Responsibilities:
  • Managing and improving our existing infrastructure.
  • Helping build the next generation of our platform using tools like Kubernetes, Terraform, and Docker.
  • Streamlining and automating deployment processes.
  • Working closely with the Security team to address potential threats and patches.
  • Supporting engineers and product teams to improve overall scalability, stability, and reliability.

AWS, Python, Kubernetes, Go, CI/CD, Linux, Terraform

Posted 6 days ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117,600 - 252,000 USD per year

🔍 Software Development

🏒 Company: GitLab

👥 1001-5000

💰 $268,000,000 Series E over 5 years ago

🫂 Last layoff almost 2 years ago

Developer Tools, DevOps, Open Source, SaaS, Cloud Security

Requirements:
  • Advanced database platform management experience, preferably using PostgreSQL and ClickHouse at scale.
  • Advanced Cloud Infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators and Kubernetes.
  • Solid experience with at least one programming language: Go, Ruby or Python.
  • Advanced experience with Linux.
  • Extensive on-call experience as an SRE supporting mission critical systems.
  • Solid incident management experience across all phases.
  • Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.
Responsibilities:
  • Design, build, and maintain ClickHouse and PostgreSQL clusters.
  • Provision cloud infrastructure using configuration management and IaC tools.
  • Implement high-availability ClickHouse solutions.
  • Optimize PostgreSQL clusters for core applications.
  • Build monitoring and alerting tools to ensure resource optimization.
  • Respond to platform alerts and user emergencies.
  • Enhance infrastructure security and partner with compliance assessors.
  • Collaborate with engineering teams for service rollouts and architectural improvements.

PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

Posted 7 days ago

📍 United Kingdom

🧭 Contract

🔍 Restaurant industry

Requirements: not stated
Responsibilities:
  • Reporting to the Site Reliability Team Lead, work closely with Engineering and Product Managers.
  • Improve system availability.
  • Sharpen execution skills to provide an amazing experience for customers.

AWS, Docker, Python, SQL, CI/CD, DevOps, Microservices

Posted 13 days ago

📍 Colombia, USA

🧭 Contractor

🔍 Software outsourcing

🏒 Company: Teravision Technologies

👥 251-500

💰 about 13 years ago

Android, iOS, Mobile Apps, Information Technology, Software

  • Proven experience managing Kubernetes infrastructure.
  • Familiarity with CI/CD pipelines, particularly TeamCity and tools like SonarQube.
  • Hands-on experience with AWS services such as S3, Route 53, etc.
  • Strong understanding of backend systems and infrastructure management.
  • Excellent English communication skills and a Bachelor’s Degree in Computer Science or equivalent work experience.
  • Proven experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
  • Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
  • Prior experience in an on-call role and knowledge of monitoring and alerting tools to support on-call responsibilities.

AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

Posted 15 days ago

📍 Spain

🧭 Full-Time

💸 72,000 EUR per year

🔍 Mobility and transportation

Requirements:
  • Think Unix: you know the networking stack, the OSI model, containers (and schedulers), and you know your way around monitoring, logging, and the CAP theorem (bonus!).
  • Have strong programming skills in at least one language, and know your way around a few more or can learn them if the opportunity arises.
  • Automate yourself out of everything by nature, making machines do the toil.
  • Communicate effectively and asynchronously.
  • Care about the things that affect the company, your team, and yourself.
  • Embrace diversity and humbleness (and a bit of trolling).
  • Prefer taking iterative action over waiting for things to happen or to be perfect.
  • Strongly favor simplicity over complexity, i.e., adhere to the KISS principle.
  • Have a sense for identifying, exploiting and elevating bottlenecks.
  • Are not afraid of expressing yourself in English.
Responsibilities:
  • Evolving our infrastructure platform by building self-service components that will be used by the whole engineering team and by millions of users around the world.
  • Working closely with our Product and Infrastructure teams to architect and develop world-class infrastructure components.
  • Designing and implementing tooling to improve the availability, scalability, observability and latency of our services, which are used by internal customers to deploy and operate their services.
  • Increasing reliability awareness with other teams, helping with the adoption of reliability principles and reviewing observability implementations or software architectures.
  • Defining SLIs, SLOs and SLAs as part of the services' lifecycle.
  • Sharing an on-call schedule for the platform services you own.
  • Solving problems in our highly available platform together with other teams, then building automations to prevent incidents from happening again.
  • Participating in our recruiting process to help grow our engineering team.

Docker, Python, AWS EKS, Git, JavaScript, Kubernetes, Ruby, Go, Grafana, Prometheus, CI/CD, Linux, Microservices

Posted 16 days ago

📍 Canada, Chile

🔍 Technology

🏒 Company: Launchpad Technologies

Requirements:
  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
  • Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or similar roles.
  • Familiarity with monitoring tools and systems.
  • Proficient in scripting languages such as Python, Bash, or Ruby.
  • Experience with infrastructure automation tools such as Terraform, Ansible, or Chef.
  • Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
  • Strong knowledge of cloud platforms such as AWS, GCP, or Azure.
  • Excellent troubleshooting and analytical skills.
  • Strong communication skills and the ability to work effectively within a team.
Responsibilities:
  • Develop, maintain, and improve automated deployment, certification, and validation pipelines.
  • Define, implement, and monitor service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs).
  • Lead efforts to optimize, improve, and maintain the reliability and performance of the SaaS platform.
  • Manage third-party services and technologies used to support the SRE discipline.
  • Collaborate with senior management and the engineering team to lead SRE initiatives and provide updates.
  • Define and implement an observability framework to provide insights into system performance and behavior.
  • Implement proactive monitors and alerts to ensure system reliability and performance meet customer expectations.
  • Own operational incident management, providing support to related teams and individuals during incident resolution.
  • Identify and implement best practices for system reliability, security, scalability, and performance.
  • Participate in on-call rotations for system support, troubleshooting, and resolution.
  • Conduct post-mortem reviews of incidents, identify root causes, and implement remediation steps.
  • Develop and maintain documentation for systems, processes, and procedures.

AWS, Docker, Leadership, Python, Bash, GCP, Kubernetes, Ruby, Azure, Communication Skills, Analytical Skills, DevOps, Terraform, Documentation, Compliance, Troubleshooting

Posted 2 months ago

📍 Spain

🧭 Full-Time

🔍 Mobility services

🏒 Company: Cabify

👥 1001-5000

💰 $16,473,668 Debt Financing about 1 year ago

Internet, Logistics, Ride Sharing, Transportation, Mobile

Requirements:
  • Strong knowledge of Unix, networking stack, OSI model, containers, and monitoring.
  • Programming skills in at least one language; capability to learn others.
  • Natural tendency to automate tasks.
  • Effective and asynchronous communication skills.
  • Care for the company, team, and self.
  • Embrace diversity and humility.
  • Action-oriented and iterative problem solving.
  • Preference for simplicity over complexity.
  • Ability to identify and address bottlenecks.
  • Proficiency in English communication.
Responsibilities:
  • Evolving our infrastructure platform by building self-service components.
  • Working closely with Product and Infrastructure teams to develop infrastructure components.
  • Designing and implementing tooling for service availability, scalability, observability, and latency improvements.
  • Increasing reliability awareness with teams and reviewing implementations.
  • Defining SLIs, SLOs and SLAs as part of services' lifecycle.
  • Sharing an on-call schedule for owned platform services.
  • Solving problems in a highly available platform and building automations to prevent incidents.
  • Participating in the recruiting process to grow the engineering team.

AWS, AWS EKS, Kubernetes, Microservices, Networking

Posted 2 months ago

📍 US and Canada

🧭 Full-Time

💸 150,000 - 200,000 USD per year

🔍 Healthcare

🏒 Company: Synthesis Health

👥 51-100

💰 Seed about 2 years ago

Medical, Wellness, Health Care

Requirements:
  • Bachelor's degree or diploma in computer science, engineering, mathematics, or a related field.
  • At least one year of experience as a Python developer transitioning to an SRE role.
  • Five years of experience in software development as a DevOps and/or SRE.
  • Two years of experience in an SRE role with Kubernetes, preferably GKE.
  • Experience using ArgoCD for rollouts and deployments.
  • One year of experience with a service mesh like Istio in a GKE environment.
  • Proficiency in scripting languages like Python and automation tools like Terraform.
  • Solid understanding of security best practices for pipelines and cloud environments.
  • Familiarity with compliance standards like SOC 2 and HIPAA.
  • Strong expertise in CI/CD pipeline management.
Responsibilities:
  • Design and implement automated application deployment processes.
  • Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
  • Manage development, testing, staging, pre-production and production environments.
  • Automate repetitive deployment tasks to improve productivity.
  • Select, develop, and monitor CI/CD systems.
  • Oversee software automation across GCP.
  • Containerize services to optimize resources and deployment speed.
  • Manage and optimize cloud infrastructure for cost and performance.
  • Ensure compliance with security standards and maintain disaster recovery plans.
  • Collaborate with cross-functional teams to improve software delivery.

Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

Posted 3 months ago