Senior Site Reliability Engineer

Posted 7 days ago

📍 Location: Colombia, USA

🔍 Industry: Software outsourcing

🏢 Company: Teravision Technologies | 👥 251-500 | 💰 about 13 years ago | Android, iOS, Mobile Apps, Information Technology, Software

🗣️ Languages: English

🪄 Skills: AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

Requirements:
  • Proven experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
  • Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
  • Hands-on experience with AWS services such as S3, Route 53, and others.
  • Strong understanding of backend systems and infrastructure management.
  • Excellent English communication skills are a must.
  • Bachelor’s Degree in Computer Science or equivalent work experience.
Responsibilities:
  • Proven experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
  • Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
  • Prior experience in an on-call role and knowledge of monitoring and alerting tools.

Related Jobs

📍 Americas

🧭 Contract

🔍 Digital paper and learning solutions

  • Strong experience working in an AWS-hosted environment.
  • Experience in supporting production workloads and firefighting.
  • Knowledge of SRE best practices and common issues.
  • Experience with system monitoring tools.
  • Understanding and experience with distributed databases.
  • Solid understanding of Linux and Networking fundamentals.
  • Background in back-end development, including API usage and creation.
  • Knowledge of Security for network and containers.
  • Understanding of container orchestration, especially Kubernetes.
  • Experience managing Relational and Non-relational databases, including backup and restore operations.
  • Familiarity with automation/configuration management tools, preferably CDK or Terraform.
  • Design, build, and maintain the Goodnotes infrastructure, adhering to Dickerson's Hierarchy of Reliability.
  • Design, refine, and execute new and existing playbooks.
  • Educate various teams in SRE best practices, assisting in design and capacity planning.
  • Serve as the go-to person for higher-level escalation for applications.
  • Improve SLAs and optimize latency and error rates.
  • Enhance system monitoring, health reporting, and logging.
  • Implement and maintain security practices.

AWS, PostgreSQL, Kotlin, Kubernetes, MongoDB, TypeScript, Go, CI/CD, Linux, Terraform, Microservices, Networking

Posted about 1 month ago

📍 US and Canada

🧭 Full-Time

💸 150000 - 200000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health | 👥 51-100 | 💰 Seed about 2 years ago | Medical, Wellness, Health Care

  • Bachelor's degree or Diploma in computer science, engineering, mathematics, or related field.
  • At least one year of experience as a Python developer transitioning to an SRE role.
  • Five years of experience in software development as a DevOps engineer and/or SRE.
  • Two years of experience in an SRE role with Kubernetes, preferably GKE.
  • Experience using ArgoCD for rollouts and deployments.
  • One year of experience with a service mesh like Istio in a GKE environment.
  • Proficiency in scripting languages like Python and automation tools like Terraform.
  • Solid understanding of security best practices for pipelines and cloud environments.
  • Familiarity with compliance standards like SOC 2, HIPAA.
  • Strong expertise in CI/CD pipeline management.
  • Design and implement automated application deployment processes.
  • Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
  • Manage development, testing, staging, pre-production and production environments.
  • Automate repetitive deployment tasks to improve productivity.
  • Select, develop, and monitor CI/CD systems.
  • Oversee software automation across GCP.
  • Containerize services to optimize resources and deployment speed.
  • Manage and optimize cloud infrastructure for cost and performance.
  • Ensure compliance with security standards and maintain disaster recovery plans.
  • Collaborate with cross-functional teams to improve software delivery.

Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

Posted 2 months ago

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

  • At least 5 years of software engineering experience.
  • Strong understanding of data structures and algorithms related to performance and reliability.
  • Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
  • Strong skills around observability, debugging, and performance tuning.
  • Ability to debug complex systems and willingness to understand and improve any layer of the stack.
  • Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (DataDog, Graphite, Grafana, Prometheus).
  • Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
  • Strong communication skills and ability to explain technical concepts clearly.
  • Demonstrated critical thinking under pressure.
  • Build automation and improve systems to eliminate toil and operations work.
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Collaborate with the core infrastructure team to performance tune and optimize cloud deployments.
  • Collaborate with product teams to reduce service disruptions and automate incident response.
  • Proactively find and analyze reliability problems and design software for improvements.
  • Facilitate incident response, conduct root cause analysis, and blameless retrospectives.
  • Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform

Posted 3 months ago

📍 United States

💸 130000 - 170000 USD per year

🔍 Data-Powered Marketing Cloud

🏢 Company: Zeta Global | 👥 1001-5000 | 💰 $105,263,174 Post-IPO Equity 5 months ago | Information Services, Advertising, Analytics, Marketing

  • 7+ years of experience as an SRE.
  • 3+ years of software development experience, emphasizing automation.
  • Hands-on experience with Infrastructure as Code (IaC) tools.
  • Experience with distributed systems and microservices architecture.
  • Production experience with distributed tracing.
  • Proficiency in Python and Bash scripting.
  • Solid understanding of SLIs, SLOs, and error budgets.
  • Experience with CI/CD practices and platforms such as GitOps or Jenkins.
  • Expertise in incident management and root cause analysis.
  • Knowledge of modern deployment strategies like Canary and Blue-Green.
  • Familiarity with resiliency patterns such as circuit breakers and load balancing.
  • Experience with SQL and NoSQL databases in distributed systems.
  • Proficiency in statistical analysis related to metrics.
  • Experience with high-performance and low-latency systems.
  • Experience with cloud cost optimization strategies.
  • Familiarity with distributed messaging systems like Kafka.
  • Strong understanding of security and compliance standards in SRE.
  • Implement and manage service level objectives (SLOs), service level indicators (SLIs), and error budgets.
  • Lead and promote postmortems, driving robust root cause analysis for continuous system improvement.
  • Analyze historical data to identify areas for improvement.
  • Implement full observability using tools like OpenTelemetry, Honeycomb, New Relic, or Datadog.
  • Reduce toil through runbook automation and record key MTTx metrics.
  • Lead design sessions focusing on capacity planning and automation.
  • Collaborate with product teams to enhance reliability and engage in strategic initiatives.

Python, Software Development, SQL, Bash, Jenkins, Kafka, NoSQL, CI/CD, DevOps, Microservices, Compliance

Posted 3 months ago

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

  • Proficiency in programming languages such as Python, Go, or JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 3 months ago

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

  • Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
  • Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
  • Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
  • Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
  • Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
  • Enthusiasm for mentoring and sponsoring less-experienced engineers.
  • Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
  • Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
  • Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
  • Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
  • Participate in and continuously improve on-call and incident response processes.

AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React

Posted 4 months ago

📍 North America

🧭 Full-Time

🔍 Incident Management Platform

🏢 Company: Rootly | 👥 11-50 | 💰 $12,000,000 Series A over 1 year ago | Developer Tools, Developer Platform, Productivity Tools, SaaS, Information Technology, Software

  • You have 5+ years of experience in an SRE or Infrastructure Engineering role.
  • 5+ years of experience writing software as a SWE or in a software-heavy SRE role.
  • You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
  • You’ve supported web or RPC services at a significant scale.
  • You have experience solving infrastructure problems by writing software.
  • You have a big-picture perspective on systems and tools.
  • You can collaborate with other Engineering teams to understand their systems and help to improve them.
  • Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
  • Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
  • Build tools to support our processes.
  • Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
  • Work with other teams around Engineering to understand their systems and their challenges at the code level and identify improvements in Rootly Infrastructure to improve the services they own (contribute code where possible).

AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD

Posted 5 months ago