Senior Site Reliability Engineer

Posted over 1 year ago

🔍 Industry: Financial risk management

🗣️ Languages: English

🪄 Skills: Docker, Python, Kubernetes, C (Programming language)

Requirements:
  • A bachelor's degree in computer science, information systems, or the equivalent combination of education, experience, and training
  • Fluency in English, both written and spoken
  • 4+ years of experience with AWS or Azure
  • Experience with automation, infrastructure-as-code, Terraform, Ansible, runbooks and troubleshooting guides
  • Experience with virtualization, container technologies and orchestration (Docker, Kubernetes)
  • Programming skills (Go, Python, or similar languages)
  • Experience with CI/CD pipelines
  • Experience with monitoring, troubleshooting and guiding on incidents
  • Self-driven and motivated, with a strong work ethic and a passion for problem-solving
Responsibilities:
  • Build and maintain tools for deployment, monitoring, operations, and analytics
  • Development with Go, Python, or similar languages
  • Document and guide engineers through playbooks and troubleshooting guides
  • Contribute to application self-healing in a cloud-based environment
  • Leverage, configure and troubleshoot cloud resources in AWS
  • Migrate and operate workloads in Kubernetes
  • Participate in incident response, root cause investigation, and resolution
  • Maintain and develop our infrastructure as code (IaC) to manage and operate end-to-end lifecycle operations (monitoring, alerting, security, cost optimization, configuration, backup, etc.) in production environments
  • Utilize your experience and problem-solving skills to help prevent and investigate production issues
  • Communicate with team members and stakeholders in a globally distributed and asynchronous environment
  • Investigate, describe, and drive improvements on current infrastructure, promoting evolution and sharing knowledge amongst the team

Related Jobs

📍 USA

🧭 Full-Time

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page

👥 1000-5000

  • 5+ years of experience building, iterating upon, and maintaining corporate IAM systems
  • 5+ years of experience with operational procedures and application development
  • Deep domain-knowledge with prominent cloud identity provider(s): Okta, Duo, Google Workspace, Azure AD, Ping, etc.
  • Demonstrated success developing and implementing tooling that solves problems related to: identity lifecycle and provisioning, SSO, MFA, ABAC, RBAC, directory services, zero trust networking, PAM, PIM, and secrets management
  • Experience configuring and implementing modern open source tooling such as: Terraform, Ansible, Kubernetes, Docker
  • Fluency in a modern programming language (Golang, Python, Ruby, Java, C# etc.)
  • Strong experience using and managing AWS, GCP, Azure, or other cloud environment with IaC
  • Strong understanding of CI/CD workflows, automation frameworks, and best practices
  • Clear communication: demonstrate ability to explain technical concepts simply
  • Self-starter: possess a continuous learning mindset
  • Demonstrate critical thinking under pressure
  • Engage in a dynamic role that combines traditional operations responsibilities and active contributions to the development and deployment of cloud-native applications, fostering a DevOps culture that emphasizes collaboration and automation
  • Partner across Coinbase to design, implement, and maintain performant, reliable, and secure system architectures
  • Provide corporate IAM and DevOps tooling subject matter expertise to adjacent IT, Security, and Engineering teams
  • Implement automation tooling and scripts to eliminate manual, repetitive tasks and reduce inefficiencies in system operations
  • Create comprehensive documentation and runbooks that detail system configurations, operational procedures, and troubleshooting steps across system lifecycle
  • Build and maintain CI/CD pipelines for integrating changes and deploying to production in progressively tested environments
  • Deliver configurations and maintain state using configuration management tools
  • Facilitate incident response, conduct root cause analysis, and blameless retrospectives
  • Define metrics and bolster monitoring/observability across corporate IAM systems
  • Participate in regular on-call rotation to ensure 24x7 uptime for critical systems

AWS, Docker, Python, Cloud Computing, Kubernetes, LDAP, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted about 10 hours ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros

👥 11-50

💰 $17,000,000 about 2 years ago

Cryptocurrency

  • An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
  • Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
  • A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
  • Knowledge and experience in managing configuration at scale.
  • Experience with CI/CD pipelines and version control best practices.
  • Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
  • Strong knowledge of cloud security and IAM policies is required.
  • SIEM and threat management experience.
  • Must know how to secure Mac and Linux endpoints.
  • Python and Bash experience is a must.
  • Participate in on-call roster to support our trading operations.
  • Maintain and improve our global infrastructure with high performance and reliability requirements.
  • Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
  • Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
  • Development of internal tools and automation to accomplish the team's goals.
  • Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
  • Active participation in various trading and infrastructure projects.
  • Work closely with developers, traders and other staff to accomplish our firm's goals.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

Posted 1 day ago

📍 Brazil

🧭 Full-Time

🔍 Software Development

🏢 Company: Supermetrics

👥 251-500

💰 $47,174,818 Series B over 4 years ago

SaaS, Analytics, B2B, Marketing, Enterprise Software, Software

  • 7+ years of experience in Site Reliability Engineering, Platform Engineering, or related roles
  • In-depth understanding of containers and experience operating Kubernetes clusters at scale.
  • Experience operating databases in production
  • Proficient in database concepts with practical experience in both relational and NoSQL databases.
  • In-depth knowledge of Linux systems and Terraform.
  • In-depth experience and understanding of AWS and GCP
  • Solid understanding of modern observability practices and tools
  • Automation mindset with the ability to automate repetitive tasks using scripting languages such as Python or Bash.
  • Collaborative approach to working with others
  • Willing to take on-call rotations during non-business hours
  • Good communication skills, in particular in writing (documentation, but able to write good PRs too)
  • Skilled problem-solving abilities with a keen interest in the tools, technologies and problems in this space
  • A developer background and the ability to write CLIs and other tools in Go, Python, or Rust.
  • Security mindset with experience implementing security best practices in platform and operational contexts.
  • Experience in creating and managing Helm charts.
  • Expert knowledge of continuous integration and continuous deployment (CI/CD) systems and processes and experience developing and maintaining GitHub Actions.
  • Write Terraform configuration and modules that bootstrap a Kubernetes cluster, or review PRs with contributions from other members, making sure that our modules are truly reusable and well-defined, improving how we test and release them.
  • Write (using Golang, for example) and maintain or improve our tooling, ensuring it facilitates platform utilization by engineering teams.
  • Develop and maintain Helm charts for internal deployments and third-party software.
  • Respond to incidents in our production environment.
  • Support our pre-sales team and assist them in answering potential customers' questions on our architecture and how we guarantee data security or consistency or ensure uptime.
  • Review an architecture change involving a new database and take part in the meetings discussing the pros and cons of such an approach.
  • Rewrite a GitHub Action to improve how we deploy to Kubernetes using GitOps.
  • Fix technical issues as they arise.
  • Participate in our on-call rotations to provide support, respond to incidents, or handle internal users' questions.

AWS, Docker, PostgreSQL, Python, SQL, Bash, Elasticsearch, GCP, Kubernetes, MySQL, ClickHouse, Grafana, Prometheus, Redis, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Ansible, Software Engineering

Posted 6 days ago

📍 United States

🧭 Full-Time

💸 117,725 - 138,465 USD per year

🔍 Software Development

  • 3+ years of experience in software engineering, with 2 years of experience in DevOps
  • Cloud provider (AWS, GCP, Azure) experience managing resources through Infrastructure as Code (Terraform)
  • Container orchestration (ECS or K8s) experience to confidently build, test, and release containerized applications for multiple environments and regions
  • Knowledge of observability best practices across common cloud resources (EC2, ECS, RDS, DynamoDB, S3, SQS, EventBridge), with experience rolling out enhancements across a distributed platform with scale in mind
  • Experience with shell scripting for *nix systems
  • Experience with Networking for web applications
  • Effective at communicating ideas through writing and diagramming
  • Comfortable working with a distributed development and ops team
  • Familiarity with AWS (ECS and cloud hosting); GitLab (CI/CD); Python (Django, Flask, aiohttp); Bash; data stores (PostgreSQL, Redis); monitoring (Datadog and Sentry); IaC (Terraform, Packer)
  • Manage and orchestrate Cloud Resource (AWS) configuration using Infrastructure As Code (Terraform) to empower engineering staff to embrace a DevOps culture of Self Service Ownership
  • Develop and govern Observability (Datadog) best practices for tracking platform performance and health trends to meet customer SLAs and lead technical decisions with strong supporting evidence
  • Create solutions that dynamically scale based on demand with enough flexibility to pivot for fast changing project requirements while maintaining a balance of good versus perfect
  • Provide strong and consistent communication updates on technical progress or blockers to keep stakeholders informed while additionally creating appropriate documentation on technical design to spread knowledge and reduce information silos
  • Participate in the 24/7 on-call rotation, respond to critical alerts, and follow documented incident investigation procedures to reestablish customer-facing feature availability
  • Maintain HIPAA, GDPR, and SOC 2 compliance and general security through best-practice implementation

AWS, PostgreSQL, Python, Bash, Cloud Computing, Git, Kubernetes, *Nix, Redis, CI/CD, DevOps, Terraform

Posted 7 days ago

📍 Australia, New Zealand

🔍 Software Development

Full working rights and residency in Australia or New Zealand.
Significantly impact the developer experience, enabling developers across our R&D teams to ship scalable applications with high speed, quality, and performance.

Docker, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, NodeJS, Scripting

Posted 8 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune

👥 101-250

  • Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
  • Experience with infrastructure-as-code and orchestration tools.
  • Strong understanding of system performance, debugging, and optimization across diverse environments.
  • Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
  • Solid foundation in computer science fundamentals and system design.
  • Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
  • 5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
  • Experience with distributed systems and managing large-scale, high-availability environments.
  • Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
  • Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
  • Experience with bare-metal infrastructure.
  • Proficiency in scripting or programming languages such as Python, Go, or Bash.
  • Experience with monitoring and observability tools for infrastructure performance.
  • Familiarity with cloud cost management and performance improvement strategies.
  • Strong analytical and troubleshooting skills.
  • Experience working across multiple time zones.
  • Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
  • Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
  • Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
  • Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
  • Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

Posted 12 days ago

📍 Germany, Spain, Portugal

🏢 Company: Jobgether

👥 11-50

💰 $1,493,585 Seed about 2 years ago

Internet

  • 5+ years of experience in a Site Reliability Engineer or similar role.
  • 3+ years of experience with AWS services and container orchestration tools.
  • 2+ years of Kubernetes experience.
  • Strong knowledge of observability tools and principles (monitoring, logging, tracing).
  • Hands-on experience with Terraform for infrastructure as code.
  • Proficiency in at least one programming language (e.g., Python, Go, Java).
  • Experience in incident management, postmortem analysis, and risk mitigation.
  • Familiarity with messaging systems like SNS, SQS, and experience with CI/CD tools.
  • Develop and maintain systems that are reliable, scalable, and efficient.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure optimal system performance.
  • Conduct blameless post-incident reviews, identify root causes, and implement preventive actions.
  • Automate operational tasks, incident responses, and contribute to system performance optimizations.
  • Work with engineering teams to ensure systems are designed for reliability, scalability, and maintainability.
  • Continuously evaluate and improve system performance, capacity, and cost efficiency.
  • Participate in the on-call rotation, providing troubleshooting and resolution support for critical issues.

AWS, Python, Java, Kubernetes, Go, CI/CD, RESTful APIs, Linux, Terraform, Scripting

Posted 13 days ago

📍 United States

🔍 Blockchain

🏢 Company: IO Global

  • 7+ years of experience in SRE, DevOps, or a related role.
  • Understanding of SRE best practices, architectures, and methods.
  • Good knowledge of resiliency patterns and cloud security.
  • Strong programming proficiency in Python, Golang, or JavaScript.
  • Demonstrated experience with AWS and modern cloud architectures.
  • Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD
  • Hands-on experience with Kubernetes/EKS and GitOps methodologies.
  • Proven track record with monitoring tools such as Prometheus, OpenTelemetry, as well as familiarity with the LGTM stack, or other comparable tools
  • Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
  • Strong problem-solving skills and capability to work on complex systems
  • Experience in working within an Agile environment
  • Experience in working with a distributed team
  • Strong communication and collaboration abilities to work seamlessly across different teams.
  • A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.
  • Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
  • Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
  • Leverage GitOps principles to automate deployments and manage container orchestration.
  • Implement and manage CI/CD pipelines ensuring seamless, high-quality deployments, finding and removing bottlenecks, improving performance and working alongside teams to refine feedback loops and automate toil away.
  • Develop automation tools and scripts to improve operational efficiency.
  • Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
  • Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
  • Collaborate with dev teams to define and implement SLOs/SLIs
  • Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
  • Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
  • Evaluate and adopt new technologies, with a special advantage for candidates with blockchain experience, to keep our systems at the cutting edge.
  • Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
  • Strive to strike a balance between effective delivery of goals and a measurable high standard of these goals. Always apply a layer of polish and due diligence when delivering.

AWS, Docker, Python, Agile, Blockchain, Cloud Computing, JavaScript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

Posted 13 days ago

📍 Cyprus, Montenegro, Georgia, Serbia, Poland

🔍 Software Development

🏢 Company: CloudLinux

  • Strong background in development: the ideal candidate started their career as a developer, then moved to large-scale infrastructure projects.
  • Proven experience as a leading SRE or in a similar role, with a strong focus on Linux environments.
  • Proficiency in modern agile SDLC practices and principles, orchestration, and CI/CD tooling, e.g. Python, Java, Terraform, Ansible, CloudFormation, Puppet, Chef, or similar.
  • Knowledge of the Grafana ecosystem or similar, building dashboards, alert rules, PromQL, as well as frontend observability.
  • Excellent technical knowledge of IT Infrastructure, including network and application load balancers, switches, routers, and IP addressing.
  • Strong analytical and problem-solving skills with a focus on root cause analysis and mitigation.
  • Excellent communication and teamwork skills with the ability to collaborate effectively across engineering teams.
  • Design, implement, and manage scalable, resilient, and secure company-wide repository infrastructure for CloudLinux products as a first assignment.
  • Automate software operations for re-usability and consistency across private and public clouds, taking into consideration the complexities of distributed systems.
  • Monitor system performance and troubleshoot issues proactively to ensure optimal uptime and reliability.
  • Automate deployment processes using Infrastructure as Code (IaC) principles.
  • Share your experience, know-how, and best practices with other team members in design sessions, system architecture discussions, mentorship, and "doing work together".

Python, Bash, Cloud Computing, Kubernetes, Nginx, Grafana, Prometheus, Release Management, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted 15 days ago

📍 United Kingdom

🔍 Software Development

🏢 Company: StarRez

👥 251-500

💰 Private about 3 years ago

Consulting, SaaS, Property Management, Software

  • Bachelor's degree in Computer Science, Information Technology, or similar
  • 1+ years of experience working on a SaaS platform
  • Proven experience (2+ years) in a Platform Engineering, Site Reliability Engineering or Software Engineering role.
  • Proficiency in at least one object-oriented programming language (C# preferable)
  • Production experience operating containerization technologies (Kubernetes).
  • Proficiency with one or more public cloud providers such as Azure, AWS or GCP
  • Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
  • Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
  • Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar.
  • Proven track record of maintaining highly-available and performant production environments.
  • Ability to identify and implement effective mitigation strategies and operational playbooks.
  • Provide technical leadership and mentoring within the team through knowledge sharing sessions, pair programming, code reviews and solution design
  • Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
  • Implement and maintain monitoring/alerting/logging systems to identify and respond to incidents
  • Conduct/participate in Root Cause Analyses (RCAs) and blameless post-mortems
  • Participate in on-call rotations to ensure system reliability and rapid incident response.
  • Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth
  • Conduct performance tests to identify and remediate bottlenecks
  • Develop and maintain platform solutions, automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
  • Monitor, review and tune databases to ensure high availability and performance
  • Collaborate with product engineering teams to design/build fit-for-purpose and observable software
  • Contribute and collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs) as required

AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

Posted 15 days ago