Senior Site Reliability Engineer (SRE)

Posted 2024-11-22

💎 Seniority level: Senior

🔍 Industry: Commission management

Requirements:
  • Thoughtful, pragmatic approach to engineering.
  • Ability to balance correctness with speed when executing tasks.
  • Commitment to iterative work and continuous improvement.
  • Strong written communication skills that build institutional knowledge.
Responsibilities:
  • Operate across the engineering organization to support development teams.
  • Provide the tools and processes teams need to work effectively.
  • Ensure high-quality service for customers and communicate issues in a timely manner.
  • Deliver infrastructure, platform, reliability, and observability support.

Related Jobs

📍 Poland

🔍 IT and Security

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with sustainable incident response in a blameless environment
  • Experience with configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
  • Background in Linux Systems Engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless
  • Comfortable with a high level of autonomy and working with a distributed team

Responsibilities:
  • Engage with teams to improve service delivery and reliability across the entire service lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities

AWS, Node.js, Design Patterns, JavaScript, Kibana, TypeScript, Grafana, Prometheus, Linux, Terraform

Posted 2024-11-27

🧭 Full-Time

🔍 Software

Requirements:
  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering with a focus on disaster recovery.
  • Strong expertise in designing disaster recovery solutions.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code tools like Terraform or CloudFormation.
  • Excellent communication skills to collaborate with teams and explain technical concepts.

Responsibilities:
  • Design, implement, and maintain disaster recovery solutions for a cloud-based SaaS environment.
  • Ensure rapid recovery from system failures or disasters.
  • Develop and document disaster recovery plans and procedures.
  • Conduct drills and exercises to validate disaster recovery plans.
  • Collaborate with engineering, operations, and security teams to identify and mitigate risks.
  • Monitor system performance and identify areas for improvement.
  • Participate in incident response and post-incident reviews.
Posted 2024-11-21

🧭 Full-Time

🔍 Software

Requirements:
  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, focusing on disaster recovery.
  • Strong expertise in designing and implementing disaster recovery solutions.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code (IaC) tools like Terraform or CloudFormation.
  • Excellent communication skills to collaborate with cross-functional teams.

Responsibilities:
  • Design, implement, and maintain disaster recovery solutions for a cloud-based SaaS environment.
  • Develop and document disaster recovery plans, procedures, and runbooks.
  • Conduct drills to test and validate the effectiveness of recovery plans.
  • Collaborate with teams to identify and mitigate risks to system availability.
  • Monitor system performance and implement measures for reliability and resilience.
  • Participate in incident response and post-incident analysis to prevent recurrence.
Posted 2024-11-20

🧭 Contract

Requirements:
  • Minimum of 5-7 years of experience in Site Reliability Engineering or related fields.
  • Proven experience in designing and implementing fault-tolerant, scalable systems at an enterprise level.
  • Deep understanding of DFR, FMEA, MTBF, and other reliability methodologies.
  • Proficiency with tools such as Datadog, PagerDuty, Marvin, and Backstage, as well as pipeline deployment processes and rollback procedures.
  • Strong coding skills in one or more programming languages commonly used in SRE.
  • Exceptional analytical skills to investigate complex issues and devise effective solutions.
  • Willingness to learn new products and tools provided by the company.
  • Excellent communication skills and ability to work effectively within a distributed team environment.

Responsibilities:
  • Identify and resolve complex bugs by working within the codebase and utilizing runbooks.
  • Write and maintain code to enhance system reliability, scalability, and performance.
  • Restart services and implement changes to the codebase as required.
  • Investigate complex system issues and develop effective resolutions.
  • Design and build fault-tolerant, scalable systems for high availability and performance.
  • Apply advanced methodologies like Design for Reliability (DFR), Failure Mode and Effects Analysis (FMEA), and Mean Time Between Failures (MTBF).
  • Develop and maintain reliability standards and documentation.
Posted 2024-11-12

📍 LATAM

🔍 AI development tools

Requirements: not stated

Responsibilities:
  • Set up and maintain infrastructure standards.
  • Play a pivotal role in tool development both externally and internally.
  • Help deploy software to enterprise customers.
  • Establish strong partnerships with enterprise customers to boost satisfaction.
  • Manage variances in infrastructure types and implement suitable solutions.

Leadership, Cloud Computing, Git, Kubernetes, Cross-functional Team Leadership, Communication Skills, Analytical Skills

Posted 2024-11-10

🧭 Full-Time

🔍 Blockchain technology

🏢 Company: Core Scientific

Requirements:
  • 5+ years’ experience in SRE, DevOps, and/or Infrastructure Engineering.
  • Excellent communication and interpersonal skills; team player.
  • Strong analytical and troubleshooting skills.
  • Ability to define, capture, and interpret product/system requirements.
  • Strong experience using Infrastructure as Code, Configuration Management, & Orchestration tools such as Terraform, Helm, Kustomize, and Ansible.
  • Strong understanding of cloud environments, primarily AWS.
  • Strong experience with Kubernetes and virtualization.
  • Strong experience with build and release management using GitHub Actions, Makefiles, and Python scripts.
  • Strong experience with telemetry, including metrics, logs, and traces.
  • Intermediate scripting ability in Bash, Python, and Make.
  • Basic networking knowledge including OSI model, TCP/UDP, DHCP, DNS, routing, and HTTP.

Responsibilities:
  • Define, capture, and interpret product/system requirements.
  • Build, integrate, test, monitor and deploy code across cloud and on-premises infrastructure.
  • Write plans, coordinate, and deploy applications via automation.
  • Write documentation to share knowledge with the team.
  • Drive secure, immutable infrastructure through infrastructure as code and security practices.
  • Drive best practices across the Site Reliability Engineering team.
  • Foster open, respectful, and professional communication within the team and across the organization.
  • Perform other duties as assigned.
Posted 2024-11-07

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

Requirements:
  • Proficiency in programming languages such as Python, Go, and JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.

Responsibilities:
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 2024-11-07
