Senior Site Reliability Engineer (SRE)

Posted 2024-11-07

💎 Seniority level: Senior, 5+ years

🔍 Industry: Blockchain and Financial Technology

🏢 Company: Core Scientific

🗣️ Languages: English

⏳ Experience: 5+ years

Requirements:
  • 5+ years’ experience in SRE, DevOps, and/or Infrastructure Engineering.
  • Excellent communication and interpersonal skills.
  • Strong analytical and troubleshooting skills.
  • Experience with Infrastructure as Code, Configuration Management, and Orchestration tools such as Terraform, Helm, Kustomize, and Ansible.
  • Understanding of cloud environments, primarily AWS.
  • Experience with Kubernetes and virtualization technologies.
  • Proficiency in build and release management with tools like GitHub Actions.
  • Understanding of telemetry including metrics, logs, and traces.
  • Intermediate scripting skills in Bash, Python, and Make.
  • Basic knowledge of networking protocols.
Responsibilities:
  • Define, capture, and interpret product/system requirements.
  • Build, integrate, test, monitor, and deploy code across cloud and on-premises infrastructure.
  • Write plans, coordinate, and automate application deployment.
  • Document processes and share knowledge with the team.
  • Promote secure, immutable infrastructure through best practices.
  • Encourage effective communication within the team and across the organization.
  • Perform additional duties as assigned.

Related Jobs

🧭 Full-Time

🔍 Software / SaaS

Requirements:
  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
  • Experience in a cloud-based SaaS environment.
  • Strong expertise in designing and implementing disaster recovery solutions using industry-leading technologies and methodologies.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code (IaC) tools such as Terraform or CloudFormation.
  • Excellent communication skills with the ability to effectively collaborate with cross-functional teams and communicate technical concepts to non-technical stakeholders.

Responsibilities:
  • Design, implement, and maintain disaster recovery solutions for cloud-based SaaS environments.
  • Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
  • Conduct drills and exercises to test and validate the effectiveness of these plans.
  • Collaborate with engineering, operations, and security teams to identify and mitigate potential risks to system availability and data integrity.
  • Monitor system performance and health metrics; proactively identify areas for improvement.
  • Implement preventive measures to enhance system reliability and resilience.
  • Participate in incident response and post-incident reviews; analyze root causes of failures.
  • Implement corrective actions to prevent recurrence.
Posted 2024-11-21

🧭 Full-Time

🔍 Software Development

Requirements:
  • Degree in Computer Science, Information Technology, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, ideally with a focus on disaster recovery.
  • Strong expertise in designing and implementing disaster recovery solutions using leading technologies.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Experience with infrastructure as code (IaC) tools like Terraform or CloudFormation.
  • Excellent communication skills for collaboration with cross-functional teams and non-technical stakeholders.

Responsibilities:
  • Design, implement, and maintain disaster recovery solutions for a cloud-based SaaS environment.
  • Develop and document comprehensive disaster recovery plans, procedures, and runbooks.
  • Conduct drills and exercises to validate the effectiveness of disaster recovery plans.
  • Collaborate with engineering, operations, and security teams to identify and mitigate risks.
  • Proactively monitor system performance and health metrics, and implement preventive measures.
  • Participate in incident response and post-incident reviews to analyze root causes and implement corrective actions.
Posted 2024-11-20

🧭 Contract

Requirements:
  • Minimum of 5-7 years of experience in Site Reliability Engineering or related fields.
  • Proven experience designing and implementing fault-tolerant, scalable systems.
  • Deep understanding of reliability methodologies and metrics such as Design for Reliability (DFR), Failure Mode and Effects Analysis (FMEA), and Mean Time Between Failures (MTBF).
  • Proficiency with tools such as Datadog, PagerDuty, Marvin, and Backstage.
  • Strong coding skills in one or more programming languages relevant to SRE.
  • Exceptional analytical skills for complex issue investigation.
  • Willingness to learn new products and tools.
  • Excellent communication skills for a distributed team environment.

Responsibilities:
  • Identify and resolve complex bugs within the codebase.
  • Enhance system reliability, scalability, and performance through code maintenance.
  • Restart services and implement necessary code changes.
  • Investigate complex system issues and develop resolutions.
  • Design and build fault-tolerant, scalable systems for high availability.
  • Apply methodologies like DFR, FMEA, and MTBF.
  • Develop and maintain reliability standards and documentation.
Posted 2024-11-12

📍 LATAM

🔍 AI developer tools

NOT STATED

Responsibilities:
  • Report to the Enterprise Engineering Manager.
  • Set up and maintain infrastructure standards.
  • Play a pivotal role in developing both external and internal tools.
  • Enable deployment of software to enterprise customers.
  • Establish robust technical excellence for a diversified customer base.
  • Manage variances in infrastructure types and implement suitable solutions.
  • Provide high-quality solutions to customers.

Leadership, Cloud Computing, Git, Kubernetes, Cross-functional Team Leadership, Communication Skills, Analytical Skills

Posted 2024-11-10

📍 US

🧭 Full-Time

💸 198,000 - 220,000 USD per year

🔍 Blockchain, Cryptocurrency

🏢 Company: Uniswap Labs

Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in site reliability engineering, DevOps, or related fields.
  • Strong understanding of reliability engineering principles and tools.
  • Proficiency in monitoring tools such as Prometheus, Grafana, and Nagios.
  • Experience with cloud platforms (AWS, Azure, GCP) and containerization and orchestration tools (Docker, Kubernetes).
  • Proficiency in scripting and automation tools such as Python, Bash, Ansible, or Terraform.

Responsibilities:
  • Design, implement, and maintain systems for reliability, availability, and performance of services.
  • Develop and manage monitoring, alerting, and incident response strategies.
  • Conduct root cause analysis of failures.
  • Collaborate with cross-functional teams on reliability practices.
  • Drive improvements and innovations in systems and processes.

AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Collaboration, CI/CD, DevOps

Posted 2024-11-07

📍 America

🧭 Contract

🔍 Digital paper solutions and learning ecosystem

🏢 Company: Goodnotes

Requirements:
  • Strong experience working in AWS-hosted environments.
  • Experience supporting production workloads and firefighting.
  • Knowledge of SRE best practices and common issues.
  • Proficiency with system monitoring tools.
  • Understanding of and experience with distributed databases.
  • Background in Linux and networking fundamentals.
  • Experience in back-end development, including API usage and creation.
  • Knowledge of security for networks and containers.
  • Understanding of container orchestration, especially Kubernetes.
  • Experience managing relational and non-relational databases, including backup and restore operations.
  • Familiarity with automation/configuration management tools, preferably CDK and/or Terraform.

Responsibilities:
  • Design, build, and maintain the Goodnotes infrastructure according to Dickerson’s Hierarchy of Reliability.
  • Refine and execute new and existing playbooks.
  • Educate teams on SRE best practices including design and capacity planning.
  • Act as a higher-level escalation point for applications.
  • Optimize latency and error rates and improve SLAs.
  • Enhance system monitoring, health reporting, and logging.
  • Implement security practices and maintain information security.
  • Participate in the on-call rotation during Americas time zone hours.

Linux

Posted 2024-11-07

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

Requirements:
  • Proficiency in programming languages such as Python, Go, and JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.

Responsibilities:
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 2024-11-07
