Senior Site Reliability Engineer

Posted 2024-10-05

📍 Location: Ireland

🔍 Industry: Enterprise Technology Management

🏢 Company: Oomnitza · 👥 51-100 · 💰 $20.0m Series B on 2021-08-26 · SaaS, Information Technology, Enterprise Software, Software

🪄 Skills: AWS, Python, Amazon RDS, JavaScript, Kafka, Kubernetes, MySQL, RabbitMQ, Go, Grafana, Prometheus, Collaboration, CI/CD

Requirements:
  • Extensive experience with container orchestration and managing production clusters, focusing on deployment, scaling, and troubleshooting within Kubernetes environments.
  • Proficiency in tools like Ansible, Helm, and Kustomize for automating infrastructure provisioning, configuration, and deployment.
  • Experience with Prometheus, Grafana, or similar to proactively track system health, detect anomalies, and optimize performance across the platform.
  • Deep knowledge of the AWS ecosystem, including EC2, S3, IAM, VPC, and other essential services for building and managing scalable infrastructure.
  • Hands-on experience with Terraform to provision and manage cloud resources, ensuring version control, repeatability, and efficiency in infrastructure deployment.
  • Familiarity with message queuing systems like RabbitMQ and Kafka, as well as managed queuing services such as Amazon MQ.
  • Strong background in managing MySQL databases and leveraging Amazon RDS for high availability, performance tuning, and secure database management in cloud environments.
  • Understanding of network design and security protocols to protect systems, enforce compliance, and meet industry-standard audit requirements.
  • Experience meeting high-uptime service agreements for critical systems by implementing fault tolerance, disaster recovery, and proactive monitoring strategies to maintain service availability and minimize downtime.
  • Proven ability to work effectively with cross-functional teams from multiple departments to achieve project goals and execute project plans in an orderly and efficient manner.
  • Ability to develop and maintain code in one or more high-level programming languages such as Python, Go, or JavaScript.
Responsibilities:
  • Gather and analyze metrics from our platform and applications to continually improve our performance tuning and fault finding.
  • Partner with our world-class engineering teams to improve services through rigorous testing and release procedures.
  • Create sustainable systems and services through automation and uplifts while working closely with engineering professionals within the company to enable projects to be completed efficiently.
  • Develop, monitor, and manage the entire system landscape by balancing feature development speed and reliability with well-defined service level objectives, ensuring minimal downtime and maximum availability.
  • Participate in the development and implementation of practices, procedures, and technology to ensure our system landscapes are operating within our Security, Compliance, and Availability commitments.
  • Plan, prepare, and execute system upgrades.
  • Mentor and train other engineers throughout the company and seek to continually improve processes company-wide.

Related Jobs

📍 Ireland

🧭 Full-Time

🔍 Travel technology

🏢 Company: Sojern

  • 5+ years of experience as a Site Reliability Engineer, Cloud Engineer, or Software Engineer.
  • Extensive experience building infrastructure on Google Cloud Platform using Terraform.
  • Experience with containerized workloads on Kubernetes, including security and monitoring.
  • Proficient in Python, Shell, or Go.
  • Strong understanding of CS fundamentals, Linux OS, and common GNU/Linux tools.
  • Relevant degree or equivalent experience (BS/MS in CS/EE preferred).

  • Join the Site Reliability Engineering team to establish best practices and evolve the SRE culture.
  • Improve Google Cloud infrastructure and collaborate with Software Engineers to deploy applications.
  • Ensure applications are properly scaled and utilize suitable solutions, whether serverless or containerized.
  • Secure and instrument Kubernetes clusters and related cloud resources.
  • Contribute to revenue-generating features and focus on maintainable, transparent solutions.

Python, Jenkins, Kubernetes, Go, gRPC, Serverless, Collaboration, CI/CD, Linux, Terraform

Posted 2024-11-07

📍 US, Europe

🧭 Full-Time

💸 175000 - 210000 USD per year

🔍 Cloud computing, AI

🏢 Company: CoreWeave

  • You have 5+ years of experience in the software or infrastructure engineering industry.
  • Experience with Python, Go or another scripting language.
  • Experience containerizing applications and/or using Kubernetes to manage deployments.
  • Experience with Git.
  • Experience with Linux shell scripting and/or navigating a *nix-based operating system.
  • Experience creating and maintaining GitHub Actions to automate workflows.
  • You have experience deploying services in production and are interested in learning reliability-at-scale engineering concepts.
  • You have experience refining SDLC, doing code reviews, and providing technical support.

  • Design and implement services and tools to reduce friction and toil in the lives of our engineering and operations teams.
  • Streamline repetitive tasks and eliminate bottlenecks to improve development velocity with automated workflows and processes.
  • Partner with developers to understand their pain points and develop tailored solutions that enhance their productivity.
  • Champion best practices and advocate for new tools and technologies to drive ongoing productivity gains.
  • Tackle complex issues related to build systems, testing frameworks, code analysis, and other developer tooling.
  • Enable and evangelize the practice of reliability engineering across CoreWeave's engineering teams.

Python, Software Development, Git, Kubernetes, *Nix, Go, Collaboration

Posted 2024-11-07

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, Technology

  • Proficient in automation, programming, and scripting.
  • Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.) as well as modern observability infrastructure (Prometheus, Grafana, Logstash/Kibana, Icinga/Nagios, etc.).
  • Advanced knowledge of Linux and IO/data storage concepts, internals and troubleshooting.
  • Experience remotely managing both bare-metal servers and virtualized environments.
  • 5+ years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience with high traffic and highly available website architectures and operations.
  • Strong English language skills.
  • Ability to work independently in a fast-paced environment, as an effective part of a globally distributed team, using ticket tracking systems and asynchronous communication tools.
  • B.Sc. or M.Sc. in Computer Science or equivalent work experience.

  • Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments.
  • Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution.
  • Improving observability (alerting, metrics, monitoring) of database infrastructure.
  • Multi-datacenter systems design, capacity and infrastructure planning.
  • Taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia's production infrastructure and participating in an on call rotation.

SQL, Kibana, C (Programming language), Cassandra, Grafana, Prometheus, Redis

Posted 2024-08-28

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109000 - 169000 USD per year

🔍 Nonprofit, knowledge sharing

  • Proficient in automation, programming, and scripting
  • Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.)
  • Advanced knowledge of Linux and IO/data storage concepts
  • Experience remotely managing both bare-metal servers and virtualized environments
  • 5+ years of experience in an SRE/Operations/DevOps role
  • Experience with high traffic and highly available website architectures
  • Strong English language skills
  • Ability to work independently in a fast-paced environment

  • Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments
  • Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution
  • Improving observability of database infrastructure
  • Designing multi-datacenter systems, capacity planning, and infrastructure planning
  • Participating in incident response and on-call rotation for system outages or alerts

SQL, Kibana, C (Programming language), Cassandra, Grafana, Prometheus, Redis, Linux

Posted 2024-08-28

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109047 - 169455 USD per year

🔍 Nonprofit / Technology

  • At least two years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience supporting high availability distributed production systems.
  • Experience with database administration and support.
  • Comfortable with configuration management and orchestration tools (e.g., Puppet, Ansible, Chef, SaltStack).
  • Knowledge of modern observability infrastructure (monitoring, metrics, and logging).
  • Proficient in shell and scripting languages such as Python, Go, Bash, Ruby.
  • Good understanding of Linux/Unix fundamentals and debugging skills.
  • Excellent written and verbal communication skills.
  • BS or MS degree in Computer Science or equivalent work experience.

  • Deploy, configure, and maintain the distributed data systems that comprise our data and analytics platform.
  • Implement data quality monitoring that alerts the team of possible data issues.
  • Collaborate closely with the Fundraising team to integrate and use data from self-hosted and third-party sources.
  • Provide engineering support during high-traffic or critical campaigns.
  • Write and update internal documentation of systems and processes.
  • Ensure compliance with regulations like the Donor Privacy Policy, GDPR, and PCI DSS.
  • Create and manage users and permissions for data access control.
  • Advise on data input best practices and develop processes for data entry consistency.
  • Work closely with Fundraising Analytics to gather and prioritize data enhancement requests.

Python, Bash, Ruby, C (Programming language), Data engineering, Go, Communication Skills, Collaboration

Posted 2024-08-22