Site Reliability Engineer (SRE)

Posted 15 days ago

🔍 Industry: Software Development

🏢 Company: Procurement Sciences

Requirements:
  • Proficient in Kubernetes, Helm, and troubleshooting in secure environments with limited or no remote access.
  • Expertise in observability and monitoring tools such as Prometheus, Grafana, ELK Stack, or Datadog.
  • Experience with cloud providers, particularly Azure and Azure Gov.
  • Strong understanding of microservices architecture, including Postgres and AI systems.
  • Expertise in automated testing frameworks and tools (e.g., integration tests, synthetic tests, load testing).
  • Experience with monitoring and analytics tools to track SLIs, SLAs, and SLOs.
  • Excellent problem-solving skills and attention to detail. Tenacious attitude.
  • Strong communication skills, with the ability to work effectively in a collaborative environment.
  • Proficiency in programming languages such as TypeScript and Python.
  • Strong scripting skills in Bash, PowerShell, or similar languages.
  • Experience with Infrastructure as Code (IaC) tools like Azure Bicep, AWS CDK, or Terraform.
  • Understanding of networking principles and experience with network troubleshooting.
  • Strong communication and collaboration skills, with the ability to work effectively with both technical and non-technical personnel.
Responsibilities:
  • Perform root cause analysis to identify and resolve system or application issues in a timely and effective manner, often in communication with developers.
  • Design and implement a broad range of automated tests to ensure system reliability and performance.
  • Build scalable and cost-effective observability patterns in Datadog or other monitoring providers.
  • Monitor and analyze SLIs to ensure adherence to SLAs and SLOs.
  • Collaborate with development and operations teams to improve system reliability and developer experience (DevEx).
  • Develop and maintain monitoring and alerting systems to proactively address issues.
  • Implement best practices for incident management and disaster recovery.
  • Respond to and manage incidents, performing post-mortem analyses to prevent recurrence.
  • Plan and implement capacity upgrades, ensuring scalability and performance.
  • Automate repetitive operational tasks and develop tools for system automation.
  • Define, monitor, and manage SLAs, ensuring service levels meet or exceed expectations.
  • Ensure systems comply with security and regulatory requirements.
  • Identify areas for continuous improvement in systems and processes.
  • Create and maintain documentation for systems, processes, and incident responses.

Related Jobs

📍 Cyprus, Montenegro, Georgia, Serbia, Poland

🔍 Software Development

🏢 Company: CloudLinux

Requirements:
  • Strong background in development: the ideal candidate started their career as a developer, then moved to large-scale infrastructure projects.
  • Proven experience as a leading SRE or in a similar role, with a strong focus on Linux environments.
  • Proficiency in modern agile SDLC practices and principles, orchestration, and CI/CD tooling, e.g., Python, Java, Terraform, Ansible, CloudFormation, Puppet, Chef, or similar.
  • Knowledge of the Grafana ecosystem or similar, building dashboards, alert rules, PromQL, as well as frontend observability.
  • Excellent technical knowledge of IT infrastructure, including network and application load balancers, switches, routers, and IP addressing.
  • Strong analytical and problem-solving skills with a focus on root cause analysis and mitigation.
  • Excellent communication and teamwork skills with the ability to collaborate effectively across engineering teams.
Responsibilities:
  • Design, implement, and manage scalable, resilient, and secure company-wide repository infrastructure for CloudLinux products as a first assignment.
  • Automate software operations for reusability and consistency across private and public clouds, taking into consideration the complexities of distributed systems.
  • Monitor system performance and troubleshoot issues proactively to ensure optimal uptime and reliability.
  • Automate deployment processes using Infrastructure as Code (IaC) principles.
  • Share your experience, know-how, and best practices with other team members in design sessions, system architecture discussions, mentorship, and "doing work together".

Python, Bash, Cloud Computing, Kubernetes, Nginx, Grafana, Prometheus, Release Management, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted 23 days ago

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl

👥 251-500

💰 $150,000,000 Series D almost 3 years ago

Real Time, Big Data, Information Technology, Software

Requirements:
  • Extensive experience with enterprise-scale continuous delivery environments
  • 5+ years of experience with a DevOps or SRE job title
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with configuration management tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (AWS preferred) and container and orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
  • Background in Linux systems engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless
Responsibilities:
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

Posted 24 days ago

📍 Egypt, Saudi Arabia

🔍 Cloud Engineering

🏢 Company: Lucidya

👥 51-100

💰 $6,000,000 Series B about 3 years ago

Artificial Intelligence (AI), Social CRM, Social Media Management, SaaS, Machine Learning, Analytics

Requirements:
  • Approximately 3 years of experience in SRE, DevOps, or Infrastructure Engineer roles.
  • Strong experience with at least one major cloud provider such as AWS, GCP, or Azure.
  • Hands-on experience with Kubernetes and containerization tools.
  • Proficient in scripting languages such as Python, Bash, or similar.
  • Familiarity with Infrastructure as Code (IaC) tools like Terraform, Pulumi, or AWS CloudFormation.
  • Strong understanding of networking concepts and HA architecture.
  • Experience with CI/CD tools.
  • Experience with modern monitoring and observability tools.
  • Strong analytical skills for troubleshooting complex issues.
Responsibilities:
  • Ensure high availability (HA) and scalability of critical infrastructure components.
  • Identify and eliminate single points of failure across the cloud environment.
  • Manage and optimize cloud-based workloads.
  • Automate provisioning, scaling, and maintenance tasks using IaC tools.
  • Manage Kubernetes clusters and related operations.
  • Implement monitoring solutions and participate in incident response.
  • Develop automation scripts to reduce manual efforts and advocate for configuration automation.

AWS, Docker, Python, Bash, GCP, Kubernetes, RabbitMQ, Azure, Grafana, Prometheus, Redis, CI/CD, Terraform

Posted 3 months ago

📍 India

🧭 Full-Time

🔍 DevSecOps

🏢 Company: InOrg Global

👥 251-500

Artificial Intelligence (AI), DevOps, Virtual Workforce, Consulting, IT Management, Human Resources, Machine Learning, Cyber Security, Software

Requirements:
  • Minimum 3 years of experience in Site Reliability Engineering, DevOps, or a related role.
  • Proficiency in the ELK stack (Elasticsearch, Logstash, Kibana) for log monitoring.
  • Experience with the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor) for metrics monitoring.
  • Strong scripting skills in languages such as Python, Bash, or Ruby.
  • Understanding of operating systems: Ubuntu (OpenStack) is a must-have; Debian, Red Hat, and similar are also relevant.
  • DevOps platforms: GitLab or similar (good to have).
  • Solid understanding of Grafana and Prometheus.
  • Experience with ServiceNow or a similar tool.
  • Experience with configuration management tools like Ansible, Puppet, or Chef.
  • Familiarity with containerization and orchestration tools like Docker and Kubernetes.
  • Understanding of cloud platforms (any of AWS, Azure, or GCP) and their services.
  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • Excellent problem-solving skills and attention to detail.
  • Strong communication and collaboration abilities.
Responsibilities:
  • Ensure the reliability, availability, and performance of our customer's platforms and services, bridging the gap between development and operations.

AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Ansible

Posted 3 months ago

🧭 Full-Time

πŸ’Έ 175000.0 - 185000.0 USD per year

πŸ” Software Development

Requirements:
  • 7+ years of experience as a software engineer with 5 years as an SRE supporting Infrastructure, Networking, and Application Operations in a high availability, 24x7 hybrid environment (Colo/Cloud).
  • Strong record of automation (e.g., Python, Bash, Ansible, Terraform, CloudFormation).
  • Strong experience with AWS cloud infrastructure and container orchestration (Kubernetes, ArgoCD) operating in a GitOps framework.
  • Strong experience with application monitoring, observability, and alerting systems (e.g., New Relic, Grafana).
  • Strong experience with at least one programming language (Python, Java).
  • Advanced experience with Linux system administration, Java-based applications, and network architecture.
  • Ability to participate in architecture reviews.
Responsibilities:
  • Design and implement highly automated systems/services that ensure the availability, reliability, and scalability of infrastructure and applications.
  • Build and maintain monitoring and alerting to provide timely feedback on the performance and health of systems, network, and applications.
  • Work with software development to design and implement systems/applications that are resilient to failure and highly scalable.
  • Achieve material application performance improvements based on insights from observability metrics.
  • Develop and maintain disaster recovery plans and procedures.
  • Participate in on-call rotations to ensure 24/7 application availability.
  • Triage incoming Web Support escalation requests.
  • Drive incident root cause analysis, service restoration, and serve as an incident commander during outage events.
Posted 5 months ago

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

Requirements:
  • Proficiency in programming languages such as Python, Go, or JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.
Responsibilities:
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 5 months ago
