Full-Time
Software Development
- Proficient in Kubernetes, Helm, and troubleshooting in secure environments with limited or no remote access.
- Expertise in observability and monitoring tools such as Prometheus, Grafana, ELK Stack, or Datadog.
- Experience with cloud providers, particularly Azure and Azure Government.
- Strong understanding of microservices architectures that include Postgres and AI systems.
- Expertise in automated testing frameworks and tools (e.g., integration tests, synthetic tests, and load tests).
- Experience with monitoring and analytics tools to track SLIs, SLAs, and SLOs.
- Excellent problem-solving skills, attention to detail, and a tenacious attitude.
- Strong communication skills, with the ability to work effectively in a collaborative environment.
- Proficiency in programming languages such as TypeScript and Python.
- Strong scripting skills in Bash, PowerShell, or similar languages.
- Experience with Infrastructure as Code (IaC) tools such as Azure Bicep, AWS CDK, or Terraform.
- Understanding of networking principles and experience with network troubleshooting.
- Strong communication and collaboration skills, with the ability to work effectively with both technical and non-technical personnel.
- Perform root cause analysis to identify and resolve system or application issues in a timely and effective manner, often in communication with developers.
- Design and implement a broad range of automated tests to ensure system reliability and performance.
- Build scalable and cost-effective observability patterns in Datadog or other monitoring providers.
- Monitor and analyze SLIs to ensure adherence to SLAs and SLOs.
- Collaborate with development and operations teams to improve system reliability and developer experience (DevEx).
- Develop and maintain monitoring and alerting systems to proactively address issues.
- Implement best practices for incident management and disaster recovery.
- Respond to and manage incidents, performing post-mortem analyses to prevent recurrence.
- Plan and implement capacity upgrades, ensuring scalability and performance.
- Automate repetitive operational tasks and develop tools for system automation.
- Define, monitor, and manage SLAs, ensuring service levels meet or exceed expectations.
- Ensure systems comply with security and regulatory requirements.
- Identify areas for continuous improvement in systems and processes.
- Create and maintain documentation for systems, processes, and incident responses.
Posted 14 days ago