- Responsible to design, implement, and maintain high-availability, high throughput, data and compute intensive, critical database systems running PostgreSQL which supports a growing 24x7 SaaS platform.
- Define and improve database service reliability through monitoring/alerting, SLO-oriented metrics, and operational readiness.
- Participate in and help drive incident response, root cause analysis, and post-incident corrective actions for database-related production events.
- Partner with other technical leaders to ensure all newly introduced systems are supportable and maintainable by both development and operations.
- Provides escalated technical guidance and support to other technology teams throughout the organization
- Provides on-call coverage for production support and other duties as required.
- Accountable for complying with HIPAA security policies within the database platform
- Ensure all solutions and operational activities adhere to the security and operating policies established by the organization
- Own and continuously improve our Datadog database observability by building actionable dashboards, alerts, and service-level views using an observability stack (e.g., Prometheus, Grafana, New Relic, or equivalent).
- Automate system maintenance tasks using Bash, Powershell, Python, or Ansible. Manage infrastructure as code (IaC) writing Ansible playbooks.
PostgreSQLPythonAgile+11 more