Database Reliability Engineer

Tucows (SaaS, Telecoms)

Canada, Full-Time, Middle

Salary: 126,100 - 140,100 CAD per year

Job Details

Experience: 7+ years
Required Skills: PostgreSQL, Python, SQL, Bash, Go, Grafana, Prometheus, Linux, Terraform, Ansible, Datadog

Requirements

7+ years of hands-on PostgreSQL experience in large-scale, high-volume production environments
Strong expertise in PostgreSQL internals: WAL, MVCC, vacuum tuning, query planner, indexing, logical replication
Advanced SQL and strong schema design and query optimization skills
Solid experience with Linux systems and networking fundamentals
Experience building automation using Go or Python
Experience with monitoring tools such as Prometheus, Grafana, Datadog, PMM, pg_stat_statements
Deep understanding of PostgreSQL internals: MVCC, WAL processing, vacuum behavior, locking, query planning
Experience designing and operating highly available database clusters with automated failover
Strong performance tuning skills (query optimization, indexing, workload tuning)
Ability to diagnose database and system issues: query plans, I/O, memory usage, WAL growth, table/index bloat
Experience with backup and recovery strategies: point-in-time recovery (PITR), durability planning
Familiarity with observability and monitoring: metrics, alerting, and performance dashboards (Grafana)
Understanding of distributed systems concepts: service discovery, consensus (e.g., Consul)
Strong Linux systems knowledge (performance tuning, resource management)
Experience with scripting and infrastructure-as-code automation
Strong troubleshooting and problem-solving skills in production environments
Knowledge of security, compliance, encryption, auditing, and access control

Responsibilities

Design, implement, and operate highly available PostgreSQL clusters (physical/logical replication, sharding, partitioning, failover automation)
Optimize query performance and indexing strategies
Perform capacity planning, growth forecasting, and workload modeling
Own high-availability strategies, including automatic failover, multi-region deployments, and disaster recovery
Build and maintain automation for provisioning, configuration, backups, recovery, failovers, vacuum tuning, and schema management
Develop monitoring and alerting systems for PostgreSQL clusters
Lead response during database incidents (e.g., performance regressions, replication lag, deadlocks, bloat, storage failures)
Conduct root-cause analysis and implement long-term fixes
Partner with software engineers to review SQL queries, optimize schemas, and ensure effective use of PostgreSQL features
Provide guidance on database design patterns, migrations, version upgrades, and best practices
