Senior Database Reliability Engineer (DBRE) & Architect

Cloudlinux | Linux Infrastructure

Worldwide | Full-Time | Senior

Salary not disclosed

Job Details

Experience: 5+ years
Required Skills: AWS, PostgreSQL, Python, Apache Airflow, GCP, Jenkins, Kafka, Kubernetes, MongoDB, Azure, ClickHouse, Go, Grafana, Redis, Terraform, Ansible, GitLab

Requirements:
Deep PostgreSQL expertise (5+ years), including MVCC internals, locking mechanics, Patroni, PgBouncer, and seamless major version upgrades under load.
ClickHouse mastery: experience operating large clusters, understanding ZooKeeper/ClickHouse Keeper, sharding, replication internals, and diagnosing performance issues at the data-part level.
An engineering mindset (SRE/DevOps), with experience writing complex Terraform modules and Ansible roles.
Programming skills in Python or Go for automation.
Experience in hybrid environments: understanding the differences between bare metal, Kubernetes, and cloud, and optimizing TCO and disk subsystem performance (NVMe, network storage).
A systems approach, with an understanding of security (FIPS, audit logs) and disaster recovery.
Openness to modern workflows and integrating AI into day-to-day operations.
Experience building an Internal Developer Platform (IDP) (Nice to Have).
Experience operating databases in Kubernetes (CloudNativePG, Altinity Operator) (Nice to Have).
Experience working at cloud and hosting providers on similar services (Nice to Have).

Responsibilities:
Design and implement a self-service platform based on Terraform and Ansible for deploying HA clusters (PostgreSQL, ClickHouse, MongoDB, Redis) across heterogeneous environments (Bare Metal, OpenNebula, Kubernetes, Public Clouds).
Manage and scale exponentially growing ClickHouse analytics clusters (12+ clusters, tens of terabytes of data), addressing sharding, table engine optimization, and building reliable S3 backup pipelines.
Maintain and scale infrastructure for Apache Airflow and Redash, ensuring reliability of ETL pipelines and visualization tools.
Implement SRE practices in data management, replacing manual incident response with automated self-healing mechanisms and defining and implementing SLOs/SLIs for all databases.
Lead the migration process from legacy solutions to modern cloud patterns and participate in decisions regarding Kubernetes operators for stateful workloads.
Serve as the technical authority for product teams, helping them optimize data schemas and SQL queries for high-load systems.