Senior Database Reliability Engineer (DBRE) & Architect
Cloudlinux, Linux Infrastructure
Worldwide, Full-Time, Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: AWS, PostgreSQL, Python, Apache Airflow, GCP, Jenkins, Kafka, Kubernetes, MongoDB, Azure, ClickHouse, Go, Grafana, Redis, Terraform, Ansible, GitLab
Requirements
- Deep PostgreSQL expertise (5+ years), including MVCC internals, locking mechanics, Patroni, PgBouncer, and seamless major version upgrades under load.
- ClickHouse mastery: experience operating large clusters, understanding ZooKeeper/ClickHouse Keeper, sharding, replication internals, and diagnosing performance issues at the data-part level.
- An engineering mindset (SRE/DevOps), with experience writing complex Terraform modules and Ansible roles.
- Programming skills in Python or Go for automation.
- Experience in hybrid environments: understanding the differences between bare metal, Kubernetes, and cloud, and optimizing TCO and disk subsystem performance (NVMe, network storage).
- A systems approach, including an understanding of security (FIPS, audit logs) and disaster recovery.
- Openness to modern workflows and integrating AI into day-to-day operations.
- Experience building an Internal Developer Platform (IDP) (Nice to Have).
- Experience operating databases in Kubernetes (CloudNativePG, Altinity Operator) (Nice to Have).
- Experience working at cloud and hosting providers on similar services (Nice to Have).
Responsibilities
- Design and implement a self-service platform based on Terraform and Ansible for deploying HA clusters (PostgreSQL, ClickHouse, MongoDB, Redis) across heterogeneous environments (Bare Metal, OpenNebula, Kubernetes, Public Clouds).
- Manage and scale exponentially growing ClickHouse analytics clusters (12+ clusters, tens of terabytes of data), addressing sharding, table engine optimization, and building reliable S3 backup pipelines.
- Maintain and scale infrastructure for Apache Airflow and Redash, ensuring reliability of ETL pipelines and visualization tools.
- Implement SRE practices in data management, replacing manual incident response with automated self-healing mechanisms and defining and implementing SLOs/SLIs for all databases.
- Lead the migration process from legacy solutions to modern cloud patterns and participate in decisions regarding Kubernetes operators for stateful workloads.
- Serve as the technical authority for product teams, helping them optimize data schemas and SQL queries for high-load systems.