Senior Site Reliability Engineer - DataCraft
Bloomreach · E-commerce
Working in one of our Central European offices (Bratislava, Praha, Brno) or from home. Full-Time · Senior
Salary not disclosed
Job Details
Required Skills
- Python, SQL, GCP, Kubernetes, MongoDB, Jira, Airflow, Apache Kafka, Go, Grafana, Prometheus, Redis, Spark, Terraform, Confluence, BigQuery, Databricks, GitLab
Requirements
- Demonstrable impact in transforming engineering workflows and fostering an SRE/DevOps culture.
- Ability to connect reliability work to business success and customer outcomes.
- Commitment to the "you build it, you run it" principle.
- Cost-awareness, using effective vertical and horizontal autoscaling and detailed telemetry insights to control spend.
- Conviction that Infrastructure as Code is foundational for stability.
- Design for failure: SLOs, error budgets, and runbooks are first-class artifacts.
- Use telemetry and metrics to provide actionable feedback on application and service behavior.
- Ability to navigate complex data platform architectures using distributed tracing and debugging.
- Solid hands-on experience with GCP (BigQuery, DataProc, Cloud Composer, GCS) and Kubernetes.
- Experience with Python.
- Familiarity with data pipeline technologies (Kafka, Airflow/Cloud Composer, Spark, Iceberg).
- Fluent use of AI coding agents (Cursor, Claude Code, Copilot, Gemini CLI, or similar).
- Comfortable with on-call rotation and 24/7 incident response.
- Remote-first mindset for effective distributed team collaboration.
- Ability to learn and adapt to new tech and a growing codebase.
Responsibilities
- Build and maintain the reliability ecosystem for DataCraft services running on GCP and Kubernetes (DataProc, Cloud Composer, BigQuery, Snowflake/Databricks connectors).
- Ensure end-to-end observability across the full data platform, from Kafka ingest to Databricks and BigQuery destinations.
- Drive scalability for services based on operational and telemetric data (OpenTelemetry, Prometheus, Victoria Metrics).
- Maintain team health dashboards and alerting (Grafana, PagerDuty, Sentry).
- Own and evolve Terraform-based infrastructure for DataCraft services.
- Automate deployments, instance setup, and operational runbooks.
- Maintain CI/CD pipelines (GitLab) with linters, security scans, code quality checks, and AI code reviews.
- Help the team fulfill security requirements for ISO and SOC2 audits, enforcing security principles.
- Ensure data access controls are properly enforced across multi-DWH environments (BigQuery, Snowflake, Databricks).
- Participate in and drive L3 on-call rotation and incident resolution for DataCraft services.
- Contribute tooling for debugging, troubleshooting, and performance testing of data pipelines and orchestration layers.
- Use telemetry data and distributed tracing to navigate complex, distributed service architectures.
- Ensure reliability and observability of the Loomi Analytics Agent data infrastructure.
- Monitor and alert on data quality issues that could introduce inconsistencies or hallucinations in Loomi's responses.