Senior Site Reliability Engineer - DataCraft

Bloomreach (E-commerce)
Working in one of our Central European offices (Bratislava, Praha, Brno) or from home on a full-time basis
Full-Time, Senior
Salary not disclosed

Job Details

Required Skills
Python, SQL, GCP, Kubernetes, MongoDB, Jira, Airflow, Apache Kafka, Go, Grafana, Prometheus, Redis, Spark, Terraform, Confluence, BigQuery, Databricks, GitLab

Requirements

  • Demonstrable impact in transforming engineering workflows and fostering an SRE/DevOps culture.
  • Ability to connect reliability work to business success and customer outcomes.
  • Commitment to the "you build it, you run it" principle.
  • Cost-aware approach to scaling: effective vertical and horizontal autoscaling informed by detailed telemetry insights.
  • Conviction that Infrastructure as Code is foundational for stability.
  • Design-for-failure mindset: SLOs, error budgets, and runbooks treated as first-class artifacts.
  • Use of telemetry and metrics to provide actionable feedback on application and service behavior.
  • Ability to navigate complex data platform architectures using distributed tracing and debugging.
  • Solid hands-on experience with GCP (BigQuery, DataProc, Cloud Composer, GCS) and Kubernetes.
  • Experience with Python.
  • Familiarity with data pipeline technologies (Kafka, Airflow/Cloud Composer, Spark, Iceberg).
  • Fluent use of AI coding agents (Cursor, Claude Code, Copilot, Gemini CLI, or similar).
  • Comfortable with on-call rotation and 24/7 incident response.
  • Remote-first mindset and effective collaboration in a distributed team.
  • Ability to learn and adapt to new tech and a growing codebase.

Responsibilities

  • Build and maintain the reliability ecosystem for DataCraft services running on GCP and Kubernetes (DataProc, Cloud Composer, BigQuery, Snowflake/Databricks connectors).
  • Ensure end-to-end observability across the full data platform, from Kafka ingest to Databricks and BigQuery destinations.
  • Drive scalability for services based on operational and telemetry data (OpenTelemetry, Prometheus, VictoriaMetrics).
  • Maintain team health dashboards and alerting (Grafana, PagerDuty, Sentry).
  • Own and evolve Terraform-based infrastructure for DataCraft services.
  • Automate deployments, instance setup, and operational runbooks.
  • Maintain CI/CD pipelines (GitLab) with linters, security scans, code quality checks, and AI code reviews.
  • Help the team fulfill security requirements for ISO and SOC2 audits, enforcing security principles.
  • Ensure data access controls are properly enforced across multi-DWH environments (BigQuery, Snowflake, Databricks).
  • Participate in and drive L3 on-call rotation and incident resolution for DataCraft services.
  • Contribute tooling for debugging, troubleshooting, and performance testing of data pipelines and orchestration layers.
  • Use telemetry data and distributed tracing to navigate complex, distributed service architectures.
  • Ensure reliability and observability of the Loomi Analytics Agent data infrastructure.
  • Monitor and alert on data quality issues that could introduce inconsistencies or hallucinations in Loomi's responses.