Senior Site Reliability Engineer - DataCraft
Bloomreach · E-commerce
Working in one of our Central European offices (Bratislava, Praha, Brno) or from home. Full-Time · Senior
Salary not disclosed
Job Details
Required Skills
- Python, SQL, GCP, Kubernetes, MongoDB, Jira, Airflow, Apache Kafka, Go, Grafana, Prometheus, Redis, Spark, Terraform, Confluence, BigQuery, Databricks, GitLab
Requirements
- Demonstrable impact in transforming engineering workflows and fostering an SRE/DevOps culture.
- Ability to connect reliability work to business success and customer outcomes.
- Commitment to the "you build it, you run it" principle.
- Cost-awareness, using effective vertical and horizontal autoscaling and detailed telemetry insights to control spend.
- Conviction that Infrastructure as Code is foundational for stability.
- Design for failure: SLOs, error budgets, and runbooks are first-class artifacts.
- Use telemetry and metrics to provide actionable feedback on application and service behavior.
- Ability to navigate complex data platform architectures using distributed tracing and debugging.
- Solid hands-on experience with GCP (BigQuery, DataProc, Cloud Composer, GCS) and Kubernetes.
- Experience with Python.
- Familiarity with data pipeline technologies (Kafka, Airflow/Cloud Composer, Spark, Iceberg).
- Fluent use of AI coding agents (Cursor, Claude Code, Copilot, Gemini CLI, or similar).
- Comfortable with on-call rotation and 24/7 incident response.
- Remote-first mindset for effective distributed team collaboration.
- Ability to learn and adapt to new tech and a growing codebase.
Responsibilities
- Build and maintain the reliability ecosystem for DataCraft services running on GCP and Kubernetes (DataProc, Cloud Composer, BigQuery, Snowflake/Databricks connectors).
- Ensure end-to-end observability across the full data platform, from Kafka ingest to Databricks and BigQuery destinations.
- Drive scalability for services based on operational and telemetric data (OpenTelemetry, Prometheus, Victoria Metrics).
- Maintain team health dashboards and alerting (Grafana, PagerDuty, Sentry).
- Own and evolve Terraform-based infrastructure for DataCraft services.
- Automate deployments, instance setup, and operational runbooks.
- Maintain CI/CD pipelines (GitLab) with linters, security scans, code quality checks, and AI code reviews.
- Help the team fulfill security requirements for ISO and SOC2 audits, enforcing security principles.
- Ensure data access controls are properly enforced across multi-DWH environments (BigQuery, Snowflake, Databricks).
- Participate in and drive L3 on-call rotation and incident resolution for DataCraft services.
- Contribute tooling for debugging, troubleshooting, and performance testing of data pipelines and orchestration layers.
- Use telemetry data and distributed tracing to navigate complex, distributed service architectures.
- Ensure reliability and observability of the Loomi Analytics Agent data infrastructure.
- Monitor and alert on data quality issues that could introduce inconsistencies or hallucinations in Loomi's responses.