Staff Software Engineer - Grafana Databases, Managed Services

United Kingdom | Full-Time | Staff
Salary not disclosed

Job Details

Experience
8+ years
Required Skills
AWS, GCP, Kafka, Kubernetes, Azure, Cassandra, ClickHouse, Go, Linux, Terraform, Helm

Requirements

  • 8+ years of software engineering experience in SRE, platform engineering, infrastructure, or distributed systems roles
  • Strong experience with large-scale streaming or database systems (e.g., Kafka, Redpanda, ClickHouse, Cassandra, or similar)
  • Hands-on expertise with Kubernetes in AWS, GCP, or Azure environments
  • Proficiency in infrastructure-as-code tools such as Terraform, Helm, or similar
  • Strong programming skills in systems-oriented languages (Go preferred)
  • Deep understanding of distributed systems behavior, failure modes, and performance trade-offs
  • Experience with observability, incident response, and writing post-incident reviews
  • Strong knowledge of Linux internals, networking, storage systems, and cloud architecture
  • Proven ability to lead technical initiatives and influence architectural decisions without formal authority
  • Excellent communication skills with the ability to work effectively in remote, cross-functional teams

Responsibilities

  • Operate and evolve large-scale multi-cloud streaming and database infrastructure across production environments
  • Diagnose and resolve complex cross-layer failures involving storage, compute, networking, and control-plane systems
  • Design and implement safe rollout, upgrade, and migration strategies across distributed systems at scale
  • Improve observability, automation, and operational tooling to reduce toil and increase reliability
  • Define and evolve SLOs, error budgets, and reliability standards for shared infrastructure systems
  • Partner with engineering teams to optimize query performance, data partitioning, and system scalability
  • Serve as a primary escalation point for high-severity incidents and lead in-depth root-cause analysis
  • Drive long-term architectural improvements to reduce systemic risks across multi-cluster environments
  • Mentor engineers and contribute to best practices in distributed systems engineering and operational excellence