Senior Software Engineer - Grafana Databases, Managed Services

Grafana Labs
Open-source Software
UK time zones only
Full-Time
Senior
Salary
91755 - 110106 GBP per year

Job Details

Experience
6+ years
Required Skills
AWS, GCP, Kafka, Kubernetes, Snowflake, Azure, Cassandra, ClickHouse, Go, Postgres, Linux, Terraform, Helm

Requirements

  • 6+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles
  • Experience operating distributed systems in production (e.g., streaming systems, analytical databases, large-scale storage backends)
  • Experience with Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake, or Cassandra
  • Strong Kubernetes experience in AWS, GCP, or Azure
  • Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet)
  • Solid understanding of distributed systems design and large-scale system trade-offs
  • Proficiency in at least one programming language (Go preferred)
  • Working knowledge of Linux internals, networking, cloud storage, and performance/scaling behavior
  • Experience participating in blameless incident response and writing high-quality post-incident reviews
  • Clear communicator who can collaborate across teams and work autonomously

Responsibilities

  • Operating and evolving 100+ multi-cloud streaming clusters and related database infrastructure
  • Diagnosing and eliminating cross-layer failure modes (e.g., object storage latency, noisy neighbors, control-plane bottlenecks, query performance regressions)
  • Designing safe upgrade and rollout strategies at scale
  • Improving observability, automation, and operational ergonomics
  • Partnering closely with database and platform teams to ensure safe scaling, partitioning, consumer fan-out, and query performance
  • Working directly with distributed systems behavior, Kubernetes scheduling dynamics, storage engines, compression trade-offs, etc.
  • Serving as a primary escalation point and on-call for relevant incidents
  • Owning the relationship with all system vendors, including WarpStream Labs and others
  • Reviewing and defining SLOs for shared database infrastructure, proactively reducing error budgets
  • Implementing solutions that ensure reliability, scalability, and performance of high-throughput, multi-cloud infrastructure
  • Developing fault-tolerant patterns that account for distributed system realities
  • Planning and executing safe upgrades and rollouts across dozens of production clusters
  • Collaborating with database and platform engineering leaders to influence architecture, roadmap priorities, and long-term strategy
  • Participating in PR review and contributing to design documents, automation, tooling, and code improvements
  • Sharing best practices and distributed systems knowledge with partner teams
  • Participating in incident response, from investigation through resolution and post-incident reviews (PIRs)