Staff Software Engineer - Grafana Databases, Managed Services
Grafana Labs
Observability
United Kingdom (Remote), UK time zones only
Full-Time
Staff
Salary: 103,958 - 124,750 GBP per year
Job Details
- Experience: 8+ years
- Required Skills: AWS, GCP, Kafka, Kubernetes, Snowflake, Azure, Cassandra, ClickHouse, Go, Postgres, Linux, Terraform, Helm
Requirements
- 8+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles
- Experience with high-throughput streaming systems, analytical or storage backends, or large-scale database infrastructure (e.g., Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake, or Cassandra)
- Strong Kubernetes experience in AWS, GCP, or Azure
- Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet)
- Experience leading or driving complex technical efforts, even without formal management responsibilities
- Ability to influence technical direction and align teams around reliability improvements
- Strong understanding of distributed systems failure modes in multi-cloud environments
- Proficiency in at least one systems-oriented language (Go preferred, but not required)
- Working knowledge of Linux internals, networking, cloud storage, and performance/scaling behavior
- Experience participating in blameless incident response and writing high-quality post-incident reviews
- Clear communicator who can collaborate across teams and work autonomously
Responsibilities
- Operate and evolve 100+ multi-cloud streaming clusters and related database infrastructure
- Diagnose and eliminate cross-layer failure modes (e.g., object storage latency, noisy neighbors, control-plane bottlenecks, and query performance regressions)
- Design safe upgrade and rollout strategies at scale
- Improve observability, automation, and operational ergonomics
- Partner closely with database and platform teams to ensure safe scaling, partitioning, consumer fan-out, and query performance
- Work directly with distributed systems behavior, Kubernetes scheduling dynamics, storage engines, compression trade-offs, etc.
- Serve as a primary escalation point and on-call for relevant incidents
- Own the relationship with all system vendors, including WarpStream Labs and others
- Help define and evolve the technical direction for operating WarpStream and adjacent shared database systems at scale
- Lead complex initiatives such as migrations, rollout improvements, and reliability investments
- Establish best practices around SLOs, scaling limits, failure isolation, and change safety
- Investigate and drive resolution of multi-layer incidents spanning storage, compute, networking, and control-plane dependencies
- Identify systemic risks across 100+ clusters and contribute architectural improvements that reduce recurring issues
- Reduce systems toil and improve operational ergonomics through automation
- Partner with database and platform teams to align on strategy and long-term scalability
- Mentor and support engineers as the team matures