Staff Software Engineer - Grafana Databases, Managed Services
United Kingdom | Full-Time | Staff
Salary not disclosed
Job Details
- Experience: 8+ years
- Required Skills: AWS, GCP, Kafka, Kubernetes, Azure, Cassandra, ClickHouse, Go, Linux, Terraform, Helm
Requirements
- 8+ years of software engineering experience in SRE, platform engineering, infrastructure, or distributed systems roles
- Strong experience with large-scale streaming or database systems (e.g., Kafka, Redpanda, ClickHouse, Cassandra, or similar)
- Hands-on expertise with Kubernetes in AWS, GCP, or Azure environments
- Proficiency in infrastructure-as-code tools such as Terraform, Helm, or similar
- Strong programming skills in systems-oriented languages (Go preferred)
- Deep understanding of distributed systems behavior, failure modes, and performance trade-offs
- Experience with observability, incident response, and writing post-incident reviews
- Strong knowledge of Linux internals, networking, storage systems, and cloud architecture
- Proven ability to lead technical initiatives and influence architectural decisions without formal authority
- Excellent communication skills with the ability to work effectively in remote, cross-functional teams
Responsibilities
- Operate and evolve large-scale multi-cloud streaming and database infrastructure across production environments
- Diagnose and resolve complex cross-layer failures involving storage, compute, networking, and control-plane systems
- Design and implement safe rollout, upgrade, and migration strategies across distributed systems at scale
- Improve observability, automation, and operational tooling to reduce toil and increase reliability
- Define and evolve SLOs, error budgets, and reliability standards for shared infrastructure systems
- Partner with engineering teams to optimize query performance, data partitioning, and system scalability
- Serve as a primary escalation point for high-severity incidents and lead deep root cause analysis efforts
- Drive long-term architectural improvements to reduce systemic risks across multi-cluster environments
- Mentor engineers and contribute to best practices in distributed systems engineering and operational excellence