Lead Site Reliability Engineer, Observability (Remote, North America)

Posted 3 months ago

North America, Full-Time, AI Sales

Company: Vivun Inc.

Location: North America, EST, PST

Languages: English

Seniority level: Lead, 6+ years

Experience: 6+ years

Skills:

Leadership, Node.js, Python, Software Development, Cloud Computing, Grafana, Prometheus, CI/CD, DevOps

Requirements:

6+ years of experience in SRE, DevOps, or Observability Engineering roles
At least 2 years leading or designing observability initiatives
Deep knowledge of observability tooling (e.g., OpenTelemetry, Prometheus, Grafana, Datadog, Honeycomb, Observe)
Experience with distributed tracing practices
Experience with agentic/LLM-based systems (e.g., LangChain, Celery, OpenAI APIs)
Strong understanding of instrumenting, tracing, and correlating AI/LLM workflows
Proven ability to define cross-team standards, influence engineering culture, and establish scalable monitoring patterns
Strong collaboration and communication skills

Responsibilities:

Own the end-to-end observability strategy
Design and implement correlation models
Unify observability tooling
Collaborate with engineering and QA on best practices
Establish enablement frameworks
Partner with teammates on reliability and incident response
Contribute to performance and reliability strategy