Senior DevOps / SRE Engineer - AI Trading Agents

MLabs
Fintech, AI
United States, United Kingdom (based in US to GMT timezones)
Full-Time
Senior
Salary: 120,000 - 150,000 USD per year

Job Details

Required Skills
Docker, Node.js, Python, AWS EKS, Blockchain, Kafka, Kubernetes, ClickHouse, Go, Grafana, Postgres, Prometheus, Redis, CI/CD, DevOps, Terraform, Ansible, Datadog, Helm

Requirements

  • Extensive experience in DevOps, SRE, or Infrastructure Engineering, preferably within a startup environment where systems were built from the ground up.
  • Proven track record of deploying, scaling, and debugging production workloads, specifically within AWS EKS.
  • Proficiency with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or equivalent frameworks.
  • Hands-on experience with Docker and Helm for packaging production services.
  • Experience operating production-grade data and messaging systems (Redis, Postgres/RDS, ClickHouse, Kafka).
  • Strong experience with observability tooling: Prometheus, Grafana, Datadog, Loki, or OpenTelemetry.
  • Ability to debug across multiple languages, including Python, Node.js, and Go.
  • Understanding of real-time systems where latency and reliability have direct financial consequences.
  • Familiarity with blockchain infrastructure: node operations, exchange APIs, wallet operations, and on-chain monitoring.
  • Experience managing secrets, access controls, and production hardening for sensitive financial environments.
  • Experience defining SLOs and building mature on-call practices.
  • Experience with OpenClaw agent deployments and workspace templates (Preferred).
  • Familiarity with Model Context Protocol (MCP) server deployment and auth management (Preferred).
  • Direct experience with Hyperliquid or other decentralized exchange (DEX) protocols (Preferred).
  • Background in fintech, market data infrastructure, or high-frequency trading systems (Preferred).

Responsibilities

  • Build and maintain the infrastructure for concurrent AI trading agents, managing complex cron schedules, state files, and trailing stop processes.
  • Deploy and manage agent environments, including workspace persistence, isolated session management, and Model Context Protocol (MCP) server connectivity.
  • Design and operate pipelines for shipping trading skills and plugins to production without interrupting live trading activity.
  • Execute deployment strategies (blue/green, canary) ensuring active financial positions remain protected during every infrastructure change.
  • Build comprehensive alerting across the full stack using metrics, logs, and traces to detect agent failures, state file corruption, or infrastructure regressions before financial loss occurs.
  • Operate and scale core platform infrastructure, including Kubernetes (EKS) clusters, Redis, Postgres, ClickHouse, and Kafka.
  • Maintain blockchain node infrastructure and ensure stable connectivity to exchange APIs and on-chain transaction systems.
  • Lead incident response and on-call practices, including debugging, mitigation, and post-mortems to improve long-term platform reliability.