Staff Platform Engineer

Topstep | Fintech
United States | Full-Time | Staff
Salary: 205,000 - 235,000 USD per year

Job Details

Experience
7+ years of professional experience in Platform Engineering, SRE, or Infrastructure Engineering, with demonstrated impact building practices that scaled across multiple teams.
Required Skills
AWS, Python, Bash, Kubernetes, CI/CD, Terraform, GitHub Actions, Datadog

Requirements

  • 7+ years of professional experience in Platform Engineering, SRE, or Infrastructure Engineering, with demonstrated impact building practices that scaled across multiple teams.
  • Proven track record of either starting a platform/SRE function from scratch or scaling an existing practice, with measurable improvements to MTTR, MTTD, change failure rate, or availability.
  • Deep expertise with AWS infrastructure (EKS, EC2, RDS/Aurora, VPC, ALB/NLB, CloudFront, SQS) running production services at scale.
  • Strong proficiency with Datadog for end-to-end observability (metrics, APM, logs, distributed tracing) and building alerting that catches real issues without causing fatigue.
  • Hands-on experience building and maintaining CI/CD pipelines (GitHub Actions, CodePipeline, or similar), writing automation (Bash, Python), and contributing to platform tooling.
  • Strong proficiency with Kubernetes in production: cluster operations, networking, security, scaling strategies, and GitOps workflows.
  • Solid foundation in distributed systems, networking, database performance, and debugging complex system failures across service boundaries.
  • Deep familiarity with Terraform for multi-account, multi-environment infrastructure management.
  • Track record of influencing engineering culture through documentation, tooling, mentorship, and technical leadership.
  • Excellent communication skills, with the ability to explain to varied audiences complex system behavior, trade-offs, and pragmatic decisions that balance long-term platform vision against immediate business needs.

Responsibilities

  • Provide technical leadership for infrastructure, reliability, and observability, driving architectural decisions and platform standards.
  • Build and mature the platform engineering practice by defining SLOs, incident response protocols, on-call standards, and operational runbooks.
  • Own the observability stack using Datadog (metrics, APM, logging, distributed tracing) and CloudWatch, instrumenting systems and closing gaps that currently prevent fast diagnosis of production issues.
  • Design and evolve AWS infrastructure (EKS, Aurora, ElastiCache, SQS, CloudFront) for reliability, security, scalability, and cost efficiency.
  • Own and evolve CI/CD pipelines, deployment strategies, and release engineering practices across the organization.
  • Drive infrastructure-as-code strategy with Terraform across a multi-account AWS environment, ensuring consistency and repeatability.
  • Lead incident response and blameless post-mortems, turning outages into opportunities for systematic improvement.
  • Partner with product engineering teams to embed reliability principles early in the design process and improve system resilience.
  • Mentor engineers across the organization on infrastructure, reliability practices, operational thinking, and production ownership.
  • Champion a culture of transparency, continuous improvement, and shared ownership of production systems.