Staff Platform Engineer

Topstep (Fintech)

United States | Full-Time | Staff

Salary: 205,000 - 235,000 USD per year

Job Details

Experience: 7+ years in Platform Engineering, SRE, or Infrastructure Engineering.
Required Skills: AWS, Python, Bash, Kubernetes, CI/CD, Terraform, GitHub Actions, Datadog

Requirements

7+ years of professional experience in Platform Engineering, SRE, or Infrastructure Engineering, with demonstrated impact building practices that scaled across multiple teams.
Proven track record of either starting a platform/SRE function from scratch or scaling an existing practice, with measurable improvements to MTTR, MTTD, change failure rate, or availability.
Deep expertise with AWS infrastructure (EKS, EC2, RDS/Aurora, VPC, ALB/NLB, CloudFront, SQS) running production services at scale.
Strong proficiency with Datadog for end-to-end observability (metrics, APM, logs, distributed tracing) and building alerting that catches real issues without causing fatigue.
Hands-on experience building and maintaining CI/CD pipelines (GitHub Actions, CodePipeline, or similar), writing automation (Bash, Python), and contributing to platform tooling.
Strong proficiency with Kubernetes in production: cluster operations, networking, security, scaling strategies, and GitOps workflows.
Solid foundation in distributed systems, networking, database performance, and debugging complex system failures across service boundaries.
Deep familiarity with Terraform for multi-account, multi-environment infrastructure management.
Track record of influencing engineering culture through documentation, tooling, mentorship, and technical leadership.
Excellent communication skills, with the ability to explain to varied audiences complex system behavior, trade-offs, and pragmatic decisions that balance long-term platform vision against immediate business needs.

Responsibilities

Provide technical leadership for infrastructure, reliability, and observability, driving architectural decisions and platform standards.
Build and mature the platform engineering practice, defining SLOs, incident response protocols, on-call standards, and operational runbooks.
Own the observability stack using Datadog (metrics, APM, logging, distributed tracing) and CloudWatch, instrumenting systems and closing gaps that currently prevent fast diagnosis of production issues.
Design and evolve AWS infrastructure (EKS, Aurora, ElastiCache, SQS, CloudFront) for reliability, security, scalability, and cost efficiency.
Own and evolve CI/CD pipelines, deployment strategies, and release engineering practices across the organization.
Drive infrastructure-as-code strategy with Terraform across a multi-account AWS environment, ensuring consistency and repeatability.
Lead incident response and blameless post-mortems, turning outages into opportunities for systematic improvement.
Partner with product engineering teams to embed reliability principles early in the design process and improve system resilience.
Mentor engineers across the organization on infrastructure, reliability practices, operational thinking, and production ownership.
Champion a culture of transparency, continuous improvement, and shared ownership of production systems.
