Staff Platform Engineer
Topstep, Fintech
United States, Full-Time, Staff
Salary: 205,000 - 235,000 USD per year
Job Details
- Experience: 7+ years of professional experience in Platform Engineering, SRE, or Infrastructure Engineering, with demonstrated impact building practices that scaled across multiple teams.
- Required Skills: AWS, Python, Bash, Kubernetes, CI/CD, Terraform, GitHub Actions, Datadog
Requirements
- 7+ years of professional experience in Platform Engineering, SRE, or Infrastructure Engineering, with demonstrated impact building practices that scaled across multiple teams.
- Proven track record of either starting a platform/SRE function from scratch or scaling an existing practice, with measurable improvements to MTTR, MTTD, change failure rate, or availability.
- Deep expertise with AWS infrastructure (EKS, EC2, RDS/Aurora, VPC, ALB/NLB, CloudFront, SQS) running production services at scale.
- Strong proficiency with Datadog for end-to-end observability (metrics, APM, logs, distributed tracing) and building alerting that catches real issues without causing fatigue.
- Hands-on experience building and maintaining CI/CD pipelines (GitHub Actions, CodePipeline, or similar), writing automation (Bash, Python), and contributing to platform tooling.
- Strong proficiency with Kubernetes in production: cluster operations, networking, security, scaling strategies, and GitOps workflows.
- Solid foundation in distributed systems, networking, database performance, and debugging complex system failures across service boundaries.
- Deep familiarity with Terraform for multi-account, multi-environment infrastructure management.
- Track record of influencing engineering culture through documentation, tooling, mentorship, and technical leadership.
- Excellent communication skills, with the ability to explain to varied audiences complex system behavior, trade-offs, and pragmatic decisions that balance long-term platform vision against immediate business needs.
Responsibilities
- Provide technical leadership for infrastructure, reliability, and observability, driving architectural decisions and platform standards.
- Build and mature the platform engineering practice by defining SLOs, incident response protocols, on-call standards, and operational runbooks.
- Own the observability stack using Datadog (metrics, APM, logging, distributed tracing) and CloudWatch, instrumenting systems and closing gaps that currently prevent fast diagnosis of production issues.
- Design and evolve AWS infrastructure (EKS, Aurora, ElastiCache, SQS, CloudFront) for reliability, security, scalability, and cost efficiency.
- Own and evolve CI/CD pipelines, deployment strategies, and release engineering practices across the organization.
- Drive infrastructure-as-code strategy with Terraform across a multi-account AWS environment, ensuring consistency and repeatability.
- Lead incident response and blameless post-mortems, turning outages into opportunities for systematic improvement.
- Partner with product engineering teams to embed reliability principles early in the design process and improve system resilience.
- Mentor engineers across the organization on infrastructure, reliability practices, operational thinking, and production ownership.
- Champion a culture of transparency, continuous improvement, and shared ownership of production systems.