Kafka Platform Engineer
This is a fully remote opportunity within the continental United States.
Full-Time
Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: Python, Bash, Apache Kafka, Go, Grafana, Prometheus, Terraform, Ansible, Datadog
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- 5+ years of hands-on experience operating Apache Kafka or Confluent Platform in production environments.
- Deep understanding of Kafka internals including partitions, replication, ISRs, and consumer groups.
- Strong expertise in Kafka security practices including SASL, mTLS, ACLs, and RBAC.
- Experience with Kafka Connect, Schema Registry, Kafka Streams, or ksqlDB in enterprise environments.
- Strong scripting and automation skills using Python, Bash, or Go.
- Experience with Infrastructure as Code tools such as Terraform and Ansible.
- Knowledge of observability and monitoring solutions for distributed systems and streaming platforms.
- Familiarity with high availability, disaster recovery, and multi-region streaming architectures.
- Excellent troubleshooting, communication, and documentation abilities.
Responsibilities
- Architect, deploy, and maintain large-scale Apache Kafka and Confluent Platform environments across cloud and on-premises infrastructure.
- Design scalable partitioning, replication, and topic management strategies to optimize throughput, durability, and operational efficiency.
- Implement and manage platform security using SASL, mTLS, ACLs, RBAC, and identity provider integrations.
- Operate and optimize ecosystem components such as Schema Registry, Kafka Connect, ksqlDB, and Kafka Streams for production-grade streaming workloads.
- Develop CI/CD and GitOps workflows for topic management, connectors, and infrastructure automation.
- Build high-availability and disaster recovery strategies including multi-region replication and failover patterns.
- Implement observability and monitoring solutions using tools such as Prometheus, Grafana, Datadog, and related platforms.
- Collaborate with application teams to define best practices, onboarding standards, and reusable streaming patterns.
- Lead incident response, troubleshooting, and post-incident reviews to improve operational resilience and platform reliability.
- Mentor engineers through technical reviews, knowledge sharing, and engineering best practices while maintaining detailed technical documentation.