Kafka Platform Engineer

This is a fully remote opportunity within the continental United States
Full-Time
Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
Python, Bash, Apache Kafka, Go, Grafana, Prometheus, Terraform, Ansible, Datadog

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of hands-on experience operating Apache Kafka or Confluent Platform in production environments.
  • Deep understanding of Kafka internals including partitions, replication, ISRs, and consumer groups.
  • Strong expertise in Kafka security practices including SASL, mTLS, ACLs, and RBAC.
  • Experience with Kafka Connect, Schema Registry, Kafka Streams, or ksqlDB in enterprise environments.
  • Strong scripting and automation skills using Python, Bash, or Go.
  • Experience with Infrastructure as Code tools such as Terraform and Ansible.
  • Knowledge of observability and monitoring solutions for distributed systems and streaming platforms.
  • Familiarity with high availability, disaster recovery, and multi-region streaming architectures.
  • Excellent troubleshooting, communication, and documentation abilities.

Responsibilities

  • Architect, deploy, and maintain large-scale Apache Kafka and Confluent Platform environments across cloud and on-premises infrastructure.
  • Design scalable partitioning, replication, and topic management strategies to optimize throughput, durability, and operational efficiency.
  • Implement and manage platform security using SASL, mTLS, ACLs, RBAC, and identity provider integrations.
  • Operate and optimize ecosystem components such as Schema Registry, Kafka Connect, ksqlDB, and Kafka Streams for production-grade streaming workloads.
  • Develop CI/CD and GitOps workflows for topic management, connectors, and infrastructure automation.
  • Build high-availability and disaster recovery strategies including multi-region replication and failover patterns.
  • Implement observability and monitoring solutions using tools such as Prometheus, Grafana, Datadog, and related platforms.
  • Collaborate with application teams to define best practices, onboarding standards, and reusable streaming patterns.
  • Lead incident response, troubleshooting, and post-incident reviews to improve operational resilience and platform reliability.
  • Mentor engineers through technical reviews, knowledge sharing, and engineering best practices while maintaining detailed technical documentation.