Core & ML Ops Team Lead

ZyteData, Web Data Extraction
Kraków, Lesser Poland Voivodeship, Poland. Rio de Janeiro, State of Rio de Janeiro, Brazil. Lisbon, Lisbon, Portugal. Budapest, Budapest, Hungary. Barcelona, Catalonia, Spain
Full-Time
Lead
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
Python, Java, Kafka, Kubernetes, C++, Airflow, Go, Rust, Linux

Requirements

  • 5+ years of experience building distributed systems
  • 3+ years in MLOps/ML platform engineering (or equivalent impact)
  • Knowledge of Linux/OS internals (process model, cgroups/namespaces)
  • Knowledge of networking (TCP/IP, HTTP/2), concurrency, and performance profiling
  • Deep understanding of Kubernetes (bonus: Mesos)
  • Proficiency in developing high-performance services in Java, Rust, Go, or C++ (bonus: familiarity with the Vert.x and Netty frameworks)
  • Strong Python skills
  • Experience with GPU infrastructure (scheduling, containerization, optimization)
  • Track record of designing and operating model platforms (registry, training, serving, monitoring) in production
  • Demonstrated success leading technical teams and implementing organization-wide platform solutions
  • Preferred: Streaming & workflows: Kafka plus Argo/Temporal/Airflow or equivalents
  • Preferred: eBPF‑based observability, perf tooling, or io_uring experience
  • Preferred: Cost optimization for ML/AI; multi‑tenant quotas and fairness
  • Preferred: Hands‑on experience authoring Golden Paths (service chassis/templates, CI/CD blueprints, CLI scaffolds)
  • Preferred: SRE practices (SLIs/SLOs, incident management)

Responsibilities

  • Design and evolve the core platform (Kubernetes, Mesos, GPU scheduling/autoscaling, distributed compute)
  • Own the model platform: registry, experiment tracking, training orchestration, evaluation, serving, and monitoring
  • Build the Golden Path: reference repos, a scaffold CLI, opinionated CI/CD pipelines, runtime contracts, high-performance clients, circuit breakers and other production‑ready defaults
  • Operate a secure, multi‑tenant model registry and training platform with standardized experiment/evaluation harnesses
  • Provide turnkey serving patterns (online + batch), drift/quality monitoring, and rollback playbooks
  • Integrate public/open‑source AI capabilities as managed platform services with cost and data‑governance guardrails
  • Run the squad: roadmap/prioritization, delivery, mentoring, and high engineering standards
  • Partner with product engineering (Zyte API, Scrapy Cloud), Prod Ops, and Security on adoption and rollout plans
  • Mentor the team and foster a platform-thinking mindset