Core & ML Ops Team Lead
Zyte
Data, Web Data Extraction
Kraków, Lesser Poland Voivodeship, Poland; Rio de Janeiro, State of Rio de Janeiro, Brazil; Lisbon, Lisbon, Portugal; Budapest, Budapest, Hungary; Barcelona, Catalonia, Spain
Full-Time
Lead
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: Python, Java, Kafka, Kubernetes, C++, Airflow, Go, Rust, Linux
Requirements
- 5+ years of experience building distributed systems
- 3+ years in MLOps/ML platform engineering (or equivalent impact)
- Knowledge of Linux/OS internals (process model, cgroups/namespaces)
- Knowledge of networking (TCP/IP, HTTP/2), concurrency, and performance profiling
- Deep understanding of Kubernetes (bonus: Mesos)
- Proficiency in developing high-performance services in Java, Rust, Go, or C++ (bonus: familiarity with the Vert.x and Netty frameworks)
- Strong Python skills
- Experience with GPU infrastructure (scheduling, containerization, optimization)
- Track record of designing and operating model platforms (registry, training, serving, monitoring) in production
- Demonstrated success leading technical teams and implementing organization-wide platform solutions
- Preferred: Streaming & workflows: Kafka plus Argo/Temporal/Airflow or equivalents
- Preferred: eBPF‑based observability, perf tooling, or io_uring experience
- Preferred: Cost optimization for ML/AI; multi‑tenant quotas and fairness
- Preferred: Hands‑on experience authoring Golden Paths (service chassis/templates, CI/CD blueprints, CLI scaffolds)
- Preferred: SRE practices (SLIs/SLOs, incident management)
Responsibilities
- Design and evolve the core platform (Kubernetes, Mesos, GPU scheduling/autoscaling, distributed compute)
- Own the model platform: registry, experiment tracking, training orchestration, evaluation, serving, and monitoring
- Build the Golden Path: reference repos, a scaffold CLI, opinionated CI/CD pipelines, runtime contracts, high-performance clients, circuit breakers and other production‑ready defaults
- Operate a secure, multi‑tenant model registry and training platform with standardized experiment/evaluation harnesses
- Provide turnkey serving patterns (online + batch), drift/quality monitoring, and rollback playbooks
- Integrate public/open‑source AI capabilities as managed platform services with cost and data‑governance guardrails
- Run the squad: roadmap/prioritization, delivery, mentoring, and high engineering standards
- Partner with product engineering (Zyte API, Scrapy Cloud), Prod Ops, and Security on adoption and rollout plans
- Mentor the team and foster a platform-thinking mindset