Featherless AI

Private Company

Open Positions: 13

Remote (world) · Full-Time
  • Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
  • Improve distributed training strategies (data, model, and pipeline parallelism)
  • Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
  • Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
  • Collaborate with researchers on architecture-aware training strategies
  • Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
  • Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)
  • Own training performance metrics and continuously push them forward
PyTorch · Distributed Systems
