Unified Data Preprocessing & ML Pipelines with Ray Datasets

Speakers: Alex Wu, Clark Zinzow

Summary
ML tasks such as distributed training and batch inference stretch the abstractions of modern data processing systems, forcing tradeoffs in either performance or learning efficiency. In this talk, we introduce Ray Datasets, a universal compatibility layer built on Arrow and Python that allows data processing to be combined with ML pipelines without such tradeoffs.