Data Science with Rust - Arrow, DataFusion, and Ballista

Andy will explain why Rust is ideally suited for building the next generation of distributed compute platforms that are necessary for modern data science and will give an update on the current status of the various related projects that he is involved in.

Apache Arrow (https://arrow.apache.org/) defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
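To make the idea of a columnar layout concrete, here is a minimal sketch in plain Rust. It does not use the Arrow crate, and the `Int64Column` type and its methods are hypothetical names invented for illustration. It assumes a simplified picture of a nullable integer column: one contiguous buffer of values plus a validity bitmap in which a set bit marks a non-null slot. Borrowing a slice of the buffer stands in for the zero-copy reads mentioned above.

```rust
// Illustrative only: a hand-rolled column, not the Arrow crate's API.

/// A nullable 64-bit integer column: a contiguous value buffer plus a
/// validity bitmap (bit i set means slot i holds a non-null value).
struct Int64Column {
    values: Vec<i64>,
    validity: Vec<u8>,
}

impl Int64Column {
    /// Build the column from a slice of optional values.
    fn from_options(data: &[Option<i64>]) -> Self {
        let mut values = Vec::with_capacity(data.len());
        let mut validity = vec![0u8; (data.len() + 7) / 8];
        for (i, item) in data.iter().enumerate() {
            match item {
                Some(v) => {
                    values.push(*v);
                    // Mark slot i as valid (least-significant bit first).
                    validity[i / 8] |= 1 << (i % 8);
                }
                // Nulls still occupy a slot so that offsets stay fixed-width.
                None => values.push(0),
            }
        }
        Int64Column { values, validity }
    }

    /// Return true when slot i holds a non-null value.
    fn is_valid(&self, i: usize) -> bool {
        (self.validity[i / 8] & (1 << (i % 8))) != 0
    }

    /// Sum the non-null values by scanning one contiguous buffer.
    fn sum(&self) -> i64 {
        self.values
            .iter()
            .enumerate()
            .filter(|(i, _)| self.is_valid(*i))
            .map(|(_, v)| *v)
            .sum()
    }

    /// Borrow a sub-range of the value buffer without copying it.
    fn slice(&self, offset: usize, len: usize) -> &[i64] {
        &self.values[offset..offset + len]
    }
}

fn main() {
    let col = Int64Column::from_options(&[Some(10), None, Some(32), Some(-2)]);
    println!("sum = {}", col.sum());
    println!("slice = {:?}", col.slice(2, 2));
}
```

Because all values of a column sit next to each other in memory, an aggregate such as `sum` touches only the data it needs, which is the property that makes columnar formats a good fit for analytic workloads.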

DataFusion (https://docs.rs/datafusion/1.0.1/datafusion/), now part of the Arrow project, is an in-memory query engine implemented in Rust that provides SQL and DataFrame APIs for querying CSV and Parquet files, as well as custom data sources.

Ballista (https://github.com/ballista-compute/ballista) is a distributed compute platform, loosely modeled after Apache Spark and implemented primarily in Rust, that builds on Arrow and DataFusion.

Speaker: Andy Grove

Andy Grove is a PMC member of Apache Arrow, to which he donated the initial Rust implementation as well as the DataFusion query engine. More recently, he has become a contributor to Apache Spark.

#data-science #rust #programming #developer
