Rust is the most loved language according to the Stack Overflow Developer Survey, where it has topped the list for four years in a row. Data processing is getting simpler and faster with frameworks like Apache Spark. However, the field of data processing is competitive. DataFusion (now part of Apache Arrow) is one of the first attempts to bring data processing to Rust. If you are interested in learning some aspects of data processing in Rust, I will show some code examples with DataFusion and compare the query performance of DataFusion and Pandas.

DataFusion

Andy Grove created DataFusion, and he has written some great articles about building modern distributed computing platforms, for example, How To Build A Modern Distributed Compute Platform. The DataFusion project is not ready for production use yet, as Andy mentioned:

“This project is a great way to learn about building a query engine, but this is quite early and not usable for any real-world work just yet.”

The project was donated to the Apache Arrow project in February 2019, and more people have started contributing to the Arrow version of DataFusion since then.

DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.
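To make "in-memory" and "columnar memory model" a bit more concrete, here is a minimal conceptual sketch in plain Rust. It does not use the arrow or datafusion crates, and every name in it (ColumnarTable, sum_amount_where_city, sample_table) is hypothetical. Ordinary vectors stand in for Arrow arrays, and the method shows roughly the work a query such as SELECT SUM(amount) FROM t WHERE city = 'Oslo' asks an engine to do over column-oriented data.

```rust
// Conceptual sketch only: plain Rust vectors stand in for Arrow arrays.
// None of these names come from the arrow or datafusion crates.

/// A hypothetical in-memory table stored column by column.
struct ColumnarTable {
    city: Vec<String>,
    amount: Vec<f64>,
}

impl ColumnarTable {
    /// Roughly what `SELECT SUM(amount) FROM t WHERE city = ?` computes:
    /// scan the filter column and sum the matching entries of the value column.
    fn sum_amount_where_city(&self, wanted: &str) -> f64 {
        self.city
            .iter()
            .zip(self.amount.iter())
            .filter(|(city, _)| city.as_str() == wanted)
            .map(|(_, amount)| *amount)
            .sum()
    }
}

/// Small made-up data set used purely for illustration.
fn sample_table() -> ColumnarTable {
    ColumnarTable {
        city: vec!["Oslo".to_string(), "Bergen".to_string(), "Oslo".to_string()],
        amount: vec![10.0, 5.5, 20.0],
    }
}

fn main() {
    let table = sample_table();
    println!("sum for Oslo: {}", table.sum_amount_where_city("Oslo"));
}
```

A real engine adds SQL parsing, query planning, typed Arrow buffers, and batching on top of this idea, so treat the sketch as an illustration of the data layout and not as a description of how DataFusion is implemented.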

The project description may not sound too exciting, but since the entire project is written in Rust, it gives you ideas about how to write your analytics SQL in Rust. Additionally, you can easily add DataFusion as a library dependency in the Cargo.toml file of your Rust project.

Initial Setup

To try out some code with DataFusion, we first need to create a new Rust package:

cargo new datafusion_test --bin

Then add DataFusion as a dependency in the Cargo.toml file:

[dependencies]
arrow = "0.15.0"
datafusion = "0.15.0"

towardsdatascience.com

Data Processing In Rust With DataFusion (Arrow)