Background & Use Case

As part of my New Year's resolution, I am continuing my exploration of Rust. Here is my previous article about Rust:

https://medium.com/@jayhuang75/file-processing-go-vs-rust-6e210a3168fd

Today’s use case is a dataset from Kaggle, which contains bike share data for Montreal from 2016 to 2019.

The dataset has the following fields:

start_date: Date and time of the start of the trip (AAAA-MM-JJ hh:mm)

start_station_code: Start station ID

end_date: Date and time of the end of the trip (AAAA-MM-JJ hh:mm)

end_station_code: End station ID

is_member: Type of user (1: Subscriber, 0: Non-subscriber)

duration_sec: Total travel time in seconds
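To make the schema concrete, here is a minimal sketch of how one row could be modeled in Rust with serde. The struct and its field types are my own assumption based on the list above, not taken from the article's code; the names are assumed to match the CSV headers.

```rust
use serde::Deserialize;

// One row of the trip dataset. Field names are assumed to match the
// CSV headers listed above; adjust them if the files differ.
#[derive(Debug, Deserialize)]
struct Record {
    start_date: String,
    start_station_code: String,
    end_date: String,
    end_station_code: String,
    is_member: u8,
    duration_sec: u32,
}
```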

The dataset consists of 35 files, totaling around 1 GB.

For a simple use case, we want to find all the rows where is_member is set and get the total count.

The output should be:

[Image: output of the processing run]
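For reference, a naive single-threaded pass over one file could look like the sketch below. It assumes the csv crate and the Record struct sketched above; count_members is just an illustrative helper name, not the author's code. This is the baseline that the worker pool design below sets out to beat.

```rust
use std::error::Error;

// Count the rows where `is_member == 1` in a single CSV file,
// using the `csv` crate and the Record struct sketched above.
fn count_members(path: &str) -> Result<u64, Box<dyn Error>> {
    let mut reader = csv::Reader::from_path(path)?;
    let mut count = 0u64;
    for row in reader.deserialize::<Record>() {
        if row?.is_member == 1 {
            count += 1;
        }
    }
    Ok(count)
}
```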

Application design

It is all about the design. Application design is important, and thinking things through before you start coding matters even more.

Good software design and architecture come from the whiteboard and not from the keyboard.

In order to get the most out of a single machine, we will use a worker pool together with Rust's built-in iterators.


Create a channel of unbounded capacity.
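A minimal sketch of this step, assuming the crossbeam-channel crate provides the unbounded channel. The message type (the path of each file to process) and the "data" directory are illustrative choices, not taken from the article:

```rust
use crossbeam_channel::unbounded;

fn main() {
    // Channel of unbounded capacity: sends never block, and every
    // worker thread will get its own clone of the receiver.
    let (sender, receiver) = unbounded::<String>();

    // Queue the path of every CSV file (the "data" directory is just
    // an example location for the 35 files).
    for entry in std::fs::read_dir("data").expect("data dir") {
        let path = entry.expect("dir entry").path();
        sender.send(path.display().to_string()).expect("send path");
    }
    // Dropping the sender closes the channel once the queue is drained.
    drop(sender);

    // `receiver` is handed to the worker pool in the next step.
    let _ = receiver;
}
```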

Based on the number of CPUs on your machine, create the worker pool.
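A sketch of this step, assuming the num_cpus crate is used to size the pool and plain std threads as the workers; run_workers is a hypothetical helper name, and count_members is the helper sketched earlier:

```rust
use crossbeam_channel::Receiver;
use std::thread;

// Spawn one worker per logical CPU. Each worker pulls file paths from
// the shared channel, counts the member rows in that file with the
// `count_members` helper sketched earlier, and the partial counts are
// summed once every worker has finished.
fn run_workers(receiver: Receiver<String>) -> u64 {
    let workers = num_cpus::get();
    let mut handles = Vec::with_capacity(workers);

    for _ in 0..workers {
        let rx = receiver.clone();
        handles.push(thread::spawn(move || {
            let mut local = 0u64;
            // recv() returns Err once the channel is closed and empty.
            while let Ok(path) = rx.recv() {
                local += count_members(&path).unwrap_or(0);
            }
            local
        }));
    }

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Because the channel is unbounded and cloned, the file paths are naturally load-balanced: whichever worker finishes its file first simply pulls the next one.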

