Parquet vs CSV: Advantages for Data Analysis

Speaker: Matthew Powers

Summary
Parquet files are well supported by most languages and libraries, are easier to work with, and are typically more performant than CSV files. This talk summarizes the main benefits of Parquet files and uses benchmarks to show how they're faster. You'll also learn how to convert CSV files to Parquet.

Description
5 reasons Parquet files are better than CSV:

schema - examine how the schema is embedded in the file metadata leveraging PyArrow
file sizes - compare file sizes when identical data is written to CSV and Parquet
columnar file format - examine performance benefits from leveraging column pruning to skip data
predicate pushdown filtering - understand how to query row group metadata with PyArrow and how to skip entire row groups based on column metadata
immutable - why immutable file formats are better

How to convert CSV files to Parquet with Pandas, Dask, and PySpark, including how to convert a single file or multiple files in parallel.

When to use CSV files and when to avoid them.

Matthew Powers's Bio
Powers is a tech evangelist at Coiled.

He used Spark / PySpark for 6 years and now helps devs understand when Dask is a better fit.

He's written two books, has a popular blog, and regularly contributes to open source codebases.

In a past life, he passed all three CFA exams and worked in finance.

GitHub: https://github.com/MrPowers/

#csv #data-analysis #py #pydata
