Pandas users: here is how you eat your bamboo in chunks and strip away unnecessary parts ASAP.

The Yelp data set has become a popular choice for data science projects, possibly because it is 1) relevant to daily life and 2) ‘realistically’ sized, like production data. For example, yelp_academic_dataset_review.json alone occupies 5.9 GB on disk, enough to give a modern laptop pause when trying to load it.

In this article, I will share some tips on how to load the Yelp JSON files efficiently. Hopefully, these tips are transferable to other projects that you may pursue!

All code blocks displayed below can be copied into a Jupyter notebook cell and executed in a self-contained manner. A benchmark is included at the end to demonstrate the substantial speed-up achieved.

(Addendum, April 4th, 2021) Technically, the Yelp files are **line-delimited** JSON files: each line represents a separate JSON object, instead of the whole file being a single, gigantic JSON object. A proper extension for such files is .jsonl, but companies may not make the distinction when releasing data. Thanks to Ron Li for pointing these things out.
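Because each line is its own JSON object, pandas can read such a file a few lines at a time rather than all at once. The snippet below is a minimal sketch of that idea, not necessarily the exact approach used in the full article: it writes a tiny stand-in file (the field names `review_id`, `stars`, and `text` and the helper `load_in_chunks` are illustrative assumptions), reads it in chunks with `pd.read_json(..., lines=True, chunksize=...)`, and drops the unneeded columns from each chunk before concatenating.

```python
import json
import os
import tempfile

import pandas as pd

# Tiny stand-in for a line-delimited JSON file.
# The field names here are illustrative assumptions, not the real Yelp schema.
records = [
    {"review_id": f"r{i}", "stars": i % 5 + 1, "text": "long review text " * 3}
    for i in range(10)
]
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_reviews.json")
with open(sample_path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")  # one JSON object per line


def load_in_chunks(path, keep_cols, chunksize=4):
    """Read a line-delimited JSON file chunk by chunk, keeping only keep_cols."""
    # With lines=True and a chunksize, read_json returns an iterator of DataFrames.
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    # Strip unneeded columns from each chunk right away to limit memory use.
    parts = [chunk[keep_cols] for chunk in reader]
    return pd.concat(parts, ignore_index=True)


df = load_in_chunks(sample_path, keep_cols=["review_id", "stars"])
print(df.shape)
```

For a real multi-gigabyte file, a much larger `chunksize` (tens or hundreds of thousands of lines) would be a more sensible starting point; the right value depends on available memory.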


Load Yelp reviews (or other huge JSON files) with ease