Pre-processing and “data wrangling” take up a lot of time, but it’s not always the most fun part of a data analysis project.Having to unpack a list comes up a lot when it comes to reformatting your initial data. Fortunately, Pandas comes with a lot of vectorized solutions to common problems, so we won’t have to stress too hard about unpacking lists in a DataFrame.In this piece, we’ll be looking at two things:
df.explode()
to unnest a column with list-like values in a DataFrame;How to use Series.str.split()
to create a list from a string.We’ll be using a modified version of this video game sales data, so you can download the csv file if you want to follow along. This time, I included the code to get the initial tables for each of the examples at the very bottom. I’d suggest going through the piece first, and then copying the code to get the inputs to try out the examples after.
Our initial table looks like this:
The goal is to separate all the values in the “Genre” column, so that each row only has one value. In terms of database normalization, this would be a step towards fulfilling the “first normal form”, where each column only contains atomic (non-divisible) values.To do so, all we need to do is use the df.explode()
function. We only need to pass one argument, which is the name of the column with the list like values.Our code is like this.
df2 = df2.explode('Genre').drop_duplicates()
A subset of the resulting DataFrame looks like this:
Now we have a table with all the different Genres of each Publisher. We also made sure to only have unique values by passing .drop_duplicates()
on the DataFrame.You’ll notice that the index value from the original DataFrame was kept for each of the rows, which goes to show that all the df.explode()
function does is separate elements in the iterable while keeping all other row values the same. We could also easily pass .reset_index()
if we wanted new index values.
#technology #data-science #programming #python #data-analysis #data analysis