As Data Scientists, our everyday work consists of pulling data, understanding it, cleaning it, transforming it, and creating new features. Notice that I did not include creating a machine learning model? That is because creating a model is usually the last thing we do, and it is not necessarily everyday work. Cleaning data, however, is daily work.

For that reason, I want to present three Pandas tricks that make your data work a little bit easier.


1. Using query for data selection

Data selection is one of the most essential activities you do as a Data Scientist, yet it is also one of the most tedious, especially when it is done repeatedly. Let me show you an example.

# Let's use an example dataset
import pandas as pd
import seaborn as sns

mpg = sns.load_dataset('mpg')
mpg.head()


Above is our example dataset. Let's say I want to select the rows that either have mpg less than 11, or have horsepower less than 50 and model_year equal to 73. This means I need to write code like the one below.

mpg[(mpg['mpg'] < 11) | ((mpg['horsepower'] < 50) & (mpg['model_year'] == 73))]

[Figure: typical data selection method result]

This is the usual way of selecting data, but sometimes it is a hassle because of how wordy the condition is. In this case, we could use the query method of the Pandas DataFrame object.

So, what is this query method? It is a selection method of the Pandas DataFrame that lets you write the condition in more human-readable words. Let me show you an example below.

mpg.query('mpg < 11 or horsepower < 50 and model_year == 73')

[Figure: the result from the query method]

The result is exactly the same as with the usual selection method, right? The only difference is that with query the condition is less wordy: we write it as a string, and the query method accepts plain English words such as or and and, as in the example.
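To make the equivalence concrete, here is a minimal sketch that runs both selections on a small hand-made frame. The frame and its values are illustrative stand-ins for the seaborn mpg dataset, not the real data, and the variable names are my own. It also shows that query can reference a local Python variable with the @ prefix.

```python
import pandas as pd

# Small hand-made frame standing in for the mpg dataset (values are illustrative).
cars = pd.DataFrame({
    "mpg": [9.0, 10.0, 26.0, 29.0, 18.0],
    "horsepower": [193.0, 215.0, 46.0, 49.0, 46.0],
    "model_year": [70, 70, 73, 73, 70],
})

# Boolean-mask selection. & binds tighter than |, and the extra
# parentheses make that grouping explicit.
by_mask = cars[
    (cars["mpg"] < 11)
    | ((cars["horsepower"] < 50) & (cars["model_year"] == 73))
]

# The same condition as a query string. `and` binds tighter than `or`,
# so the grouping matches the mask above.
by_query = cars.query("mpg < 11 or horsepower < 50 and model_year == 73")

# A local Python variable can be referenced inside the string with @.
year = 73
by_query_var = cars.query("mpg < 11 or (horsepower < 50 and model_year == @year)")

print(by_mask.equals(by_query))
print(by_query.equals(by_query_var))
```

The last row of the frame has horsepower below 50 but a model_year of 70, so it is excluded by all three selections. That is a quick way to check that the and part is evaluated before the or part.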

Another difference between the usual selection method and the query method can be the execution time. When I timed both selections on the mpg dataset, the usual selection method took 18ms and the query method took 13ms. In this case, the query method was the quicker selection method, although timings like these vary with the machine, the size of the data, and the Pandas setup.
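If you want to check the timing on your own machine, a rough sketch with the standard timeit module could look like the following. The synthetic frame, its size, and the function names are assumptions made for illustration. The numbers you get will differ from the ones quoted above, and which method wins may depend on the data size and on whether the optional numexpr package is installed.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data with the same column names as the mpg dataset (assumed ranges).
rng = np.random.default_rng(0)
n_rows = 100_000
cars = pd.DataFrame({
    "mpg": rng.uniform(9, 47, n_rows),
    "horsepower": rng.uniform(46, 230, n_rows),
    "model_year": rng.integers(70, 83, n_rows),
})


def select_with_mask():
    """Select rows with a boolean mask."""
    return cars[
        (cars["mpg"] < 11)
        | ((cars["horsepower"] < 50) & (cars["model_year"] == 73))
    ]


def select_with_query():
    """Select the same rows with DataFrame.query."""
    return cars.query("mpg < 11 or horsepower < 50 and model_year == 73")


# Take the best of three repeats of ten calls each, reported per call.
mask_seconds = min(timeit.repeat(select_with_mask, number=10, repeat=3)) / 10
query_seconds = min(timeit.repeat(select_with_query, number=10, repeat=3)) / 10

print(f"boolean mask: {mask_seconds * 1e3:.2f} ms per call")
print(f"query:        {query_seconds * 1e3:.2f} ms per call")
```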


2. Replace values with replace, mask and where

While working with data, I am sure there are times when you need to replace some values in your columns with other specific values. It can be quite bothersome to do this manually. Let's say that in the mpg dataset from before I want to replace every integer value in the cylinders column with the corresponding word as a string. Let me give you an example of how to replace the values manually.

def change_value(x):
    if x == 3:
        return 'Three'
    elif x == 4:
        return 'Four'
    elif x == 5:
        return 'Five'
    elif x == 6:
        return 'Six'
    else:
        return 'Eight'

# Apply the function to every value in the cylinders column
mpg['cylinders'] = mpg['cylinders'].apply(change_value)

mpg.head()
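For comparison with the manual function above, here is a minimal sketch of the three methods named in this section's title, using a small hand-made frame rather than the real mpg dataset. The mapping dictionary, the new column name, and the cap of 30 are assumptions chosen for illustration.

```python
import pandas as pd

# Small hand-made frame standing in for the mpg dataset (values are illustrative).
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 33.5, 44.3],
    "cylinders": [8, 6, 4, 3, 5],
})

# replace: map each integer to its word with a dictionary.
number_words = {3: "Three", 4: "Four", 5: "Five", 6: "Six", 8: "Eight"}
cars["cylinders_word"] = cars["cylinders"].replace(number_words)

# mask: replace the values where the condition is True.
capped_with_mask = cars["mpg"].mask(cars["mpg"] > 30, 30.0)

# where: keep the values where the condition is True, replace the rest.
capped_with_where = cars["mpg"].where(cars["mpg"] <= 30, 30.0)

print(cars)
print(capped_with_mask.equals(capped_with_where))
```

One difference from the manual function is worth noting: its else branch turns every unlisted value into 'Eight', while replace with a dictionary leaves values that are not in the dictionary unchanged.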
