Some of the most helpful Pandas tricks

In this article, you’ll learn some of the most helpful Pandas tricks to speed up your data analysis.

  1. Select columns by data types
  2. Convert strings to numbers
  3. Detect and handle missing values
  4. Convert a continuous numerical feature into a categorical feature
  5. Create a DataFrame from the clipboard
  6. Build a DataFrame from multiple files

Please check out my GitHub repo for the source code.

1. Select columns by data types

Here are the data types of the Titanic DataFrame:

df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Let’s say you need to select the numeric columns.

df.select_dtypes(include='number').head()

The result includes both the int and float columns. You can also use this method to:

  • select just object columns
  • select multiple data types
  • exclude certain data types

# select just object columns
df.select_dtypes(include='object')

# select multiple data types
df.select_dtypes(include=['int', 'datetime', 'object'])

# exclude certain data types
df.select_dtypes(exclude='int')
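
To see what each call returns, here is a minimal, self-contained sketch. The `toy` DataFrame is a hypothetical stand-in for a few Titanic columns, not the real dataset, and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data mimicking a few Titanic columns (not the real dataset)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# 'number' matches every numeric dtype, so int and float columns are both kept
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# 'float' keeps only the floating-point columns
float_cols = toy.select_dtypes(include="float").columns.tolist()

# exclude works the other way round: drop the numeric columns, keep the rest
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(float_cols)
print(non_numeric_cols)
```

Calling `.columns.tolist()` on the result is a handy way to get just the column names, for example to pass them on to a later step.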

2. Convert strings to numbers

There are two ways to convert strings to numbers in Pandas:

  • the astype() method
  • the to_numeric() method

Let’s create an example DataFrame to have a look at the difference.

import pandas as pd

df = pd.DataFrame({'product': ['A', 'B', 'C', 'D'],
                   'price': ['10', '20', '30', '40'],
                   'sales': ['20', '-', '60', '-']
                  })

The price and sales columns are stored as strings and so result in object columns:

df.dtypes

product    object
price      object
sales      object
dtype: object

We can use the first method, astype(), to convert the price column as follows:

# Use Python type
df['price'] = df['price'].astype(int)

# alternatively, pass { col: dtype }
df = df.astype({'price': 'int'})

However, calling astype(int) on the sales column raises a ValueError, because the '-' placeholder cannot be parsed as an integer. To handle that, we can use to_numeric() with the argument errors='coerce':

df['sales'] = pd.to_numeric(df['sales'], errors='coerce')

Now the invalid '-' values are converted to NaN, and the column's data type is float.
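
The difference between the two methods can be checked side by side. The sketch below rebuilds only the sales column as a standalone Series; the names `astype_failed`, `coerced`, and `missing_count` are illustrative and do not come from the original code.

```python
import pandas as pd

# Same values as the 'sales' column in the example DataFrame
sales = pd.Series(["20", "-", "60", "-"], name="sales")

# astype(int) is strict: a single unparseable value aborts the conversion
try:
    sales.astype(int)
    astype_failed = False
except ValueError:
    astype_failed = True

# to_numeric with errors='coerce' replaces unparseable values with NaN instead
coerced = pd.to_numeric(sales, errors="coerce")

# Count how many values could not be converted
missing_count = int(coerced.isna().sum())

print(astype_failed)
print(coerced)
print(missing_count)
```

Because NaN is a floating-point value, the coerced column cannot stay an integer column, which is why the result ends up as float.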
