Looking for strings to cut down your dataset for analysis and machine learning

The Pandas library is a comprehensive tool not only for crunching numbers but also for working with text data.

For many data analysis applications and machine learning exploration/pre-processing, you’ll want to either filter out or extract information from text data. To do so, Pandas offers a wide range of built-in methods that you can use to add, remove, and edit text columns in your DataFrames.

In this piece, let’s take a look specifically at searching for substrings in a DataFrame column. This may come in handy when you need to create a new category based on existing data (for example during feature engineering before training a machine learning model).

If you want to follow along, download the dataset here.

import pandas as pd

# Load the downloaded dataset (vgsales.csv must be in the working directory)
df = pd.read_csv('vgsales.csv')

Now let’s get started!

NOTE: we’ll be using _loc_ a lot in this piece, so if you’re unfamiliar with that indexer, check out the first article linked at the very bottom of this piece.
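
As a quick preview of the pattern, here is a minimal sketch of a substring check combined with _loc_. It uses a tiny made-up DataFrame rather than the downloaded file, so the rows, the "Name" column, and the new "Series" column are all assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in data; these rows and column names are assumptions,
# not taken from the downloaded dataset.
games = pd.DataFrame({
    "Name": ["Super Mario Bros.", "Pokemon Red", "Mario Kart Wii", None],
})

# Boolean mask: True where "Name" contains the literal substring "Mario".
# regex=False matches the text literally; na=False counts missing values
# as non-matches so the mask contains only True/False.
is_mario = games["Name"].str.contains("Mario", regex=False, na=False)

# Create a new category column, then use loc to overwrite matching rows.
games["Series"] = "Other"
games.loc[is_mario, "Series"] = "Mario"

# loc can also select just the matching rows of a single column.
mario_games = games.loc[is_mario, "Name"].tolist()
print(games)
```

The mask is built once and reused, which keeps the filter readable and makes it easy to apply the same condition both for selecting rows and for assigning new values.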


Check For a Substring in a Pandas DataFrame Column