Fairly new to Predictive Algorithms, I realized that knowing how to feed in data to the algorithm and letting it effortlessly predict and display the ‘accuracy score’ of the model is not the best approach.

Image for post

When I was learning how to implement some of the Machine Learning algorithms, I mostly worked with ‘all-numeric’ datasets. Later, while experimenting with a dataset to understand how different algorithms work, I realized that there was a catch. Processing a dataset with certain Predictive Models, when some of the columns are ‘categorical’ i.e. textual, throws an error in Python. This led me to investigate how to work with categorical features.

One of the most obvious solutions to tackling categorical data is to convert them into numerical representations. Say, a column Colors [Red, Blue, Green and Black] could instead be represented by numbers 1,2,3,4, each number representing a color. This process of assigning a series of numbers to categorical values is called Integer Coding, performed using Label Encoder in Python. The process, in hindsight, assigns an order 0<1<2 < 3. For columns like Colors, there seems to be no particular order and therefore it makes more sense to interpret and represent them in a different manner.

Original table:

Image for post

Integer Coded table using LabelEncoder in Python:

Image for post

When you convert all your categorical columns into numeric data, the code snippet runs without errors. However, some may not realize that the accuracy score may have suffered as a result. If your Column consisted of Overall Ranks[Overall Ranks:’first’,’second’,’third’], ’first’ to 1, ‘second’ to 2 and ‘third’ to 3 would be sensible since the Ranks in fact do have an order among themselves.

So what is the alternative?

The get_dummies function, which comes in the pandas package in Python, is used to convert categorical variables to dummy variables. A new column is formed for each unique value. The Column Colors would be split into 4 columns: Red, Blue, Green and Black. A row value with color Red will be represented as 1,0,0,0; a row value with color Green will be represented as 0,0,1,0, and so on.

#label-encoding #categorical-data #random-forest #data-analysis #data analysis

Working with Categorical Predictors
1.10 GEEK