In this article, I will talk about the different types of categorical data and how to approach a problem with categorical variables.

What are categorical variables?

Categorical variables (or features) can be classified into two major types:

  • Nominal
  • Ordinal


Nominal variables are variables that have two or more categories which do not have any kind of order associated with them.

For example, if gender is classified into two groups, i.e. male and female, it can be considered as a nominal variable.

Ordinal variables, on the other hand, have “levels” or categories with a particular order associated with them. For example, an ordinal categorical variable can be a feature with three different levels: low, medium and high. Order is important.
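
The difference between the two types can be sketched in pandas, which lets a categorical column be flagged as ordered or unordered. This is only a minimal illustration; the series names `gender` and `level` and their values are made up for the example.

```python
import pandas as pd

# Nominal: categories with no order attached (illustrative values).
gender = pd.Series(pd.Categorical(["male", "female", "female", "male"]))

# Ordinal: same kind of object, but with an explicit order low < medium < high.
level = pd.Series(
    pd.Categorical(
        ["low", "high", "medium", "low"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)

print(gender.cat.ordered)        # False
print(level.cat.ordered)         # True
print(level.max())               # high
print((level > "low").tolist())  # [False, True, True, False]
```

Comparisons such as `level > "low"` are only meaningful for the ordered column; pandas raises an error if the same comparison is attempted on the unordered one.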

As far as definitions are concerned, we can also categorize categorical variables as binary, i.e., a categorical variable with only two categories. Some even talk about a type called “cyclic” for categorical variables. Cyclic variables occur in “cycles”, for example, the days of the week: Sunday, Monday, Tuesday, Wednesday, Thursday, Friday and Saturday. After Saturday, we have Sunday again. This is a cycle.

Another example would be hours in a day if we consider them to be categories. There are many different definitions of categorical variables, and many people talk about handling categorical variables differently depending on the type of categorical variable.

However, I do not see any need for it. All problems with categorical variables can be approached in the same way.

There are many ways we can encode these categorical variables as numbers and use them in an algorithm. I will cover most of them in this post, from the basic to the more advanced. The encodings covered are:

1) One Hot Encoding

2) Label Encoding

3) Ordinal Encoding

4) Helmert Encoding
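
To give a rough feel for the first three encodings, here is a minimal sketch using pandas only. The toy `temperature` column, its values, and the order chosen in `order` are assumptions made for illustration. Helmert encoding is left out of this sketch.

```python
import pandas as pd

# Toy column with illustrative values.
temperature = pd.Series(
    ["Hot", "Cold", "Very Hot", "Warm", "Hot"], name="Temperature"
)

# 1) One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(temperature, prefix="Temp", dtype=int)

# 2) Label encoding: one integer per category, with no meaning to the order
#    (sort=True simply assigns codes alphabetically).
label_codes, label_categories = pd.factorize(temperature, sort=True)

# 3) Ordinal encoding: integers follow an order that we specify ourselves.
order = {"Cold": 1, "Warm": 2, "Hot": 3, "Very Hot": 4}
ordinal = temperature.map(order)

print(one_hot)
print(label_codes.tolist())  # [1, 0, 2, 3, 1]
print(ordinal.tolist())      # [3, 1, 4, 2, 3]
```

Note how label encoding puts "Hot" (1) below "Warm" (3) purely because of alphabetical sorting, while the ordinal mapping keeps the order Cold < Warm < Hot < Very Hot.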

For explanation, I will use a data-frame that has two independent variables or features (Temperature and Color) and one label (Target). It also has Rec-No, which is the sequence number of the record. There is a total of 10 records in this data-frame, which is created in Python.
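
A data-frame of this shape could be built as in the sketch below. The column names come from the description above, but the individual values are placeholders chosen for illustration and should not be read as the original records.

```python
import pandas as pd

# Placeholder values: only the column names and the record count
# (10) are taken from the text.
data = {
    "Rec-No": list(range(1, 11)),
    "Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                    "Warm", "Warm", "Hot", "Hot", "Cold"],
    "Color": ["Red", "Yellow", "Blue", "Blue", "Red",
              "Yellow", "Red", "Yellow", "Yellow", "Yellow"],
    "Target": [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
}

df = pd.DataFrame(data)
print(df.shape)  # (10, 4)
print(df.head())
```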


