In Part 1 (you can read it here), I discussed the Business Case for Predicting Visitor-to-Customer Conversion for an Online Store and covered Exploratory Data Analysis of the training dataset.

In this part, I will cover Data Preprocessing and the Application of Supervised Learning Algorithms, namely RandomForest and XGBoost, to the prepared training dataset.

So without further ado, let’s go to Data Preprocessing!

  1. Data Preprocessing

“As you sow, so shall you reap.” This proverb, so true of life in general, is just as true of Data Science! We cannot feed crappy data to our algorithms and expect them to magically give us accurate predictions.

Getting the data into a form that can be fed into a learning algorithm is one of the most vital tasks a Data Scientist performs.

As I mentioned in Part 1, the attributes in this data challenge were a mix of discrete and continuous attributes with widely varying ranges and categorical attributes with widely varying class sizes. Take a look at the short document that I have created to describe the attributes here.

The key elements of the Data Preprocessing Strategy that I used are the following:

  1. One-hot encode all categorical attributes
  2. Convert continuous and discrete attributes that are stored as strings in the dataset to floats
  3. Scale all the numerical (continuous and discrete) attributes
  4. Combine all the attributes of a data point into one list, with each value in the range 0 to 1
  5. Create a pandas DataFrame for the Training and Test Datasets

The pandas DataFrame is then used to feed data to the machine learning algorithms.
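To make the five steps concrete, here is a minimal sketch in pandas. It is not the exact code used for the challenge: the toy data, the column names (`device`, `page_views`, `session_minutes`) and the `preprocess` helper are hypothetical. It also assumes min-max scaling for step 3, since that is one simple way to land every value in the range 0 to 1.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only and are
# not taken from the challenge dataset.
raw = pd.DataFrame({
    "device": ["mobile", "desktop", "mobile", "tablet"],  # categorical
    "page_views": ["3", "10", "1", "7"],                  # discrete, stored as strings
    "session_minutes": ["1.5", "12.0", "0.5", "6.0"],     # continuous, stored as strings
})

categorical_cols = ["device"]
numeric_cols = ["page_views", "session_minutes"]


def preprocess(df, categorical_cols, numeric_cols):
    """Return a DataFrame of features, each in the range 0 to 1."""
    # Step 1: one-hot encode the categorical attributes (0.0 / 1.0 columns).
    onehot = pd.get_dummies(df[categorical_cols], dtype=float)

    # Step 2: convert numeric attributes stored as strings to floats.
    numeric = df[numeric_cols].astype(float)

    # Step 3: min-max scale each numeric column (assumed scaling method).
    col_min = numeric.min()
    span = (numeric.max() - col_min).replace(0, 1.0)  # avoid division by zero
    scaled = (numeric - col_min) / span

    # Steps 4 and 5: combine everything into one row per data point,
    # held in a single pandas DataFrame.
    return pd.concat([onehot, scaled], axis=1)


features = preprocess(raw, categorical_cols, numeric_cols)
print(features)
```

In practice the minimum and maximum learned from the training set would be reused to scale the test set, and the test set's one-hot columns would be aligned to the training columns, so that both DataFrames share the same layout.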


Pre-Processing and Applying Machine Learning Algorithms (RandomForest and XGBoost)