LEGO is a popular brand of toy building bricks. They are often sold in sets designed to build a specific object. Each set is aimed at a particular age group, is built around a theme, and contains a different number of pieces. Each set also has its own rating and price. Using this data, we want to design a Linear Regression model with Knime that can predict the price of a given Lego set.
The Lego dataset we are using contains the following features:
| Feature | Description | Data Type |
| --- | --- | --- |
| age | Age category the set belongs to | String |
| list_price | Price of the set (in $) | Double |
| num_reviews | Number of reviews per set | Integer |
| piece_count | Number of pieces in the set | Integer |
| play_star_rating | Play star rating | Double |
| review_difficulty | Difficulty level of the set | String |
| star_rating | Overall star rating | Double |
| theme_name | Theme the set belongs to | String |
| val_star_rating | Value star rating | Double |
| country | Country name | String |
Having a look at the data, you may notice that some of the features in the dataset are textual (nominal) in nature. In that raw form, they can't be used by a linear regression model.
So, once the file is read into Knime using a **File Reader** node, we need to apply the first pre-processing step to the data: we take the features with nominal values and map every category in each feature to an integer. Knime's Category to Number node does the job for us.
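For readers who want to follow along outside of Knime, here is a minimal pandas sketch of the same mapping step. The file name `lego_sets.csv` is a placeholder, and the nominal column names are taken from the feature table above:

```python
import pandas as pd

# Placeholder file name; the post itself reads the data with Knime's File Reader node.
df = pd.read_csv("lego_sets.csv")

# Map every category in each nominal feature to an integer,
# mirroring what the Category to Number node does.
nominal_cols = ["age", "review_difficulty", "theme_name", "country"]
for col in nominal_cols:
    df[col] = df[col].astype("category").cat.codes
```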
Now, our complete dataset is in a numerical format, so the next step is to remove any numeric outliers that may exist. Outliers are extreme values in a feature that deviate markedly from the other observations. They might exist due to experimental errors or variability in measurement, and they need to be removed because they can skew the statistics computed on the data. Knime's Numeric Outliers node gives us the option to remove the rows containing outliers.
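The Numeric Outliers node fences values using the interquartile range; a rough pandas equivalent, assuming the common 1.5 × IQR fence and the numeric columns from the table above, could look like this:

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any of `cols` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

df = remove_iqr_outliers(df, ["list_price", "num_reviews", "piece_count"])
```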
After the outliers are removed, the next step is to use Knime's Missing Value node, which lets us replace all missing values in a feature with a fixed value, the feature's mean, or another statistic.
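In pandas, mean imputation, one of the strategies the Missing Value node offers, is a one-liner per column:

```python
# Replace missing values in every numeric column with that column's mean.
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].mean())
```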
A Linear Regression model works under the assumption that the independent features are not correlated with one another; correlation should exist only between the independent features and the target feature. If multicollinearity exists, the overall performance of the model is affected.
To calculate the correlation between the independent features, we configure the Rank Correlation node to use Spearman's Rank Correlation. The output of the node is a **correlation matrix**.
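The same matrix can be computed directly with pandas (assuming `df` is fully numeric at this point, as it is after the category mapping above):

```python
# Spearman rank correlation between all feature pairs,
# analogous to the Rank Correlation node's output matrix.
corr = df.corr(method="spearman")
print(corr.round(2))
```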
From the output, it is clear that some independent features are highly correlated with each other. To filter these columns out, we use Knime's Correlation Filter node, which allows us to set a threshold on the correlation values of the output matrix. It filters out the columns whose correlation exceeds the threshold.
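Here is a sketch of the same thresholding logic in Python; the 0.8 cut-off is an assumed value, not necessarily the one used in the original workflow:

```python
import numpy as np

threshold = 0.8  # assumed cut-off; tune to match your workflow

# Look only at the upper triangle so each pair is checked once,
# then collect one column from every pair above the threshold.
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```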
From the output of the above node, it is clear that we don't want to keep the star_rating, theme_name, and val_star_rating features. So, applying the Column Filter node to our cleaned data, we filter out the unwanted features.
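In pandas, dropping those three columns is a single call:

```python
# Drop the highly correlated features identified above.
df = df.drop(columns=["star_rating", "theme_name", "val_star_rating"])
```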
Finally, we have our dataset in a form that can be used for training and testing a linear regressor. The last step before that is to split the complete data into train and test sets. To do so, we use Knime's **Partitioning** node. In its configuration, we specify a random split with 70% as our train data and the remainder as our test data.
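With scikit-learn, the same random 70/30 split could be sketched as follows (the `random_state` is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Random 70/30 split, mirroring the Partitioning node's configuration.
train_df, test_df = train_test_split(df, train_size=0.7, random_state=42)
```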
Knime provides a Linear Regression Learner node and a **Regression Predictor** node. We feed the train data from the Partitioning node to the Learner node, and it produces a trained regression model.
Then, we feed the output model and the test data to the Predictor node, which churns out the predicted prices for the Lego sets.
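A compact scikit-learn sketch of the Learner/Predictor pair, assuming `list_price` is the target and every remaining column is a feature:

```python
from sklearn.linear_model import LinearRegression

# The Learner step: fit a linear model on the training data.
features = [c for c in train_df.columns if c != "list_price"]
model = LinearRegression().fit(train_df[features], train_df["list_price"])

# The Predictor step: predict prices for the unseen test sets.
predictions = model.predict(test_df[features])
```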