Normally, we face data sets that are fairly linear or can be manipulated into one. But what if the data set we are examining really should be looked at in a nonlinear way? Step into the world of nonlinear feature engineering. First, we’ll look at examples of nonlinear data. Next, we’ll briefly discuss the K-means algorithm as a means to nonlinear feature engineering. Lastly, we’ll stack K-means on top of logistic regression to build a superior classification model.

Examples of Nonlinear Data

Nonlinear data occurs quite often in the business world. Examples include segmenting group behavior (marketing), patterns in inventory by group activity (sales), and anomaly detection from previous transactions (finance). For a more concrete example (supply chain / logistics), we can even see it in a visualization of truck driver data plotting speeding against distance:

From a quick glance, we can see that there are at least two groups within this data set, split between drivers above and below a distance of 100. Intuitively, we can see that fitting a linear model here would be horrendous, so we need a different type of model. Applying K-means, we can actually find four groups, as seen below [1]:

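As a rough sketch of what that clustering step might look like in code, here is how you could group a driver table like the one above with scikit-learn. The data and the column names (distance, speeding) are made up purely for illustration; they are not the referenced data set:

#Hypothetical sketch: grouping driver behavior with k-means
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

#Fake driver data: distance driven and share of time spent speeding
drivers = pd.DataFrame({
    'distance': np.concatenate([rng.normal(50, 10, 100), rng.normal(180, 20, 100)]),
    'speeding': np.concatenate([rng.normal(8, 3, 100), rng.normal(20, 6, 100)]),
})

#Ask k-means for four clusters, like the reference analysis found
kmeans = KMeans(n_clusters=4, n_init=20, random_state=0)
drivers['cluster'] = kmeans.fit_predict(drivers[['distance', 'speeding']])

#Each cluster's average behavior gives a quick profile of the group
print(drivers.groupby('cluster')[['distance', 'speeding']].mean())
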
With K-means, we can now run additional analysis on the drivers’ data set above to produce predictive insights that help businesses categorize drivers by distance traveled and speeding patterns. In our case, we’ll apply K-means to our own fictitious data set, which saves us the extra feature-engineering steps that real-life data would require.

K-Means

Before we begin constructing our data, let’s take some time to go over what K-means actually is. K-means is an algorithm that looks for a certain number of clusters within an unlabeled data set [2]. Take note of the word unlabeled: K-means is an unsupervised learning model. This is super helpful when you get data but don’t really know how to label it. K-means can help out by labeling the groups for you, which is pretty cool!

Applying Nonlinear Feature Engineering

For our data, we’ll use the make_circles data set from sklearn [3]. Alright, let’s get to our hands-on example:

#Load up our packages
import pandas as pd
import numpy as np
import sklearn.metrics
import scipy
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from scipy.spatial import Voronoi, voronoi_plot_2d
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
%matplotlib notebook

Our next step is to create a K-means class. For those of you unfamiliar with classes (not a subject you take in school), think of a class in coding as a super function that has a lot of functions inside it. Now, I know there’s already a k-means clustering algorithm in sklearn, but I really like this class made by Alice Zheng because of its detailed comments and the visualization that we’ll soon see [4]:

class KMeansFeaturizer:
    """Transforms numeric data into k-means cluster memberships.
    
    This transformer runs k-means on the input data and converts each data point
    into the id of the closest cluster. If a target variable is present, it is 
    scaled and included as input to k-means in order to derive clusters that
    obey the classification boundary as well as group similar points together.

    Parameters
    ----------
    k: integer, optional, default 100
        The number of clusters to group data into.

    target_scale: float, [0, infty], optional, default 5.0
        The scaling factor for the target variable. Set this to zero to ignore
        the target. For classification problems, larger `target_scale` values 
        will produce clusters that better respect the class boundary.

    random_state : integer or numpy.RandomState, optional
        This is passed to k-means as the generator used to initialize the 
        kmeans centers. If an integer is given, it fixes the seed. Defaults to 
        the global numpy random number generator.

    Attributes
    ----------
    cluster_centers_ : array, [k, n_features]
        Coordinates of cluster centers in the input space. n_features does not
        count the target column.
    """

    def __init__(self, k=100, target_scale=5.0, random_state=None):
        self.k = k
        self.target_scale = target_scale
        self.random_state = random_state
        self.cluster_encoder = OneHotEncoder().fit(np.array(range(k)).reshape(-1,1))
        
    def fit(self, X, y=None):
        """Runs k-means on the input data and find centroids.        If no target is given (`y` is None) then run vanilla k-means on input
        `X`.         If target `y` is given, then include the target (weighted by 
        `target_scale`) as an extra dimension for k-means clustering. In this 
        case, run k-means twice, first with the target, then an extra iteration
        without.        After fitting, the attribute `cluster_centers_` are set to the k-means
        centroids in the input space represented by `X`.        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_data_points, n_features)        y : vector of length n_data_points, optional, default None
            If provided, will be weighted with `target_scale` and included in 
            k-means clustering as hint.
        """
        if y is None:
            # No target variable, just do plain k-means
            km_model = KMeans(n_clusters=self.k, 
                              n_init=20, 
                              random_state=self.random_state)
            km_model.fit(X)
            self.km_model = km_model
            self.cluster_centers_ = km_model.cluster_centers_
            return self

        # There is target information. Apply appropriate scaling and include
        # it in the input data to k-means.
        data_with_target = np.hstack((X, y[:, np.newaxis] * self.target_scale))

        # Build a pre-training k-means model on data and target
        km_model_pretrain = KMeans(n_clusters=self.k, 
                                   n_init=20, 
                                   random_state=self.random_state)
        km_model_pretrain.fit(data_with_target)

        # Run k-means a second time to get the clusters in the original space
        # without target info. Initialize using centroids found in pre-training.
        # Go through a single iteration of cluster assignment and centroid 
        # recomputation.
        km_model = KMeans(n_clusters=self.k, 
                          init=km_model_pretrain.cluster_centers_[:, :X.shape[1]],  # drop the appended target column
                          n_init=1, 
                          max_iter=1)
        km_model.fit(X)
        
        self.km_model = km_model
        self.cluster_centers_ = km_model.cluster_centers_
        return self
        
    def transform(self, X, y=None):
        """Outputs the closest cluster id for each input data point.        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_data_points, n_features)        y : vector of length n_data_points, optional, default None
            Target vector is ignored even if provided.        Returns
        -------
        cluster_ids : array, shape[n_data_points,1]
        """
        clusters = self.km_model.predict(X)
        return self.cluster_encoder.transform(clusters.reshape(-1,1))
    
    def fit_transform(self, X, y=None):
        """Runs fit followed by transform.
        """
        self.fit(X, y)
        return self.transform(X, y)

Don’t let that huge amount of text bother you; I just put it there in case you want to experiment with it in your own projects. Next, we’ll create our training set, fit two featurizers (one with a target hint, one without), and set the seed to 420 so you get the same results:

# Creating our training set and fitting the k-means featurizers
seed = 420

training_data, training_labels = make_circles(n_samples=2000, factor=0.2, noise=0.1, random_state=seed)

kmf_hint = KMeansFeaturizer(k=100, target_scale=10, random_state=seed).fit(training_data, training_labels)
kmf_no_hint = KMeansFeaturizer(k=100, target_scale=0, random_state=seed).fit(training_data, training_labels)

def kmeans_voronoi_plot(X, y, cluster_centers, ax):
    #Plots Voronoi diagram of k-means clusters overlaid with data
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Set1', alpha=0.2)
    vor = Voronoi(cluster_centers)
    voronoi_plot_2d(vor, ax=ax, show_vertices=False, line_alpha=0.5)

Now, let’s look at our unlabeled nonlinear data:

#looking at circles data
df = pd.DataFrame(training_data)
ax = sns.scatterplot(x=0, y=1, data=df)

Just like the driver data set from the intro, our circle within a circle is definitely not a linear data set. Next, we’ll apply K-means and compare the visual results of giving it a hint (the target labels) versus giving it no hint:

#With hint
fig = plt.figure()
ax = plt.subplot(211, aspect='equal')
kmeans_voronoi_plot(training_data, training_labels, kmf_hint.cluster_centers_, ax)
ax.set_title('K-Means with Target Hint')

#Without hint
ax2 = plt.subplot(212, aspect='equal')
kmeans_voronoi_plot(training_data, training_labels, kmf_no_hint.cluster_centers_, ax2)
ax2.set_title('K-Means without Target Hint')

I find that the hint and no-hint results are fairly close. If you want more automation, you might want to skip the hint. But if you can spend some time looking at your data set to give it a hint, I would. One reason is that it can save you time running the model, since k-means spends less effort figuring things out on its own. Another reason to give k-means a hint is that you have domain expertise in your data set and already know how it should split into groups.

Model Stacking for Classification

Time for the fun part: making the stacked model. Some of you might be asking, what’s the difference between a stacked model and an ensemble model? An ensemble model combines multiple machine learning models to make another model [5]. So, not much. I think model stacking is more precise here, since k-means is feeding into logistic regression. If we could draw a Venn diagram, we would find stacked models inside the concept of ensemble models. I couldn’t find a good example on Google images, so I applied the magic of MS paint to present a rough illustration for your viewing pleasure:

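Before we get back to the article’s own code: if you just want to prototype this kind of stack without writing a custom class, a closely related variant drops straight into a scikit-learn Pipeline. It is not exactly what KMeansFeaturizer does (no target hint, and the k-means step feeds logistic regression the distances to each centroid rather than one-hot cluster ids appended to the raw features), but it captures the same idea of an unsupervised model feeding a supervised one. The names here (stacked_lr, etc.) are just for this sketch:

#Minimal stacking sketch: k-means features feeding logistic regression via a Pipeline
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_circles(n_samples=2000, factor=0.2, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#Inside a Pipeline, KMeans acts as a transformer: transform() returns each
#point's distance to all k centroids, which logistic regression can then
#separate with a linear boundary.
stacked_lr = make_pipeline(
    KMeans(n_clusters=100, n_init=10, random_state=0),
    LogisticRegression(max_iter=1000),
)
stacked_lr.fit(X_train, y_train)
print('held-out accuracy:', stacked_lr.score(X_test, y_test))
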
Ok, sketches and art class over; back to our own code. We’re going to plot ROC curves for kNN, logistic regression (LR), and k-means feeding into logistic regression.

#Generate test data from the same distribution as the training data
test_data, test_labels = make_circles(n_samples=2000, factor=0.2, noise=0.1, random_state=seed+5)

training_cluster_features = kmf_hint.transform(training_data)
test_cluster_features = kmf_hint.transform(test_data)

training_with_cluster = scipy.sparse.hstack((training_data, training_cluster_features))
test_with_cluster = scipy.sparse.hstack((test_data, test_cluster_features))

#Run the models
lr_cluster = LogisticRegression(random_state=seed).fit(training_with_cluster, training_labels)

classifier_names = ['LR',
                    'kNN']
classifiers = [LogisticRegression(random_state=seed),
               KNeighborsClassifier(5)]
for model in classifiers:
    model.fit(training_data, training_labels)   
    
#Plot the ROC
def test_roc(model, data, labels):
    if hasattr(model, "decision_function"):
        predictions = model.decision_function(data)
    else:
        predictions = model.predict_proba(data)[:,1]
    fpr, tpr, _ = sklearn.metrics.roc_curve(labels, predictions)
    return fpr, tpr

plt.figure()
fpr_cluster, tpr_cluster = test_roc(lr_cluster, test_with_cluster, test_labels)
plt.plot(fpr_cluster, tpr_cluster, 'r-', label='LR with k-means')

for i, model in enumerate(classifiers):
    fpr, tpr = test_roc(model, test_data, test_labels)
    plt.plot(fpr, tpr, label=classifier_names[i])
    
plt.plot([0, 1], [0, 1], 'k--')
plt.legend()
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
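
If you’d rather compare the models with a single number instead of eyeballing the curves, the area under the ROC curve (AUC) does the job; the closer to 1.0, the better. A quick sketch, assuming the models fitted above are still in your session:

#Summarize each ROC curve with its AUC
from sklearn.metrics import roc_auc_score

print('LR with k-means AUC:', roc_auc_score(test_labels, lr_cluster.decision_function(test_with_cluster)))
for name, model in zip(classifier_names, classifiers):
    if hasattr(model, 'decision_function'):
        scores = model.decision_function(test_data)
    else:
        scores = model.predict_proba(test_data)[:, 1]
    print(name, 'AUC:', roc_auc_score(test_labels, scores))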

Alright, the first time I saw a ROC curve, I wondered how to read the thing. What you want is the model whose curve shoots up toward the top left corner the fastest. In this case, our most accurate model is the stacked model, logistic regression with k-means. The classification task our models worked on is deciding whether each data point belongs to the big circle or the small circle.

Conclusion

Phew, we covered quite a few things here. First, we looked at nonlinear data and examples we might face in the real world. Second, we looked at k-means as a tool to discover features in our data that were not there before. Next, we applied k-means to our own data set. Lastly, we stacked k-means into logistic regression to make a superior model. Pretty cool stuff overall. Some things to note: we didn’t tune the models, which would change their performance, nor did we compare very many models. But combining unsupervised learning into your supervised models can prove pretty useful and help you deliver insights you couldn’t get otherwise!

Disclaimer: All things stated in this article are of my own opinion and not of any employer. Also sprinkled affiliate links.

[1] A. Trevino, Introduction to K-means Clustering (2016), https://www.datascience.com/blog/k-means-clustering

[2] J. VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data (2016), https://amzn.to/2SMdZue

[3] Scikit-learn Developers, sklearn.datasets.make_circles (2019), https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles

[4] A. Zheng et al., Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists (2018), https://amzn.to/2SOFh3q

[5] F. Gunes, Why do stacked ensemble models win data science competitions? (2017), https://blogs.sas.com/content/subconsciousmusings/2017/05/18/stacked-ensemble-models-win-data-science-competitions/
