A Feature Selection Tool for Machine Learning in Python


Using the FeatureSelector for efficient machine learning workflows

Feature selection, the process of finding and selecting the most useful features in a dataset, is a crucial step of the machine learning pipeline. Unnecessary features decrease training speed, decrease model interpretability, and, most importantly, decrease generalization performance on the test set.

Frustrated by the ad-hoc feature selection methods I found myself applying over and over again for machine learning problems, I built a class for feature selection in Python available on GitHub. The FeatureSelector includes some of the most common feature selection methods:

  1. Features with a high percentage of missing values
  2. Collinear (highly correlated) features
  3. Features with zero importance in a tree-based model
  4. Features with low importance
  5. Features with a single unique value

In this article we will walk through using the FeatureSelector on an example machine learning dataset. We’ll see how it allows us to rapidly implement these methods for a more efficient workflow.

The complete code is available on GitHub and I encourage any contributions. The Feature Selector is a work in progress and will continue to improve based on community needs!

Example Dataset

For this example, we will use a sample of data from the Home Credit Default Risk machine learning competition on Kaggle. (To get started with the competition, see this article). The entire dataset is available for download and here we will use a sample for illustration purposes.

The competition is a supervised classification problem and this is a good dataset to use because it has many missing values, numerous highly correlated (collinear) features, and a number of irrelevant features that do not help a machine learning model.

Creating an Instance

To create an instance of the FeatureSelector class, we need to pass in a structured dataset with observations in the rows and features in the columns. We can use some of the methods with only features, but the importance-based methods also require training labels. Since we have a supervised classification task, we will use a set of features and a set of labels.

(Make sure to run this in the same directory as feature_selector.py)

from feature_selector import FeatureSelector

# Features are in train and labels are in train_labels
fs = FeatureSelector(data = train, labels = train_labels)
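
As a point of reference, preparing train and train_labels might look like the minimal sketch below, assuming the sample data lives in a single CSV with the competition’s TARGET column as the label (the file name is a placeholder):

import pandas as pd

# Load the sample data (hypothetical file name - substitute your own copy)
data = pd.read_csv('credit_sample.csv')

# Separate the label column from the features
train_labels = data['TARGET']
train = data.drop(columns = ['TARGET'])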

Methods

The feature selector has five methods for finding features to remove. We can access any of the identified features and remove them from the data manually, or use the remove function in the Feature Selector.

Here we will go through each of the identification methods and also show how all 5 can be run at once. The FeatureSelector additionally has several plotting capabilities because visually inspecting data is a crucial component of machine learning.

Missing Values

The first method for finding features to remove is straightforward: find features with a fraction of missing values above a specified threshold. The call below identifies features with more than 60% missing values (bold is output).

fs.identify_missing(missing_threshold = 0.6)

17 features with greater than 0.60 missing values.

We can see the fraction of missing values in every column in a dataframe:

fs.missing_stats.head()
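
For intuition, this statistic is just the per-column fraction of missing values, which we could reproduce with plain pandas (a sketch of the idea, not necessarily the library’s exact code):

# Fraction of missing values in each column, sorted from most to least missing
missing_fraction = train.isnull().mean().sort_values(ascending = False)

# Columns above the 60% threshold
to_drop = missing_fraction[missing_fraction > 0.6].index.tolist()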

To see the features identified for removal, we access the ops attribute of the FeatureSelector, a Python dictionary with lists of features as the values.

missing_features = fs.ops['missing']
missing_features[:5]

['OWN_CAR_AGE',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'FLOORSMIN_AVG',
 'LIVINGAPARTMENTS_AVG']

Finally, we can plot the distribution of missing values across all features:

fs.plot_missing()

Collinear Features

Collinear features are features that are highly correlated with one another. In machine learning, these lead to decreased generalization performance on the test set due to high variance and less model interpretability.

The identify_collinear method finds collinear features based on a specified correlation coefficient value. For each pair of correlated features, it identifies one of the features for removal (since we only need to remove one):

fs.identify_collinear(correlation_threshold = 0.98)

21 features with a correlation magnitude greater than 0.98.
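
For intuition, one common way to implement this kind of selection (a sketch of the concept, not necessarily the library’s exact logic) is to scan the upper triangle of the correlation matrix and flag one feature from each pair that exceeds the threshold:

import numpy as np

# Absolute correlations between the numeric features
corr_matrix = train.select_dtypes('number').corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))

# Flag one feature from every correlated pair above the threshold
to_drop = [col for col in upper.columns if any(upper[col] > 0.98)]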

A neat visualization we can make with correlations is a heatmap. This shows all the features that have at least one correlation above the threshold:

fs.plot_collinear()

As before, we can access the entire list of correlated features that will be removed, or see the highly correlated pairs of features in a dataframe.

# list of collinear features to remove
collinear_features = fs.ops['collinear']

# dataframe of collinear features
fs.record_collinear.head()

If we want to investigate our dataset, we can also make a plot of all the correlations in the data by passing in plot_all = True to the call:
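
fs.plot_collinear(plot_all = True)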

Zero Importance Features

The previous two methods can be applied to any structured dataset and are deterministic: the results will be the same every time for a given threshold. The next method is designed only for supervised machine learning problems where we have labels for training a model, and it is non-deterministic. The identify_zero_importance function finds features that have zero importance according to a gradient boosting machine (GBM) model.

With tree-based machine learning models, such as a boosting ensemble, we can find feature importances. The absolute value of the importance is not as important as the relative values, which we can use to determine the most relevant features for a task. We can also use feature importances for feature selection by removing zero importance features. In a tree-based model, the features with zero importance are not used to split any nodes, and so we can remove them without affecting model performance.

The FeatureSelector finds feature importances using the gradient boosting machine from the LightGBM library. The feature importances are averaged over 10 training runs of the GBM in order to reduce variance. Also, the model is trained using early stopping with a validation set (there is an option to turn this off) to prevent overfitting to the training data.
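
As a rough sketch of the underlying idea (not the library’s exact implementation, and the hyperparameters here are illustrative), averaging LightGBM importances over several runs with a validation set for early stopping might look like this:

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# One-hot encode the features (the selector does this internally)
features = pd.get_dummies(train)
feature_importances = np.zeros(features.shape[1])

n_runs = 10
for i in range(n_runs):
    # Hold out a validation set for early stopping
    X_train, X_valid, y_train, y_valid = train_test_split(
        features, train_labels, test_size = 0.25, random_state = i)

    model = lgb.LGBMClassifier(n_estimators = 1000, learning_rate = 0.05)
    model.fit(X_train, y_train, eval_set = [(X_valid, y_valid)],
              eval_metric = 'auc', callbacks = [lgb.early_stopping(100)])

    # Accumulate the importances across runs
    feature_importances += model.feature_importances_ / n_runs

# Features the GBM never used for a split
zero_importance = features.columns[feature_importances == 0].tolist()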

The code below calls the method and extracts the zero importance features:

# Pass in the appropriate parameters
fs.identify_zero_importance(task = 'classification',
                            eval_metric = 'auc',
                            n_iterations = 10,
                            early_stopping = True)

# list of zero importance features
zero_importance_features = fs.ops['zero_importance']

63 features with zero importance after one-hot encoding.

The parameters we pass in are as follows:

  • task : either “classification” or “regression” corresponding to our problem
  • eval_metric: metric to use for early stopping (not necessary if early stopping is disabled)
  • n_iterations : number of training runs to average the feature importances over
  • early_stopping: whether or not use early stopping for training the model

This time we get two plots with plot_feature_importances:

# plot the feature importances
fs.plot_feature_importances(threshold = 0.99, plot_n = 12)

124 features required for 0.99 of cumulative importance

On the left we have the plot_n most important features (plotted in terms of normalized importance where the total sums to 1). On the right we have the cumulative importance versus the number of features. The vertical line is drawn at the threshold of cumulative importance, in this case 99%.

There are two notes to keep in mind for the feature importance-based methods:

  • Training the gradient boosting machine is stochastic, meaning the feature importances will change every time the model is run

This should not have a major impact (the most important features will not suddenly become the least) but it will change the ordering of some of the features. It also can affect the number of zero importance features identified. Don’t be surprised if the feature importances change every time!

  • To train the model, the features are first one-hot encoded. This means some of the zero importance features may be one-hot encoded features added during modeling

When we get to the feature removal stage, there is an option to remove any added one-hot encoded features. However, if we are doing machine learning after feature selection, we will have to one-hot encode the features anyway!

Low Importance Features

The next method builds on the zero importance function, using the feature importances from the model for further selection. The identify_low_importance function finds the lowest importance features that do not contribute to a specified fraction of the total importance.

For example, the call below finds the least important features that are not required for achieving 99% of the total importance:

fs.identify_low_importance(cumulative_importance = 0.99)

123 features required for cumulative importance of 0.99 after one hot encoding.
116 features do not contribute to cumulative importance of 0.99.

Based on the plot of cumulative importance and this information, the gradient boosting machine considers many of the features to be irrelevant for learning. Again, the results of this method will change on each training run.

To view all the feature importances in a dataframe:

fs.feature_importances.head(10)

The low_importance method borrows from one of the approaches to Principal Components Analysis (PCA), where it is common to keep only the principal components needed to retain a certain percentage of the variance (such as 95%). The percentage of total importance accounted for is based on the same idea.
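
The cumulative cutoff itself is simple to express; here is a minimal sketch with made-up importance values to show how the count of required features falls out:

import numpy as np

# Example normalized importances (stand-ins for the averaged GBM importances)
importances = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.05, 0.04, 0.02, 0.01, 0.00])

# Accumulate the sorted, normalized importances
cumulative = np.cumsum(np.sort(importances)[::-1] / importances.sum())

# Features needed to reach 99% of total importance; the rest are "low importance"
n_required = int(np.argmax(cumulative >= 0.99) + 1)
n_low_importance = len(importances) - n_required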

The feature importance-based methods are really only applicable if we are going to use a tree-based model for making predictions. Besides being stochastic, the importance-based methods are a black-box approach in that we don’t really know why the model considers the features to be irrelevant. If using these methods, run them several times to see how the results change, and perhaps create multiple datasets with different parameters to test!
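
One simple way to gauge that variation is to repeat the identification a few times and compare the number of zero importance features found on each run:

# Run the zero importance identification several times to see how much it varies
n_zero = []
for i in range(3):
    fs.identify_zero_importance(task = 'classification', eval_metric = 'auc',
                                n_iterations = 10, early_stopping = True)
    n_zero.append(len(fs.ops['zero_importance']))

# The counts will typically differ from run to run
print(n_zero)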

Single Unique Value Features

The final method is fairly basic: find any columns that have a single unique value. A feature with only one unique value cannot be useful for machine learning because this feature has zero variance. For example, a tree-based model can never make a split on a feature with only one value (since there are no groups to divide the observations into).

There are no parameters here to select, unlike the other methods:

fs.identify_single_unique()

4 features with a single unique value.

We can plot a histogram of the number of unique values in each feature:

fs.plot_unique()

One point to remember is that, by default, NaNs are dropped before calculating unique values in Pandas.
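
A quick illustration of that pandas default (plain pandas, independent of the FeatureSelector):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, np.nan])

print(s.nunique())                 # 1 -> the NaN is ignored by default
print(s.nunique(dropna = False))   # 2 -> the NaN counts as a distinct value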

Removing Features

Once we’ve identified the features to discard, we have two options for removing them. All of the features to remove are stored in the ops dict of the FeatureSelector, and we can use these lists to remove features manually. Another option is to use the built-in remove function.

To use remove, we pass in the identification methods whose results we want to apply. If we want to use all the methods implemented, we just pass in methods = 'all'.

# Remove the features from all methods (returns a df)
train_removed = fs.remove(methods = 'all')

['missing', 'single_unique', 'collinear', 'zero_importance', 'low_importance'] methods have been run

Removed 140 features.
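
If we only want to apply a subset of the identifications, passing a list of method names should work in the same way (an assumption worth verifying against the repository):

# Remove only the features flagged by the deterministic methods (assumed call signature)
train_removed_partial = fs.remove(methods = ['missing', 'single_unique', 'collinear'])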

This method returns a dataframe with the features removed. To also remove the one-hot encoded features that are created during machine learning:

train_removed_all = fs.remove(methods = 'all', keep_one_hot=False)

Removed 187 features including one-hot features.

It might be a good idea to check the features that will be removed before going ahead with the operation! The original dataset is stored in the data attribute of the FeatureSelector as a backup.
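
Since every method stores its results in the ops dictionary, a quick summary of what would be removed is easy to put together before committing:

# Summarize how many features each method has flagged for removal
for method, features in fs.ops.items():
    print(f'{method}: {len(features)} features identified')

# Union of all features identified across methods
all_identified = set()
for features in fs.ops.values():
    all_identified.update(features)
print(f'{len(all_identified)} total unique features identified for removal')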

Running all Methods at Once

Rather than using the methods individually, we can use all of them with identify_all. This takes a dictionary of the parameters for each method:

fs.identify_all(selection_params = {'missing_threshold': 0.6,    
                                    'correlation_threshold': 0.98, 
                                    'task': 'classification',    
                                    'eval_metric': 'auc', 
                                    'cumulative_importance': 0.99})

151 total features out of 255 identified for removal after one-hot encoding.

Notice that the total number of features identified will change because we re-ran the model. The remove function can then be called to discard these features.

Conclusions

The Feature Selector class implements several common operations for removing features before training a machine learning model. It offers functions for identifying features for removal as well as visualizations. Methods can be run individually or all at once for efficient workflows.

The missing, collinear, and single_unique methods are deterministic while the feature importance-based methods will change with each run. Feature selection, much like the field of machine learning, is largely empirical and requires testing multiple combinations to find the optimal answer. It’s best practice to try several configurations in a pipeline, and the Feature Selector offers a way to rapidly evaluate parameters for feature selection.
