A Feature Selection Tool for Machine Learning in Python

A Feature Selection Tool for Machine Learning in Python

A Feature Selection Tool for Machine Learning in Python. Features with a high percentage of missing values. Collinear (highly correlated) features. Features with zero importance in a tree-based model. Features with low importance. Features with a single unique value.

A Feature Selection Tool for Machine Learning in Python. Features with a high percentage of missing values. Collinear (highly correlated) features. Features with zero importance in a tree-based model. Features with low importance. Features with a single unique value.

Using the FeatureSelector for efficient machine learning workflows

Feature selection, the process of finding and selecting the most useful features in a dataset, is a crucial step of the machine learning pipeline. Unnecessary features decrease training speed, decrease model interpretability, and, most importantly, decrease generalization performance on the test set.

Frustrated by the ad-hoc feature selection methods I found myself applying over and over again for machine learning problems, I built a class for feature selection in Python available on GitHub. The FeatureSelector includes some of the most common feature selection methods:

  1. Features with a high percentage of missing values
  2. Collinear (highly correlated) features
  3. Features with zero importance in a tree-based model
  4. Features with low importance
  5. Features with a single unique value

In this article we will walk through using the FeatureSelector on an example machine learning dataset. We’ll see how it allows us to rapidly implement these methods, allowing for a more efficient workflow.

The complete code is available on GitHub and I encourage any contributions. The Feature Selector is a work in progress and will continue to improve based on the community needs!

Example Dataset

For this example, we will use a sample of data from the Home Credit Default Risk machine learning competition on Kaggle. (To get started with the competition, see this article). The entire dataset is available for download and here we will use a sample for illustration purposes.

The competition is a supervised classification problem and this is a good dataset to use because it has many missing values, numerous highly correlated (collinear) features, and a number of irrelevant features that do not help a machine learning model.

Creating an Instance

To create an instance of the FeatureSelector class, we need to pass in a structured dataset with observations in the rows and features in the columns. We can use some of the methods with only features, but the importance-based methods also require training labels. Since we have a supervised classification task, we will use a set of features and a set of labels.

(Make sure to run this in the same directory as feature_selector.py )

from feature_selector import FeatureSelector

# Features are in train and labels are in train_labels
fs = FeatureSelector(data = train, labels = train_labels)

Methods

The feature selector has five methods for finding features to remove. We can access any of the identified features and remove them from the data manually, or use the remove function in the Feature Selector.

Here we will go through each of the identification methods and also show how all 5 can be run at once. The FeatureSelector additionally has several plotting capabilities because visually inspecting data is a crucial component of machine learning.

Missing Values

The first method for finding features to remove is straightforward: find features with a fraction of missing values above a specified threshold. The call below identifies features with more than 60% missing values (bold is output).

fs.identify_missing(missing_threshold = 0.6)

17 features with greater than 0.60 missing values.

We can see the fraction of missing values in every column in a dataframe:

fs.missing_stats.head()

To see the features identified for removal, we access the ops attribute of the FeatureSelector , a Python dict with features as lists in the values.

missing_features = fs.ops['missing']
missing_features[:5]

['OWN_CAR_AGE',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'FLOORSMIN_AVG',
 'LIVINGAPARTMENTS_AVG']

Finally, we have a plot of the distribution of missing values in all feature:

fs.plot_missing()

Collinear Features

Collinear features are features that are highly correlated with one another. In machine learning, these lead to decreased generalization performance on the test set due to high variance and less model interpretability.

The identify_collinear method finds collinear features based on a specified correlation coefficient value. For each pair of correlated features, it identifies one of the features for removal (since we only need to remove one):

fs.identify_collinear(correlation_threshold = 0.98)

21 features with a correlation magnitude greater than 0.98.

A neat visualization we can make with correlations is a heatmap. This shows all the features that have at least one correlation above the threshold:

fs.plot_collinear()

As before, we can access the entire list of correlated features that will be removed, or see the highly correlated pairs of features in a dataframe.

# list of collinear features to remove
collinear_features = fs.ops['collinear']

# dataframe of collinear features
fs.record_collinear.head()

If we want to investigate our dataset, we can also make a plot of all the correlations in the data by passing in plot_all = True to the call:

Zero Importance Features

The previous two methods can be applied to any structured dataset and are **deterministic **— the results will be the same every time for a given threshold. The next method is designed only for supervised machine learning problems where we have labels for training a model and is non-deterministic. The identify_zero_importance function finds features that have zero importance according to a gradient boosting machine (GBM) learning model.

With tree-based machine learning models, such as a boosting ensemble, we can find feature importances. The absolute value of the importance is not as important as the relative values, which we can use to determine the most relevant features for a task. We can also use feature importances for feature selection by removing zero importance features. In a tree-based model, the features with zero importance are not used to split any nodes, and so we can remove them without affecting model performance.

The FeatureSelector finds feature importances using the gradient boosting machine from the LightGBM library. The feature importances are averaged over 10 training runs of the GBM in order to reduce variance. Also, the model is trained using early stopping with a validation set (there is an option to turn this off) to prevent overfitting to the training data.

The code below calls the method and extracts the zero importance features:

# Pass in the appropriate parameters
fs.identify_zero_importance(task = 'classification', 
                            eval_metric = 'auc', 
                            n_iterations = 10, 
                             early_stopping = True)

# list of zero importance features
zero_importance_features = fs.ops['zero_importance']

63 features with zero importance after one-hot encoding.

The parameters we pass in are as follows:

  • task : either “classification” or “regression” corresponding to our problem
  • eval_metric: metric to use for early stopping (not necessary if early stopping is disabled)
  • n_iterations : number of training runs to average the feature importances over
  • early_stopping: whether or not use early stopping for training the model

This time we get two plots with plot_feature_importances:

# plot the feature importances
fs.plot_feature_importances(threshold = 0.99, plot_n = 12)

124 features required for 0.99 of cumulative importance

On the left we have the plot_n most important features (plotted in terms of normalized importance where the total sums to 1). On the right we have the cumulative importance versus the number of features. The vertical line is drawn at threshold of the cumulative importance, in this case 99%.

Two notes are good to remember for the importance-based methods:

  • task : either “classification” or “regression” corresponding to our problem
  • eval_metric: metric to use for early stopping (not necessary if early stopping is disabled)
  • n_iterations : number of training runs to average the feature importances over
  • early_stopping: whether or not use early stopping for training the model

This should not have a major impact (the most important features will not suddenly become the least) but it will change the ordering of some of the features. It also can affect the number of zero importance features identified. Don’t be surprised if the feature importances change every time!

  • task : either “classification” or “regression” corresponding to our problem
  • eval_metric: metric to use for early stopping (not necessary if early stopping is disabled)
  • n_iterations : number of training runs to average the feature importances over
  • early_stopping: whether or not use early stopping for training the model

When we get to the feature removal stage, there is an option to remove any added one-hot encoded features. However, if we are doing machine learning after feature selection, we will have to one-hot encode the features anyway!

Low Importance Features

The next method builds on zero importance function, using the feature importances from the model for further selection. The function identify_low_importance finds the lowest importance features that do not contribute to a specified total importance.

For example, the call below finds the least important features that are not required for achieving 99% of the total importance:

fs.identify_low_importance(cumulative_importance = 0.99)

123 features required for cumulative importance of 0.99 after one hot encoding.
116 features do not contribute to cumulative importance of 0.99.

Based on the plot of cumulative importance and this information, the gradient boosting machine considers many of the features to be irrelevant for learning. Again, the results of this method will change on each training run.

To view all the feature importances in a dataframe:

fs.feature_importances.head(10)

The low_importance method borrows from one of the methods of using Principal Components Analysis (PCA) where it is common to keep only the PC needed to retain a certain percentage of the variance (such as 95%). The percentage of total importance accounted for is based on the same idea.

The feature importance based methods are really only applicable if we are going to use a tree-based model for making predictions. Besides being stochastic, the importance-based methods are a black-box approach in that we don’t really know why the model considers the features to be irrelevant. If using these methods, run them several times to see how the results change, and perhaps create multiple datasets with different parameters to test!

Single Unique Value Features

The final method is fairly basic: find any columns that have a single unique value. A feature with only one unique value cannot be useful for machine learning because this feature has zero variance. For example, a tree-based model can never make a split on a feature with only one value (since there are no groups to divide the observations into).

There are no parameters here to select, unlike the other methods:

fs.identify_single_unique()

4 features with a single unique value.

We can plot a histogram of the number of unique values in each category:

fs.plot_unique()

One point to remember is NaNs are dropped before calculating unique values in Pandas by default.

Removing Features

Once we’ve identified the features to discard, we have two options for removing them. All of the features to remove are stored in the ops dict of the FeatureSelector and we can use the lists to remove features manually. Another option is to use the remove built-in function.

For this method, we pass in the methods to use to remove features. If we want to use all the methods implemented, we just pass in methods = 'all'.

# Remove the features from all methods (returns a df)
train_removed = fs.remove(methods = 'all')

['missing', 'single_unique', 'collinear', 'zero_importance', 'low_importance'] methods have been run

Removed 140 features.

This method returns a dataframe with the features removed. To also remove the one-hot encoded features that are created during machine learning:

train_removed_all = fs.remove(methods = 'all', keep_one_hot=False)

Removed 187 features including one-hot features.

It might be a good idea to check the features that will be removed before going ahead with the operation! The original dataset is stored in the data attribute of the FeatureSelector as a back-up!

Running all Methods at Once

Rather than using the methods individually, we can use all of them with identify_all. This takes a dictionary of the parameters for each method:

fs.identify_all(selection_params = {'missing_threshold': 0.6,    
                                    'correlation_threshold': 0.98, 
                                    'task': 'classification',    
                                    'eval_metric': 'auc', 
                                    'cumulative_importance': 0.99})

151 total features out of 255 identified for removal after one-hot encoding.

Notice that the number of total features will change because we re-ran the model. The remove function can then be called to discard these features.

Conclusions

The Feature Selector class implements several common operations for removing features before training a machine learning model. It offers functions for identifying features for removal as well as visualizations. Methods can be run individually or all at once for efficient workflows.

The missing, collinear, and single_unique methods are deterministic while the feature importance-based methods will change with each run. Feature selection, much like the field of machine learning, is largely empirical and requires testing multiple combinations to find the optimal answer. It’s best practice to try several configurations in a pipeline, and the Feature Selector offers a way to rapidly evaluate parameters for feature selection.

Originally published by Will Koehrsen* at *https://towardsdatascience.com

Learn More

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python for Data Science and Machine Learning Bootcamp

Data Science, Deep Learning, & Machine Learning with Python

Deep Learning A-Z™: Hands-On Artificial Neural Networks

Artificial Intelligence A-Z™: Learn How To Build An AI

Complete Python Bootcamp: Go from zero to hero in Python 3

Complete Python Masterclass

Machine Learning, Data Science and Deep Learning with Python

Machine Learning, Data Science and Deep Learning with Python

Complete hands-on Machine Learning tutorial with Data Science, Tensorflow, Artificial Intelligence, and Neural Networks. Introducing Tensorflow, Using Tensorflow, Introducing Keras, Using Keras, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Learning Deep Learning, Machine Learning with Neural Networks, Deep Learning Tutorial with Python

Machine Learning, Data Science and Deep Learning with Python

Complete hands-on Machine Learning tutorial with Data Science, Tensorflow, Artificial Intelligence, and Neural Networks

Explore the full course on Udemy (special discount included in the link): http://learnstartup.net/p/BkS5nEmZg

In less than 3 hours, you can understand the theory behind modern artificial intelligence, and apply it with several hands-on examples. This is machine learning on steroids! Find out why everyone’s so excited about it and how it really works – and what modern AI can and cannot really do.

In this course, we will cover:
• Deep Learning Pre-requistes (gradient descent, autodiff, softmax)
• The History of Artificial Neural Networks
• Deep Learning in the Tensorflow Playground
• Deep Learning Details
• Introducing Tensorflow
• Using Tensorflow
• Introducing Keras
• Using Keras to Predict Political Parties
• Convolutional Neural Networks (CNNs)
• Using CNNs for Handwriting Recognition
• Recurrent Neural Networks (RNNs)
• Using a RNN for Sentiment Analysis
• The Ethics of Deep Learning
• Learning More about Deep Learning

At the end, you will have a final challenge to create your own deep learning / machine learning system to predict whether real mammogram results are benign or malignant, using your own artificial neural network you have learned to code from scratch with Python.

Separate the reality of modern AI from the hype – by learning about deep learning, well, deeply. You will need some familiarity with Python and linear algebra to follow along, but if you have that experience, you will find that neural networks are not as complicated as they sound. And how they actually work is quite elegant!

This is hands-on tutorial with real code you can download, study, and run yourself.

Python Tutorial - Learn Python for Machine Learning and Web Development

Python Tutorial - Learn Python for Machine Learning and Web Development

Python tutorial for beginners - Learn Python for Machine Learning and Web Development. Can Python be used for machine learning? Python is widely considered as the preferred language for teaching and learning ML (Machine Learning). Can I use Python for web development? Python can be used to build server-side web applications. Why Python is suitable for machine learning? How Python is used in AI? What language is best for machine learning?

Python tutorial for beginners - Learn Python for Machine Learning and Web Development

TABLE OF CONTENT

  • 00:00:00 Introduction
  • 00:01:49 Installing Python 3
  • 00:06:10 Your First Python Program
  • 00:08:11 How Python Code Gets Executed
  • 00:11:24 How Long It Takes To Learn Python
  • 00:13:03 Variables
  • 00:18:21 Receiving Input
  • 00:22:16 Python Cheat Sheet
  • 00:22:46 Type Conversion
  • 00:29:31 Strings
  • 00:37:36 Formatted Strings
  • 00:40:50 String Methods
  • 00:48:33 Arithmetic Operations
  • 00:51:33 Operator Precedence
  • 00:55:04 Math Functions
  • 00:58:17 If Statements
  • 01:06:32 Logical Operators
  • 01:11:25 Comparison Operators
  • 01:16:17 Weight Converter Program
  • 01:20:43 While Loops
  • 01:24:07 Building a Guessing Game
  • 01:30:51 Building the Car Game
  • 01:41:48 For Loops
  • 01:47:46 Nested Loops
  • 01:55:50 Lists
  • 02:01:45 2D Lists
  • 02:05:11 My Complete Python Course
  • 02:06:00 List Methods
  • 02:13:25 Tuples
  • 02:15:34 Unpacking
  • 02:18:21 Dictionaries
  • 02:26:21 Emoji Converter
  • 02:30:31 Functions
  • 02:35:21 Parameters
  • 02:39:24 Keyword Arguments
  • 02:44:45 Return Statement
  • 02:48:55 Creating a Reusable Function
  • 02:53:42 Exceptions
  • 02:59:14 Comments
  • 03:01:46 Classes
  • 03:07:46 Constructors
  • 03:14:41 Inheritance
  • 03:19:33 Modules
  • 03:30:12 Packages
  • 03:36:22 Generating Random Values
  • 03:44:37 Working with Directories
  • 03:50:47 Pypi and Pip
  • 03:55:34 Project 1: Automation with Python
  • 04:10:22 Project 2: Machine Learning with Python
  • 04:58:37 Project 3: Building a Website with Django

Thanks for reading

If you liked this post, share it with all of your programming buddies!

Follow us on Facebook | Twitter

Further reading

Complete Python Bootcamp: Go from zero to hero in Python 3

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python and Django Full Stack Web Developer Bootcamp

Complete Python Masterclass

Python Programming Tutorial | Full Python Course for Beginners 2019 👍

Top 10 Python Frameworks for Web Development In 2019

Python for Financial Analysis and Algorithmic Trading

Building A Concurrent Web Scraper With Python and Selenium

Machine Learning Full Course - Learn Machine Learning

Machine Learning Full Course - Learn Machine Learning

This complete Machine Learning full course video covers all the topics that you need to know to become a master in the field of Machine Learning.

Machine Learning Full Course | Learn Machine Learning | Machine Learning Tutorial

It covers all the basics of Machine Learning (01:46), the different types of Machine Learning (18:32), and the various applications of Machine Learning used in different industries (04:54:48).This video will help you learn different Machine Learning algorithms in Python. Linear Regression, Logistic Regression (23:38), K Means Clustering (01:26:20), Decision Tree (02:15:15), and Support Vector Machines (03:48:31) are some of the important algorithms you will understand with a hands-on demo. Finally, you will see the essential skills required to become a Machine Learning Engineer (04:59:46) and come across a few important Machine Learning interview questions (05:09:03). Now, let's get started with Machine Learning.

Below topics are explained in this Machine Learning course for beginners:

  1. Basics of Machine Learning - 01:46

  2. Why Machine Learning - 09:18

  3. What is Machine Learning - 13:25

  4. Types of Machine Learning - 18:32

  5. Supervised Learning - 18:44

  6. Reinforcement Learning - 21:06

  7. Supervised VS Unsupervised - 22:26

  8. Linear Regression - 23:38

  9. Introduction to Machine Learning - 25:08

  10. Application of Linear Regression - 26:40

  11. Understanding Linear Regression - 27:19

  12. Regression Equation - 28:00

  13. Multiple Linear Regression - 35:57

  14. Logistic Regression - 55:45

  15. What is Logistic Regression - 56:04

  16. What is Linear Regression - 59:35

  17. Comparing Linear & Logistic Regression - 01:05:28

  18. What is K-Means Clustering - 01:26:20

  19. How does K-Means Clustering work - 01:38:00

  20. What is Decision Tree - 02:15:15

  21. How does Decision Tree work - 02:25:15 

  22. Random Forest Tutorial - 02:39:56

  23. Why Random Forest - 02:41:52

  24. What is Random Forest - 02:43:21

  25. How does Decision Tree work- 02:52:02

  26. K-Nearest Neighbors Algorithm Tutorial - 03:22:02

  27. Why KNN - 03:24:11

  28. What is KNN - 03:24:24

  29. How do we choose 'K' - 03:25:38

  30. When do we use KNN - 03:27:37

  31. Applications of Support Vector Machine - 03:48:31

  32. Why Support Vector Machine - 03:48:55

  33. What Support Vector Machine - 03:50:34

  34. Advantages of Support Vector Machine - 03:54:54

  35. What is Naive Bayes - 04:13:06

  36. Where is Naive Bayes used - 04:17:45

  37. Top 10 Application of Machine Learning - 04:54:48

  38. How to become a Machine Learning Engineer - 04:59:46

  39. Machine Learning Interview Questions - 05:09:03