Feature selection is an important task for any machine learning application, and it's especially crucial when the data in question has many features. Training on the optimal number of features also leads to improved model accuracy. The most important features, and the optimal number of features, can be identified via feature importance or feature ranking. In this piece, we'll explore feature ranking.
The first item needed for recursive feature elimination is an estimator; for example, a linear model or a decision tree model.
Linear models expose coefficients and decision tree models expose feature importances. To select the optimal number of features, the estimator is trained, the features are ranked via the coefficients or the feature importances, and the least important features are removed. This process is repeated recursively until the desired number of features is obtained.
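Conceptually, that elimination loop can be sketched in a few lines. The sketch below is illustrative only; the synthetic dataset, the decision tree estimator, and the target of keeping 4 of 8 features are arbitrary choices, and scikit-learn's RFE class (covered next) does this work for us.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
# Synthetic data: 200 samples, 8 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
remaining = list(range(X_demo.shape[1]))  # indices of the features still in play
while len(remaining) > 4:
    # Fit on the surviving features and drop the least important one.
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[:, remaining], y_demo)
    least_important = remaining[int(np.argmin(tree.feature_importances_))]
    remaining.remove(least_important)
print("Selected feature indices:", remaining)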
Scikit-learn makes it possible to implement recursive feature elimination via the sklearn.feature_selection.RFE class. The class takes the following parameters:
- estimator — a machine learning estimator that can provide feature importances via the coef_ or feature_importances_ attributes.
- n_features_to_select — the number of features to select. Half of the features are selected if it's not specified.
- step — an integer that indicates the number of features to be removed at each iteration, or a number between 0 and 1 to indicate the percentage of features to remove at each iteration.

Once fitted, the following attributes can be obtained:
- ranking_ — the ranking of the features.
- n_features_ — the number of features that have been selected.
- support_ — an array that indicates whether or not a feature was selected.

As noted earlier, we'll need to work with an estimator that offers a feature_importances_ attribute or a coef_ attribute. Let's work through a quick example. The dataset has 13 features; we'll work on getting the optimal number of features.
import pandas as pd
df = pd.read_csv('heart.csv')
df.head()
Let's obtain the X features and the y target.
X = df.drop(['target'], axis=1)
y = df['target']
We’ll split it into a testing and training set to prepare for modeling:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Let's get a couple of imports out of the way:

- Pipeline — since we'll perform some cross-validation. It's best practice in order to avoid data leakage.
- RepeatedStratifiedKFold — for repeated stratified cross-validation.
- cross_val_score — for evaluating the score on cross-validation.
- GradientBoostingClassifier — the estimator we'll use.
- numpy — so that we can compute the mean of the scores.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFE
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
The first step is to create an instance of the RFE class while specifying the estimator and the number of features you'd like to select. In this case, we're selecting 6:
rfe = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=6)
Next, we create an instance of the model we’d like to use:
model = GradientBoostingClassifier()
We'll use a Pipeline to transform the data. In the Pipeline we specify rfe for the feature selection step and the model that'll be used in the next step.

We then specify a RepeatedStratifiedKFold with 10 splits and 5 repeats. The stratified K-fold ensures that the number of samples from each class is well balanced in each fold. RepeatedStratifiedKFold repeats the stratified K-fold the specified number of times, with a different randomization in each repetition.
pipe = Pipeline([('Feature Selection', rfe), ('Model', model)])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=36851234)
n_scores = cross_val_score(pipe, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
np.mean(n_scores)
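It can also help to look at the spread of the scores, not just the mean. The line below is a small addition to the original snippet:
# Mean and standard deviation of the repeated cross-validation accuracies.
print(f"Accuracy: {np.mean(n_scores):.3f} +/- {np.std(n_scores):.3f}")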
The next step is to fit this pipeline to the dataset.
pipe.fit(X_train, y_train)
With that in place, we can check the support and the ranking. The support indicates whether or not a feature was chosen.
rfe.support_
array([ True, False, True, False, True, False, False, True, False, True, False, True, True])
We can put that into a dataframe and check the result.
pd.DataFrame(rfe.support_, index=X.columns, columns=['Selected'])
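Equivalently, the boolean mask can be used to pull out the names of the selected columns directly. This is a small addition to the walkthrough, reusing the X and rfe objects defined above:
# Names of the features RFE kept, via the boolean support mask.
selected_features = X.columns[rfe.support_]
print(list(selected_features))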
We can also check the relative rankings.
rf_df = pd.DataFrame(rfe.ranking_, index=X.columns, columns=['Rank']).sort_values(by='Rank', ascending=True)
rf_df.head()
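Finally, since we held out a test set earlier and haven't used it yet, one way to sanity-check the selected features is to score the fitted pipeline on it. This isn't part of the original walkthrough, and the exact number will depend on the train/test split:
# Evaluate the fitted pipeline (RFE + GradientBoostingClassifier) on the held-out test set.
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")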