How to Get Feature Importances from Any Sklearn Pipeline

Pipelines can be hard to navigate; here’s some code that works in general.

Introduction

Pipelines are amazing! I use them in basically every data science project I work on. But getting the feature importance out of one is far more difficult than it needs to be. In this tutorial, I’ll walk through how to access individual feature names and their coefficients from a Pipeline. After that, I’ll show a generalized solution for getting feature importance for just about any pipeline.

Pipelines

Let’s start with a super simple pipeline that applies a single featurization step followed by a classifier.

from datasets import load_dataset
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
# Load the IMDB sentiment dataset
imdb_data = load_dataset("imdb")
# Linear SVM with class weights balanced by label frequency
classifier = svm.LinearSVC(C=1.0, class_weight="balanced")
model = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("classifier", classifier),
    ]
)
# Pull the raw review text and the labels out of the training split
x_train = [x["text"] for x in imdb_data["train"]]
y_train = [x["label"] for x in imdb_data["train"]]
model.fit(x_train, y_train)

Here we use the excellent datasets Python package to quickly access the IMDB sentiment data. This package, put together by HuggingFace, has a ton of great datasets, and they are all ready to go so you can get straight to the fun part: model building.

The above pipeline defines two steps in a list. It first takes the input and passes it through a TfidfVectorizer, which takes in text and returns the TF-IDF features of the text as a vector. It then passes that vector to the linear SVM classifier.
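To make the two steps concrete, here is a minimal sketch that fits the same kind of pipeline and then reproduces its prediction by calling each step by hand. The names toy_texts, toy_labels, toy_model, and new_texts are hypothetical: a tiny made-up corpus stands in for the IMDB data so the snippet runs in a second.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm

# Hypothetical toy corpus standing in for the IMDB reviews (1 = positive)
toy_texts = [
    "a great and moving film",
    "great acting and a great story",
    "a boring and terrible film",
    "terrible acting and a boring story",
]
toy_labels = [1, 1, 0, 0]

toy_model = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("classifier", svm.LinearSVC(C=1.0, class_weight="balanced")),
    ]
)
toy_model.fit(toy_texts, toy_labels)

new_texts = ["a great story", "a boring film"]

# Step 1: look up each fitted step by the name it was given in the list
vectorizer = toy_model.named_steps["vectorizer"]
classifier = toy_model.named_steps["classifier"]

# Step 2: run the steps manually, in the same order the pipeline does
features = vectorizer.transform(new_texts)
manual_predictions = classifier.predict(features)

# The pipeline's own predict() should give the same labels
pipeline_predictions = toy_model.predict(new_texts)
print(np.array_equal(manual_predictions, pipeline_predictions))
```

Looking steps up through named_steps is also how we will reach the fitted vectorizer and classifier when it is time to pull out feature names and coefficients.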

Notice how this happens in order: first the TF-IDF step, then the classifier. You can chain as many featurization steps as you’d like. For example, the above pipeline is equivalent to:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
model = Pipeline(
    [
        ("vectorizer", CountVectorizer()),
        ("transformer", TfidfTransformer()),
        ("classifier", classifier),
    ]
)

Here we do things even more manually. First, we get counts of every word; second, we apply the TF-IDF transformation; and finally, we pass this feature vector to the classifier. The TfidfVectorizer does the first two in one step. But this illustrates the point. *In a raw pipeline, things execute in order.* We’ll discuss how to stack features together a little later. For now, let’s work on getting the feature importance for our first example model.
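As a quick sanity check on that equivalence, the sketch below compares the features produced by a TfidfVectorizer with those produced by a CountVectorizer followed by a TfidfTransformer, both with default settings. The toy_texts corpus and the variable names are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only to compare the two featurizations
toy_texts = [
    "a great and moving film",
    "great acting and a great story",
    "a boring and terrible film",
    "terrible acting and a boring story",
]

# One step: counts and TF-IDF weighting together
one_step = TfidfVectorizer()

# Two steps: counts first, then the TF-IDF transformation
two_step = Pipeline(
    [
        ("vectorizer", CountVectorizer()),
        ("transformer", TfidfTransformer()),
    ]
)

one_step_features = one_step.fit_transform(toy_texts)
two_step_features = two_step.fit_transform(toy_texts)

# Both routes should learn the same vocabulary and the same weights
same_vocabulary = (
    one_step.vocabulary_ == two_step.named_steps["vectorizer"].vocabulary_
)
same_features = np.allclose(
    one_step_features.toarray(), two_step_features.toarray()
)
print(same_vocabulary, same_features)
```

The comparison only holds when both routes use matching parameters; changing, say, the n-gram range on one side but not the other would break it.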
