If you visit the Scikit-Learn developer’s guide, you can easily find a breakdown of the objects it expects you to customize: estimators, predictors, transformers, and models. There’s also a nice guide walking you through the ins and outs of their APIs.

But if for some (potentially misguided) reason you’ve decided to implement your own subclass of the sklearn.pipeline.Pipeline class, then you’ll be stepping off the marked trail, and you’re going to need your jungle gear: plenty of coffee, the built-in function dir, the pdb module, and a pillow to scream into occasionally.

The import

When it comes to naming your subclass, you can give it a custom name if you’re hoping to use it only in new code you’re writing. If, however, you’re hoping to integrate your new class into legacy code, it’s simplest to keep the name the same. Here’s how to do that:

from sklearn.pipeline import Pipeline as SKPipeline

class Pipeline(SKPipeline):
    pass

Aliasing the sklearn Pipeline as SKPipeline allows you to use the identifier Pipeline for your new class.
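
To check that the renamed subclass really behaves as a drop-in, here is a minimal sketch. The step names, the toy data, and the choice of StandardScaler and LogisticRegression are assumptions made for illustration only; any existing call site that builds a Pipeline would do.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SKPipeline
from sklearn.preprocessing import StandardScaler


class Pipeline(SKPipeline):
    """Subclass that reuses the name Pipeline."""


# A legacy-style call site: the code is unchanged, but it now builds the subclass.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Tiny made-up dataset, just enough to fit and predict.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]]
y = [0, 0, 1, 1]
pipe.fit(X, y)

# The subclass still passes isinstance checks against the original class.
is_sklearn_pipeline = isinstance(pipe, SKPipeline)
predictions = pipe.predict(X)
```

Because the subclass does not override __init__, it inherits the constructor signature that sklearn inspects for parameters, so fitting and predicting work exactly as they do on the parent class.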

What’s up with attribute and attribute_?

Suppose you want your special Pipeline class to interact with its steps; perhaps you’re interested in writing a Pipeline.stepinfo property. To do this, you’ll need to access the attributes of its constituent objects. Many Scikit-Learn objects have attributes with a trailing underscore, such as components_. For some objects, both the underscored and the non-underscored attributes exist. For example, ColumnTransformer has both a transformers and a transformers_ attribute. Which of these attributes do you want to access, and what’s the difference?
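
For reference, the Pipeline.stepinfo property mentioned above might start out as something like the sketch below. The property is hypothetical, and what it returns (a mapping from step name to the estimator’s class name) is an assumption chosen to keep the example small. It reads only self.steps, the list of (name, estimator) pairs passed to the constructor.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SKPipeline
from sklearn.preprocessing import StandardScaler


class Pipeline(SKPipeline):
    @property
    def stepinfo(self):
        # Hypothetical helper: map each step name to its estimator's class name.
        return {name: type(est).__name__ for name, est in self.steps}


pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
info = pipe.stepinfo
print(info)
```

A richer version would reach into each step’s own attributes, which is where the underscore question becomes important.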

The trailing underscore indicates an attribute that exists in an object only after it has been “fitted”, that is, after its fit method has run. (Yes, “fitted” is how sklearn refers to this state, in error messages that I see in all my nightmares now.) This is an important distinction because fitting often involves a cloning process that creates new objects. Take ColumnTransformer for example. The objects in ColumnTransformer.transformers are actually different objects (they have a different id()) from those in ColumnTransformer.transformers_. If you’re hoping to access data that exists in a “fitted” (so weird) ColumnTransformer, then you need the underscored attribute.
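
The difference is easy to see for yourself. The sketch below assumes a ColumnTransformer wrapping a single StandardScaler over two made-up numeric columns; the variable names are illustrative only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up data: three rows, two numeric columns.
X = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])

scaler = StandardScaler()
ct = ColumnTransformer([("scale", scaler, [0, 1])])

# Before fitting, only the non-underscored attribute exists.
has_fitted_attr_before = hasattr(ct, "transformers_")

ct.fit(X)

unfitted = ct.transformers[0][1]  # the object passed to the constructor
fitted = ct.transformers_[0][1]   # the clone that was actually fitted

same_object = unfitted is fitted
unfitted_has_mean = hasattr(unfitted, "mean_")
fitted_has_mean = hasattr(fitted, "mean_")

print(has_fitted_attr_before, same_object, unfitted_has_mean, fitted_has_mean)
```

Only the object reached through transformers_ carries learned state such as mean_; the one in transformers is still the untouched scaler you passed in.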

