Including unlabelled data in sklearn pipeline
I'm setting up a machine learning pipeline to classify some data. I have lots of unlabelled data (i.e. the target variable is unknown) that I would like to make use of. One way I would like to use it is to fit the transformers in my pipeline on it. For example, for the variables I am scaling, I want StandardScaler to fit on the given training data plus the unlabelled data, and then transform only the training data.
For clarity, outside of a pipeline I can implement it like this:
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit the scaler on labelled + unlabelled rows, then transform only the labelled rows.
all_data = pd.concat([labelled_data, unlabelled_data])
s_scaler = StandardScaler()
s_scaler.fit(all_data)
scaled_labelled_df = s_scaler.transform(labelled_data)
```
Is there a way of implementing this in an sklearn pipeline? I've had a look at FunctionTransformer but don't understand how I could use it in this case.
Solution 1:[1]
Defining a new class that inherits from the desired transformer and overrides its fit method should do the trick, e.g.:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


class StandardScaleWULD(StandardScaler):
    def __init__(self):
        super().__init__()
        # UNLABELLED_TRAITS is a module-level DataFrame holding the unlabelled rows.
        self.unlabelled_data = UNLABELLED_TRAITS

    def fit(self, X, y=None, sample_weight=None):
        # Fit on the union of the training data and the unlabelled data.
        all_data = pd.concat([X, self.unlabelled_data])
        super().fit(all_data, y, sample_weight)
        return self  # fit must return self so Pipeline's fit/transform chain works
```
This new transformer can then be used in the pipeline as usual.
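For illustration, here is a minimal, self-contained sketch of that idea inside a Pipeline. The toy data and the names `unlabelled_df`, `labelled_df`, and `labels` are hypothetical; the unlabelled data is passed to the constructor rather than read from a global, which keeps the example runnable but is otherwise the same technique:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class StandardScaleWULD(StandardScaler):
    """StandardScaler whose statistics are fitted on labelled + unlabelled rows."""

    def __init__(self, unlabelled_data=None):
        super().__init__()
        self.unlabelled_data = unlabelled_data

    def fit(self, X, y=None, sample_weight=None):
        # Fit the scaling statistics on the union of both data sets.
        all_data = pd.concat([pd.DataFrame(X), self.unlabelled_data])
        super().fit(all_data, None, sample_weight)
        return self


# Hypothetical toy data: two labelled rows, two unlabelled rows.
labelled_df = pd.DataFrame({"a": [0.0, 2.0]})
labels = [0, 1]
unlabelled_df = pd.DataFrame({"a": [4.0, 6.0]})

pipe = Pipeline([
    ("scale", StandardScaleWULD(unlabelled_df)),
    ("clf", LogisticRegression()),
])
pipe.fit(labelled_df, labels)

# The scaler's mean is computed over all four rows, not just the labelled two.
print(pipe.named_steps["scale"].mean_)  # prints [3.]
```

Only the scaler sees the unlabelled rows; the final estimator is still trained on the labelled data alone.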
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | A. Bollans |
