Including unlabelled data in sklearn pipeline
I'm setting up a machine learning pipeline to classify some data. I have lots of unlabelled data (i.e. the target variable is unknown) that I would like to make use of. One way I would like to use it is to fit the transformers in my pipeline on it. For example, for the variables I am scaling, I want StandardScaler to fit on the given training data plus the unlabelled data, and then transform only the training data.
For clarity, outside of a pipeline I can implement it like this:
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit the scaler on labelled + unlabelled rows, then transform only the labelled rows.
all_data = pd.concat([labelled_data, unlabelled_data])
s_scaler = StandardScaler()
s_scaler.fit(all_data)
scaled_labelled_df = s_scaler.transform(labelled_data)
```
Is there a way of implementing this in an sklearn pipeline? I've had a look at FunctionTransformer but don't understand how I could use it in this case.
Solution 1:[1]
Defining a new class that inherits from the desired transformer and overrides its fit method should do the trick, e.g.:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


class StandardScaleWULD(StandardScaler):
    def __init__(self):
        super().__init__()
        # UNLABELLED_TRAITS is a module-level DataFrame holding the unlabelled rows.
        self.unlabelled_data = UNLABELLED_TRAITS

    def fit(self, X, y=None, sample_weight=None):
        # Fit on the union of the training data and the unlabelled data.
        all_data = pd.concat([X, self.unlabelled_data])
        super().fit(all_data, y, sample_weight)
        return self  # fit must return self so Pipeline's fit/transform chain works
```
This new transformer can then be used in the pipeline as usual.
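For illustration, here is a minimal, self-contained sketch of that idea inside a Pipeline. The toy data and the names `unlabelled_df`, `labelled_df`, and `labels` are hypothetical; the unlabelled data is passed to the constructor rather than read from a global, which keeps the example runnable but is otherwise the same technique:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class StandardScaleWULD(StandardScaler):
    """StandardScaler whose statistics are fitted on labelled + unlabelled rows."""

    def __init__(self, unlabelled_data=None):
        super().__init__()
        self.unlabelled_data = unlabelled_data

    def fit(self, X, y=None, sample_weight=None):
        # Fit the scaling statistics on the union of both data sets.
        all_data = pd.concat([pd.DataFrame(X), self.unlabelled_data])
        super().fit(all_data, None, sample_weight)
        return self


# Hypothetical toy data: two labelled rows, two unlabelled rows.
labelled_df = pd.DataFrame({"a": [0.0, 2.0]})
labels = [0, 1]
unlabelled_df = pd.DataFrame({"a": [4.0, 6.0]})

pipe = Pipeline([
    ("scale", StandardScaleWULD(unlabelled_df)),
    ("clf", LogisticRegression()),
])
pipe.fit(labelled_df, labels)

# The scaler's mean is computed over all four rows, not just the labelled two.
print(pipe.named_steps["scale"].mean_)  # prints [3.]
```

Only the scaler sees the unlabelled rows; the final estimator is still trained on the labelled data alone.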
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | A. Bollans |
