Sklearn Pipelines: how to get a DataFrame after FeatureUnion

I am trying to preprocess my data.

I need to apply one preprocessing step to all of the data, then split it so that different columns get their own preprocessing, and then combine the results again.

I use this pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

converter = Pipeline(
    [
        ('input_preproc', MainPreprocessing()),
        ('feature_union', FeatureUnion(
            [
                ('main_columns', ColumnSelector(UPDATED_CAT_FEATURES)),
                ('tfidf_hunts', ColumnTransformer([("tfidf", DenseTfidfVectorizer(), 'col_name')]))
            ]
        ))
    ]
)

Where ColumnSelector is

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):

    def __init__(self, columns):
        self.columns = columns
    
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]

and DenseTfidfVectorizer is

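import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

class DenseTfidfVectorizer(TfidfVectorizer):
    # Note: get_feature_names() was removed in scikit-learn 1.2;
    # newer versions provide get_feature_names_out() instead.

    def transform(self, raw_documents, copy=True):
        # Convert the sparse TF-IDF matrix to a dense DataFrame
        X = super().transform(raw_documents, copy=copy)
        df = pd.DataFrame(X.toarray(), columns=self.get_feature_names())
        return df

    def fit_transform(self, raw_documents, y=None):
        # Same conversion for the combined fit-and-transform path
        X = super().fit_transform(raw_documents, y=y)
        df = pd.DataFrame(X.toarray(), columns=self.get_feature_names())
        return df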

So the main idea is to apply input_preproc as the first step, then extract one column for TF-IDF, and then union the resulting DataFrame with all the other features, excluding the column used for TF-IDF.

And for this kind of dataset

test = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'text': ['123 qwe asd', '234 wer sdf', '123 wer sdf', '345 wer asd', 'zxc asd qwe']
})

I get just an array, not a DataFrame:

array([[1.        , 0.60981846, 0.        , 0.        , 0.50620441,
    0.60981846, 0.        , 0.        , 0.        ],
   [2.        , 0.        , 0.69015927, 0.        , 0.        ,
    0.        , 0.55681615, 0.4622077 , 0.        ],
   [3.        , 0.60981846, 0.        , 0.        , 0.        ,
    0.        , 0.60981846, 0.50620441, 0.        ],
   [4.        , 0.        , 0.        , 0.72604443, 0.48624042,
    0.        , 0.        , 0.48624042, 0.        ],
   [5.        , 0.        , 0.        , 0.        , 0.4622077 ,
    0.55681615, 0.        , 0.        , 0.69015927]])
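
A likely reason for the plain array, assuming FeatureUnion falls back to NumPy stacking in its default configuration: stacking DataFrames with np.hstack discards the column labels. The frames left and right below are made-up stand-ins for the two branches of the union, not data from the pipeline above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the outputs of the two union branches
left = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'b': [0.5, 0.25]})

# NumPy converts each DataFrame to an array before stacking,
# so the column names are lost in the result
stacked = np.hstack([left, right])
print(type(stacked).__name__)  # ndarray
```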

So how can I get a DataFrame as the output?
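
One possible direction, given as a minimal sketch rather than a definitive answer: replace FeatureUnion with a small transformer that concatenates DataFrames with pd.concat. DataFrameUnion and TfidfFrame are hypothetical helper names introduced here for illustration; they are not part of scikit-learn, and MainPreprocessing is left out because its code is not shown.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns (same as in the question)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


class TfidfFrame(BaseEstimator, TransformerMixin):
    """Hypothetical helper: TF-IDF on one text column, returned as a DataFrame."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        self.vectorizer_ = TfidfVectorizer().fit(X[self.column])
        return self

    def transform(self, X):
        matrix = self.vectorizer_.transform(X[self.column])
        # Order the terms by their column index in the TF-IDF matrix
        vocab = self.vectorizer_.vocabulary_
        terms = sorted(vocab, key=vocab.get)
        # Reuse the input index so pd.concat aligns the rows
        return pd.DataFrame(matrix.toarray(), columns=terms, index=X.index)


class DataFrameUnion(BaseEstimator, TransformerMixin):
    """Hypothetical helper: like FeatureUnion, but keeps DataFrame output."""

    def __init__(self, transformers):
        self.transformers = transformers

    def fit(self, X, y=None):
        # Fits the given transformers in place (no cloning, for brevity)
        for _, transformer in self.transformers:
            transformer.fit(X, y)
        return self

    def transform(self, X):
        frames = [transformer.transform(X) for _, transformer in self.transformers]
        return pd.concat(frames, axis=1)


test = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'text': ['123 qwe asd', '234 wer sdf', '123 wer sdf', '345 wer asd', 'zxc asd qwe'],
})

union = DataFrameUnion([
    ('main_columns', ColumnSelector(['a'])),
    ('tfidf', TfidfFrame('text')),
])
result = union.fit_transform(test)
print(type(result).__name__)  # DataFrame
print(list(result.columns))   # ['a', '123', '234', '345', 'asd', 'qwe', 'sdf', 'wer', 'zxc']
```

Because every branch returns a DataFrame that shares the input index, pd.concat with axis=1 lines the rows up and keeps the column names. This sketch skips conveniences that FeatureUnion provides, such as parallel fitting and transformer weights.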



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
