Sklearn Pipelines: how to get a dataframe after FeatureUnion
I am trying to preprocess my data.
I need to apply some preprocessing to the whole dataset, then split it for further, column-specific preprocessing, and then union the results again.
I use this pipeline:
    converter = Pipeline(
        [
            ('input_preproc', MainPreprocessing()),
            ('feature_union', FeatureUnion(
                [
                    ('main_columns', ColumnSelector(UPDATED_CAT_FEATURES)),
                    ('tfidf_hunts', ColumnTransformer([("tfidf", DenseTfidfVectorizer(), 'col_name')]))
                ]
            ))
        ]
    )
where ColumnSelector is

    class ColumnSelector(BaseEstimator, TransformerMixin):
        def __init__(self, columns):
            self.columns = columns

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X[self.columns]
and DenseTfidfVectorizer is

    class DenseTfidfVectorizer(TfidfVectorizer):
        def transform(self, raw_documents, copy=True):
            X = super().transform(raw_documents, copy=copy)
            df = pd.DataFrame(X.toarray(), columns=self.get_feature_names())
            return df

        def fit_transform(self, raw_documents, y=None):
            X = super().fit_transform(raw_documents, y=y)
            df = pd.DataFrame(X.toarray(), columns=self.get_feature_names())
            return df
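A note for anyone running this on a newer scikit-learn: `get_feature_names()` was deprecated in version 1.0 and removed in 1.2 in favour of `get_feature_names_out()`. A minimal sketch of a version-tolerant lookup follows; `tfidf_feature_names` is a hypothetical helper name, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_feature_names(vectorizer):
    """Return the fitted vocabulary terms in column order (hypothetical helper)."""
    # Newer scikit-learn releases expose get_feature_names_out().
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Older releases only have get_feature_names().
    return list(vectorizer.get_feature_names())


vec = TfidfVectorizer().fit(["123 qwe asd", "234 wer sdf"])
names = tfidf_feature_names(vec)
```

Inside `DenseTfidfVectorizer`, `columns=tfidf_feature_names(self)` could then replace the direct `self.get_feature_names()` call.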
So the main idea is to run input_preproc as the first step, then extract one column for TF-IDF, and then union the resulting DataFrame with all the other features (excluding the column used for TF-IDF).
For a dataset like this:
    test = pd.DataFrame({
        'a': [1, 2, 3, 4, 5],
        'text': ['123 qwe asd', '234 wer sdf', '123 wer sdf', '345 wer asd', 'zxc asd qwe']
    })
I get just a NumPy array, not a DataFrame:
    array([[1.        , 0.60981846, 0.        , 0.        , 0.50620441,
            0.60981846, 0.        , 0.        , 0.        ],
           [2.        , 0.        , 0.69015927, 0.        , 0.        ,
            0.        , 0.55681615, 0.4622077 , 0.        ],
           [3.        , 0.60981846, 0.        , 0.        , 0.        ,
            0.        , 0.60981846, 0.50620441, 0.        ],
           [4.        , 0.        , 0.        , 0.72604443, 0.48624042,
            0.        , 0.        , 0.48624042, 0.        ],
           [5.        , 0.        , 0.        , 0.        , 0.4622077 ,
            0.55681615, 0.        , 0.        , 0.69015927]])
So how can I get a DataFrame as the output?
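One possible approach, offered as a sketch and not as the definitive fix: FeatureUnion stacks the outputs of its branches into an array by default, so the column labels are lost even when each branch returns a DataFrame. The code below replaces that step with a small union that joins the branch outputs with `pd.concat`. `DataFrameUnion` and `TfidfColumn` are hypothetical helper names (not scikit-learn classes), and the sketch assumes every branch returns a DataFrame that shares the input's index.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Select a subset of DataFrame columns, as in the question."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


class TfidfColumn(TransformerMixin, BaseEstimator):
    """Hypothetical helper: TF-IDF on one text column, returned as a DataFrame."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        self.vectorizer_ = TfidfVectorizer().fit(X[self.column])
        return self

    def transform(self, X):
        matrix = self.vectorizer_.transform(X[self.column])
        # vocabulary_ maps each term to its column index; order terms by index.
        vocab = self.vectorizer_.vocabulary_
        names = sorted(vocab, key=vocab.get)
        return pd.DataFrame(matrix.toarray(), columns=names, index=X.index)


class DataFrameUnion(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for FeatureUnion that keeps column labels."""

    def __init__(self, transformer_list):
        self.transformer_list = transformer_list

    def fit(self, X, y=None):
        # Fit a fresh clone of every branch on the same input.
        self.fitted_ = [(name, clone(t).fit(X, y)) for name, t in self.transformer_list]
        return self

    def transform(self, X):
        # Join the branch outputs side by side, aligned on the shared index.
        frames = [t.transform(X) for _, t in self.fitted_]
        return pd.concat(frames, axis=1)


test = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'text': ['123 qwe asd', '234 wer sdf', '123 wer sdf', '345 wer asd', 'zxc asd qwe']
})

union = DataFrameUnion([
    ('main_columns', ColumnSelector(['a'])),
    ('tfidf', TfidfColumn('text')),
])
result = union.fit_transform(test)
```

Here `result` is a DataFrame whose first column is `a`, followed by one column per TF-IDF term. The same object could take the place of the `FeatureUnion` step in the pipeline above, with `UPDATED_CAT_FEATURES` and `'col_name'` substituted for the toy column names.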
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
