Creating a custom transformer in scikit-learn that adds cluster labels
I am writing a custom transformer in scikit-learn that adds cluster labels, computed by a stock KMeans, as a new column of a pandas dataframe. The transformer should fit on existing data and then transform unseen data by adding a new column named 'Clusters', returning a new dataframe with the additional column without modifying the original dataframe. Below is the code that I came up with:
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters=10):
        self.clusters = clusters
        self.model = KMeans(n_clusters=self.clusters)

    def fit(self, X):
        self.X = X
        self.model.fit(self.X)
        return self.model

    def transform(self, X):
        self.X = X
        X_ = X.copy()  # avoiding modification of the original df
        X_['Clusters'] = self.model.transform(self.X_).labels_
        return X_

cluster_enc_tr_data = AddClustersFeature().fit_transform(enc_tr_data)
cluster_enc_tr_data
```
Unfortunately the code does not work properly. The result is a dataframe with cluster numbers as column indices, row numbers as the index, and previously unknown values. Any help or tips would be greatly appreciated.
Update (23 June '21, v2): please see below the code after implementing Ben's revised comments. It works perfectly now.
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters=10):
        self.clusters = clusters

    def fit(self, X):
        self.X = X
        self.model = KMeans(n_clusters=self.clusters)
        self.model.fit(self.X)
        return self

    def transform(self, X):
        self.X = X
        X_ = X.copy()  # avoiding modification of the original df
        X_['Clusters'] = self.model.predict(X_)
        return X_

cluster_enc_tr_data = AddClustersFeature().fit_transform(enc_tr_data)
```
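As a quick sanity check, the fixed transformer can be exercised on a small made-up dataframe (the original `enc_tr_data` is the asker's own data, so a toy frame stands in for it here):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters=10):
        self.clusters = clusters

    def fit(self, X, y=None):
        # build the model at fit time, not in __init__
        self.model = KMeans(n_clusters=self.clusters, n_init=10)
        self.model.fit(X)
        return self

    def transform(self, X):
        X_ = X.copy()  # avoid modifying the original df
        X_['Clusters'] = self.model.predict(X_)
        return X_

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 2)), columns=['a', 'b'])

out = AddClustersFeature(clusters=3).fit_transform(df)
print(out.columns.tolist())      # ['a', 'b', 'Clusters']
print('Clusters' in df.columns)  # False: the original df is untouched
```

(`n_init=10` is passed explicitly only to silence the default-change warning in recent scikit-learn versions.)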
Solution 1:[1]
The fit method must always return self.
The problem here is that fit_transform(X, y), inherited from TransformerMixin, is just fit(X, y).transform(X); your fit returns the underlying fitted KMeans, so its transform gets called on X instead of yours.
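To see the chaining concretely, here is a minimal sketch (the class and method bodies are illustrative, not from the question): whatever object fit returns is the object whose transform ends up being called by fit_transform.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class Good(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self                      # chaining reaches Good.transform

    def transform(self, X):
        return [x * 2 for x in X]

class Inner:
    def transform(self, X):
        return [x * 10 for x in X]

class Bad(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return Inner()                   # fit_transform now calls Inner.transform

    def transform(self, X):
        return [x * 2 for x in X]        # bypassed by fit_transform

print(Good().fit_transform([1, 2]))  # [2, 4]
print(Bad().fit_transform([1, 2]))   # [10, 20] — Bad.transform never ran
```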
A few more notes though:
1. `KMeans.transform` gives the cluster-distance matrix, but you want the cluster labels. Use `predict` instead. (And drop `labels_`, so just `X_['Clusters'] = self.model.predict(X_)`.)
2. `__init__` should only set attributes that appear in its signature, in order for cloning to work (required for e.g. hyperparameter searches). You can define `self.model` at `fit` time.
3. In `transform`, you use `self.X_` but it is never defined; I guess you mean just `X_`. There's no real reason to save `X` at fit time either; `self.X` is never really needed.
4. This will only work on dataframes; that may not be a problem for you, but keep it in mind. (You can't use this as a step in a pipeline after builtin `sklearn` transformers, because those will return numpy arrays.)
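The `transform` vs `predict` distinction in the first note can be checked directly on a fitted KMeans (the data here is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious clusters around 0 and 10.
X = np.array([[0.0], [0.1], [10.0], [10.1]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform: one distance per (sample, cluster) pair
print(km.transform(X).shape)  # (4, 2)
# predict: one integer label per sample — what you want in the new column
print(km.predict(X).shape)    # (4,)
```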
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
