Error "['cluster'] not found in axis" in K-means algorithm in pandas, during silhouette index calculation

I am trying to reproduce the K-means code given in the following article (https://www.alessiovaccaro.com/resources/kmeans.php).

import pandas as pd
import numpy as np

import seaborn as sns                             # For data visualization and specifically for pairplot()
import matplotlib.pyplot as plt                   # For data visualization

from sklearn.preprocessing import StandardScaler  # To transform the dataset
from sklearn.cluster import KMeans                # To instantiate, train and use model
from sklearn import metrics                       # For Model Evaluation


# Load the point cloud and assign it
point_cloud= np.loadtxt(r'D:/Tesi/K-means/Nuvola_prova.txt',skiprows=2)

# Create the dataframe
# dataframe=pd.DataFrame(list(zip(point_cloud[:,0],point_cloud[:,1],point_cloud[:,2],point_cloud[:,3],
#                                 point_cloud[:,4], point_cloud[:,5], point_cloud[:,11], point_cloud[:,13], 
#                                 point_cloud[:,14])), columns=['X', 'Y','Z','R', 'G', 'B', 'Intensity', 'Density', 'NCR'])
dataframe=pd.DataFrame(list(zip(point_cloud[:,3], point_cloud[:,4], point_cloud[:,5])), 
                       columns=['R', 'G', 'B'])
# Compute the correlation matrix
dataframe.corr()
plt.figure(figsize = (15,6))
sns.heatmap( dataframe.corr(), annot=True)

# Pairplot (to plot all pairwise combinations)
sns.pairplot(dataframe)
plt.figure(figsize = (45,18))
plt.show()

# Boxplot (distribution of several variables at once)
plt.figure(figsize = (15,4))
sns.boxplot(data = dataframe, orient = "h")
plt.show()

# Standardize the dataset to avoid visualization problems
scaler = StandardScaler()
scaled_array = scaler.fit_transform(dataframe)
scaled_dataframe = pd.DataFrame( scaled_array, columns = dataframe.columns )

plt.figure(figsize = (15,4))
sns.boxplot(data = scaled_dataframe, orient = "h")
plt.show()

scaled_dataframe.describe()


# Modelling with K-means
# Choose the initial number of clusters
kmeans_model = KMeans(n_clusters = 4)

#Fitting
kmeans_model.fit(scaled_dataframe)

centroids = kmeans_model.cluster_centers_
centroids
kmeans_model.cluster_centers_.shape
# Display the results
kmeans_model.labels_
dataframe["cluster"] = kmeans_model.labels_
dataframe

sns.pairplot(data = dataframe, hue = "cluster", palette = "Accent_r")
plt.show()

# Search for the optimal number of clusters using the silhouette index
# Iterate from 2 to 24 clusters and store the index in a dictionary
k_to_test = range(2,25,1) # [2,3,4, ..., 24]
silhouette_scores = {}

for k in k_to_test:
    model_kmeans_k = KMeans( n_clusters = k )
    model_kmeans_k.fit(scaled_dataframe.drop("cluster", axis = 1))
    labels_k = model_kmeans_k.labels_
    score_k = metrics.silhouette_score(scaled_dataframe.drop("cluster", axis=1), labels_k)
    silhouette_scores[k] = score_k
    print("Tested kMeans with k = %d\tSS: %5.4f" % (k, score_k))
    
print("Done!")

Unfortunately, as soon as the loop that computes the optimal number of clusters starts, it raises the following error:

  File "D:\Tesi\K-means\Main.py", line 73, in <module>
    model_kmeans_k.fit(scaled_dataframe.drop("cluster", axis = 1))

  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)

  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\core\frame.py", line 4906, in drop
    return super().drop(

  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\core\generic.py", line 4150, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)

  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\core\generic.py", line 4185, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)

  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\core\indexes\base.py", line 6017, in drop
    raise KeyError(f"{labels[mask]} not found in axis")

KeyError: "['cluster'] not found in axis"

I searched online but could not find a solution. How can I solve this problem?



Solution 1:[1]

KeyError: "['cluster'] not found in axis" is raised by scaled_dataframe.drop("cluster", axis = 1), which means that scaled_dataframe does not have a column called 'cluster'.

Your code adds that column only to dataframe, and only after scaled_dataframe was created, so scaled_dataframe never receives it. Because scaled_dataframe already contains only the feature columns, the drop call is unnecessary there.
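A minimal sketch of the situation described above, assuming random integers as a hypothetical stand-in for the R, G, B columns (the original point cloud file is not available here). It shows that the scaled frame is an independent object, so a column added to dataframe afterwards never appears in it:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the R, G, B columns of the point cloud
rng = np.random.default_rng(0)
dataframe = pd.DataFrame(rng.integers(0, 256, size=(50, 3)),
                         columns=["R", "G", "B"])

# Same construction as in the question: a new, independent DataFrame
scaled_array = StandardScaler().fit_transform(dataframe)
scaled_dataframe = pd.DataFrame(scaled_array, columns=dataframe.columns)

# Adding a column to the original afterwards does not touch the scaled frame
dataframe["cluster"] = 0

print(list(dataframe.columns))         # includes "cluster"
print(list(scaled_dataframe.columns))  # only the feature columns

# Dropping a column that is not there raises the KeyError from the question
error_message = None
try:
    scaled_dataframe.drop("cluster", axis=1)
except KeyError as exc:
    error_message = str(exc)
    print("KeyError:", error_message)
```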

You can print the list of columns to verify this:

print(scaled_dataframe.columns)
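One possible way to adjust the loop from the question is to pass scaled_dataframe directly, since it has no 'cluster' column to drop. The sketch below is not the article's code: the synthetic data, the shortened range of k, and the n_init and random_state arguments are assumptions made only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: three loose groups of R, G, B values
rng = np.random.default_rng(0)
centers = np.array([[40, 40, 40], [128, 200, 90], [220, 60, 180]])
points = np.vstack([c + rng.normal(0, 10, size=(60, 3)) for c in centers])
dataframe = pd.DataFrame(points, columns=["R", "G", "B"])

scaled_dataframe = pd.DataFrame(
    StandardScaler().fit_transform(dataframe), columns=dataframe.columns
)

k_to_test = range(2, 7)  # shortened range, for the example only
silhouette_scores = {}

for k in k_to_test:
    # scaled_dataframe holds only feature columns, so nothing is dropped
    model_kmeans_k = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels_k = model_kmeans_k.fit_predict(scaled_dataframe)
    score_k = metrics.silhouette_score(scaled_dataframe, labels_k)
    silhouette_scores[k] = score_k
    print("Tested kMeans with k = %d\tSS: %5.4f" % (k, score_k))

# Pick the k with the highest silhouette score
best_k = max(silhouette_scores, key=silhouette_scores.get)
print("Best k by silhouette score:", best_k)
```

If the 'cluster' column might or might not be present, scaled_dataframe.drop("cluster", axis=1, errors="ignore") returns the frame unchanged instead of raising when the column is missing.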

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Raymond Kwok