KeyError: "['cluster'] not found in axis" in K-means algorithm in pandas, during Silhouette index calculation
I am trying to reproduce the K-means code given in the following article (https://www.alessiovaccaro.com/resources/kmeans.php).
import pandas as pd
import numpy as np
import seaborn as sns # For data visualization and specifically for pairplot()
import matplotlib.pyplot as plt # For data visualization
from sklearn.preprocessing import StandardScaler # To transform the dataset
from sklearn.cluster import KMeans # To instantiate, train and use model
from sklearn import metrics # For Model Evaluation
# Load the point cloud and assign it to an array
point_cloud= np.loadtxt(r'D:/Tesi/K-means/Nuvola_prova.txt',skiprows=2)
# Create the dataframe
# dataframe=pd.DataFrame(list(zip(point_cloud[:,0],point_cloud[:,1],point_cloud[:,2],point_cloud[:,3],
# point_cloud[:,4], point_cloud[:,5], point_cloud[:,11], point_cloud[:,13],
# point_cloud[:,14])), columns=['X', 'Y','Z','R', 'G', 'B', 'Intensity', 'Density', 'NCR'])
dataframe=pd.DataFrame(list(zip(point_cloud[:,3], point_cloud[:,4], point_cloud[:,5])),
columns=['R', 'G', 'B'])
# Compute the correlation matrix
dataframe.corr()
plt.figure(figsize = (15,6))
sns.heatmap( dataframe.corr(), annot=True)
# Pairplot (to plot all pairwise combinations of variables)
sns.pairplot(dataframe)
plt.figure(figsize = (45,18))
plt.show()
# Boxplot (distribution of several variables at once)
plt.figure(figsize = (15,4))
sns.boxplot(data = dataframe, orient = "h")
plt.show()
# Standardize the dataset to avoid visualization problems
scaler = StandardScaler()
scaled_array = scaler.fit_transform(dataframe)
scaled_dataframe = pd.DataFrame( scaled_array, columns = dataframe.columns )
plt.figure(figsize = (15,4))
sns.boxplot(data = scaled_dataframe, orient = "h")
plt.show()
scaled_dataframe.describe()
# Modeling with K-means
# Choose the initial number of clusters
kmeans_model = KMeans(n_clusters = 4)
#Fitting
kmeans_model.fit(scaled_dataframe)
centroids = kmeans_model.cluster_centers_
centroids
kmeans_model.cluster_centers_.shape
# Visualize the results
kmeans_model.labels_
dataframe["cluster"] = kmeans_model.labels_
dataframe
sns.pairplot(data = dataframe, hue = "cluster", palette = "Accent_r")
plt.show()
# Search for the optimal number of clusters using the silhouette index
# Iterate over k from 2 to 24 clusters and store each score in a dictionary
k_to_test = range(2,25,1) # [2,3,4, ..., 24]
silhouette_scores = {}
for k in k_to_test:
    model_kmeans_k = KMeans( n_clusters = k )
    model_kmeans_k.fit(scaled_dataframe.drop("cluster", axis = 1))
    labels_k = model_kmeans_k.labels_
    score_k = metrics.silhouette_score(scaled_dataframe.drop("cluster", axis=1), labels_k)
    silhouette_scores[k] = score_k
    print("Tested kMeans with k = %d\tSS: %5.4f" % (k, score_k))
print("Done!")
Unfortunately, as soon as the loop that computes the optimal number of clusters starts, it raises the following error:
  File "D:\Tesi\K-means\Main.py", line 73, in <module>
    model_kmeans_k.fit(scaled_dataframe.drop("cluster", axis = 1))
  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\core\frame.py", line 4906, in drop
    return super().drop(
  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\core\generic.py", line 4150, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\core\generic.py", line 4185, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "D:\Anaconda\envs\py39\lib\site-packages\pandas\core\indexes\base.py", line 6017, in drop
    raise KeyError(f"{labels[mask]} not found in axis")
KeyError: "['cluster'] not found in axis"
Searching on the internet I could not find a solution. How can I solve this problem?
Solution 1:[1]
The KeyError: "['cluster'] not found in axis" is raised by scaled_dataframe.drop("cluster", axis = 1), which means that scaled_dataframe has no column called 'cluster'.
Your code adds that column only to dataframe, and it does so after scaled_dataframe was created, so scaled_dataframe never receives it.
You can print the list of columns to verify:
print(scaled_dataframe.columns)
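Building on that diagnosis, one possible fix is to skip the drop and fit on scaled_dataframe directly, since it already holds only the feature columns. The sketch below is an illustration, not the article's code: the random R, G, B values stand in for the point cloud file, the range of k is shortened, and n_init and random_state are set only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the R, G, B columns of the point cloud (assumption).
rng = np.random.default_rng(0)
dataframe = pd.DataFrame(
    rng.integers(0, 256, size=(300, 3)).astype(float), columns=["R", "G", "B"]
)

# scaled_dataframe is a new object built from the scaled array.
scaled_dataframe = pd.DataFrame(
    StandardScaler().fit_transform(dataframe), columns=dataframe.columns
)

# The labels are added to dataframe only; scaled_dataframe is not affected.
kmeans_model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled_dataframe)
dataframe["cluster"] = kmeans_model.labels_
print(list(scaled_dataframe.columns))  # ['R', 'G', 'B']

# There is no "cluster" column to drop, so use scaled_dataframe as it is.
silhouette_scores = {}
for k in range(2, 6):  # shortened range to keep the example quick
    model_kmeans_k = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels_k = model_kmeans_k.fit_predict(scaled_dataframe)
    score_k = metrics.silhouette_score(scaled_dataframe, labels_k)
    silhouette_scores[k] = score_k
    print("Tested kMeans with k = %d\tSS: %5.4f" % (k, score_k))
```

If the same loop must also work on a frame that may or may not carry the label column, scaled_dataframe.drop(columns="cluster", errors="ignore") returns the frame unchanged when the column is absent instead of raising.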
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Raymond Kwok |
