Get inertia for nltk k means clustering using cosine_similarity

I used nltk for k-means clustering because I would like to change the distance metric. Does nltk's k-means have an inertia attribute similar to sklearn's? I can't seem to find it in their documentation or online.

The code below is how people usually find inertia using sklearn k means.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# features: the data matrix to cluster (defined elsewhere)
inertia = []
for n_clusters in range(2, 26):
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(features)
    centers = clusterer.cluster_centers_
    inertia.append(clusterer.inertia_)

# plot inertia against k and look for the elbow
plt.plot(list(range(2, 26)), inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()


Solution 1:[1]

You can write your own function to obtain the inertia for KMeansClusterer in nltk.

This follows on from your earlier question, "How do I obtain individual centroids of K mean cluster using nltk (python)". The example below uses the same dummy data from that question, after clustering it into 2 clusters.

Referring to the docs (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), inertia is the sum of squared distances of samples to their closest cluster center.

import numpy as np
from sklearn.cluster import KMeans

# df is the dummy DataFrame from the linked question; df['centroid'] is
# assumed to hold, for each row, the centroid of that row's cluster
feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # inertia as given in the scikit-learn docs, i.e. sum of squared distances
        sum_.append(np.sum((feature_matrix[i] - centroid[i])**2))

    return sum(sum_)

nltk_inertia(feature_matrix, centroid)
# output: 27.495250000000002

# now run scikit-learn KMeans on feature1, feature2 and feature3 with the same number of clusters (2)
scikit_kmeans = KMeans(n_clusters=2)
scikit_kmeans.fit(vectors)  # vectors = [np.array(f) for f in df.values], which contain feature1, feature2, feature3
scikit_kmeans.inertia_
# output: 27.495250000000006
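
As a cross-check on the definition quoted from the scikit-learn docs, here is a minimal sketch that computes inertia directly as the sum of squared Euclidean distances from each point to its closest center. The function name inertia_closest_center and the toy data are made up for illustration; they are not part of nltk or scikit-learn.

```python
import numpy as np

def inertia_closest_center(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_points, n_centers)
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Keep only the distance to the closest center for each point
    return float(np.sum(np.min(sq_dists, axis=1)))

# Hypothetical toy data: two tight groups around (0, 1) and (10, 1)
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
centers = [[0, 1], [10, 1]]
print(inertia_closest_center(points, centers))  # 4.0
```

Every point here lies exactly 1 unit from its nearest center, so the four squared distances add up to 4.0.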

Solution 2:[2]

The previous answer is actually missing a small detail:

feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
cluster = df['predicted_cluster'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # look up the centroid of the cluster that sample i was assigned to
        sum_.append(np.sum((feature_matrix[i] - centroid[cluster[i]])**2))

    return sum(sum_)

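You have to select the corresponding cluster centroid when calculating the distance between centroids and data points. Notice the cluster variable in the above code. This matters when centroid is indexed by cluster label (one entry per cluster) rather than by row.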
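
Since the question is about a cosine-based metric, the same label-indexed lookup can be sketched with a pluggable distance function. This is only an illustration under stated assumptions: inertia_with_metric and the toy data are hypothetical, cosine_distance is defined locally as 1 minus cosine similarity, and summing squared cosine distances is one possible analogue of scikit-learn's inertia_, not an nltk feature.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def inertia_with_metric(vectors, labels, centers, distance=cosine_distance):
    """Sum of squared distances from each vector to its assigned center.

    labels[i] is the index into centers for vectors[i].
    Squaring the distance is an assumption made to mirror sklearn's inertia_.
    """
    total = 0.0
    for vec, label in zip(vectors, labels):
        total += distance(vec, centers[label]) ** 2
    return float(total)

# Hypothetical toy data: the second vector is orthogonal to its assigned center
vectors = [[1, 0], [0, 1], [0, 2]]
labels = [0, 0, 1]
centers = [[1, 0], [0, 1]]
print(inertia_with_metric(vectors, labels, centers))  # 1.0
```

With nltk, labels would presumably come from KMeansClusterer.cluster(vectors, assign_clusters=True) and centers from the clusterer's means() method, and nltk.cluster.util.cosine_distance could be passed as distance in place of the local helper.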

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 qaiser
Solution 2 Salih Kılıçlı