Adding weights to K-means clustering algorithm
I found an implementation of a K-means clustering algorithm, built from scratch. It is easy to understand and each step is well documented.
You have to specify the number of clusters k and the maximum number of iterations as input to run the algorithm.
import numpy as np
from sklearn.metrics import pairwise_distances

def kmeans(X, k, maxiter, seed = None):
    n_row, n_col = X.shape
    # randomly choose k data points as initial centroids
    if seed is not None:
        np.random.seed(seed)
    rand_indices = np.random.choice(n_row, size = k)
    centroids = X[rand_indices]
    for itr in range(maxiter):
        # compute distances between each data point and the set of centroids
        # and assign each data point to the closest centroid
        distances_to_centroids = pairwise_distances(X, centroids, metric = 'euclidean')
        cluster_assignment = np.argmin(distances_to_centroids, axis = 1)
        # select all data points that belong to cluster i and compute
        # the mean of these data points (each feature individually)
        # this will be our new cluster centroids
        new_centroids = np.array([X[cluster_assignment == i].mean(axis = 0) for i in range(k)])
        # if the updated centroid is still the same,
        # then the algorithm converged
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids
    return centroids, cluster_assignment
I run the algorithm with:
k = 4
centers, label = kmeans(X, k, maxiter = 100)
I now want to use weights when computing the new cluster centers, that is, in this step:
new_centroids = np.array([X[cluster_assignment == i].mean(axis = 0) for i in range(k)])
I have a dataset with three columns: the first two hold the x and y coordinates used for the clustering, and the third holds the weight of each data point.
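As a small illustration of that layout, here is a sketch that separates the coordinates from the weights. The array values and the names `data`, `coords`, and `weights` are made up for the example. Separating them matters because a distance computed on all three columns would treat the weight as a third coordinate.

```python
import numpy as np

# Hypothetical data: columns are x, y, weight (values invented for illustration).
data = np.array([
    [1.0, 2.0, 0.5],
    [1.5, 1.8, 2.0],
    [8.0, 8.0, 1.0],
])

coords = data[:, :2]   # x and y: used for distances and centroids
weights = data[:, 2]   # one weight per row: used only when averaging

print(coords.shape, weights.shape)
```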
However, passing weights to .mean(axis = 0), as in .mean(axis = 0, weights = X[:,2]), is not possible. It raises:
_mean() got an unexpected keyword argument 'weights'
This is because the mean method does not accept a weights argument (if I'm correct).
np.average, on the other hand, does accept a weights argument: (https://numpy.org/doc/stable/reference/generated/numpy.average.html)
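To make the weighted average concrete, here is a minimal sketch on two invented points. It calls `np.average` with `axis=0` and one weight per row, and compares the result with the formula sum(w_i * x_i) / sum(w_i) written out by hand. The names `points`, `w`, `weighted`, and `manual` are only for illustration.

```python
import numpy as np

points = np.array([[0.0, 0.0],
                   [4.0, 2.0]])
w = np.array([1.0, 3.0])  # one weight per row of points

# Weighted mean of each column (axis=0 averages over the rows).
weighted = np.average(points, axis=0, weights=w)

# The same quantity written out: sum(w_i * x_i) / sum(w_i)
# x: (0*1 + 4*3) / 4 = 3.0, y: (0*1 + 2*3) / 4 = 1.5
manual = (points * w[:, None]).sum(axis=0) / w.sum()

print(weighted, manual)
```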
However, simply changing the code to:
new_centroids = np.array([X[cluster_assignment == i].average(axis = 0,weights=X[:,1]) for i in range(k)])
does not work. It gives the error: "'numpy.ndarray' object has no attribute 'average'"
This is because numpy.ndarray has mean, max, and std methods, but no average method (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)
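Since `average` exists only as the module-level function `np.average`, the array has to be passed in as an argument. A second detail is easy to miss: the weights need the same boolean mask as the points, otherwise the two arrays have different lengths. The helper name `weighted_centroids` and the sample values below are hypothetical, and the sketch assumes coordinates and weights are held in separate arrays.

```python
import numpy as np

def weighted_centroids(coords, weights, cluster_assignment, k):
    """Weighted mean of each cluster's points (hypothetical helper)."""
    centroids = []
    for i in range(k):
        mask = cluster_assignment == i
        # Apply the same mask to points and weights so their lengths match.
        centroids.append(np.average(coords[mask], axis=0, weights=weights[mask]))
    return np.array(centroids)

coords = np.array([[0.0, 0.0], [4.0, 2.0], [10.0, 10.0]])
weights = np.array([1.0, 3.0, 2.0])
labels = np.array([0, 0, 1])

centers = weighted_centroids(coords, weights, labels, k=2)
print(centers)
```

As far as I know, `np.average` raises a ZeroDivisionError when the weights of a cluster sum to zero, which would also happen for an empty cluster, so real data may need a guard for that case.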
I do not know how to rewrite the code now. It would be really great if anyone could help me out with this!
Happy Easter, everyone.
If anyone wants the full code of the example, it is available here: http://ethen8181.github.io/machine-learning/clustering/kmeans.html
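One possible way to rewrite the function, as a sketch rather than a definitive answer: pass the coordinates and the weights separately, and compute each centroid with `np.average` on the masked points and the masked weights. The name `weighted_kmeans` is hypothetical. This version departs from the code above in a few labeled ways: distances are computed with plain NumPy instead of `pairwise_distances`, the initial centroids are drawn without replacement, and a cluster that is empty or has zero total weight keeps its previous centroid.

```python
import numpy as np

def weighted_kmeans(X, weights, k, maxiter, seed=None):
    """K-means where each centroid is the weighted mean of its points.

    X       : (n, d) array with the coordinates only (no weight column)
    weights : (n,) array with one non-negative weight per row of X
    """
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)

    # Assumption: sample initial centroids without replacement,
    # so the same data point is never picked twice.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]

    for _ in range(maxiter):
        # Euclidean distance from every point to every centroid, shape (n, k).
        diffs = X[:, None, :] - centroids[None, :, :]
        distances = np.sqrt((diffs ** 2).sum(axis=2))
        cluster_assignment = np.argmin(distances, axis=1)

        new_centroids = centroids.copy()
        for i in range(k):
            mask = cluster_assignment == i
            # Skip clusters that are empty or have zero total weight;
            # they keep their previous centroid.
            if mask.any() and weights[mask].sum() > 0:
                new_centroids[i] = np.average(
                    X[mask], axis=0, weights=weights[mask]
                )

        # Stop once the centroids no longer move.
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return centroids, cluster_assignment


# Usage on invented data: two blobs around (0, 0) and (5, 5) plus random weights.
rng = np.random.default_rng(0)
data = np.column_stack([
    np.concatenate([rng.normal(0, 0.5, 50), rng.normal(5, 0.5, 50)]),  # x
    np.concatenate([rng.normal(0, 0.5, 50), rng.normal(5, 0.5, 50)]),  # y
    rng.uniform(0.1, 1.0, 100),                                        # weight
])
centers, labels = weighted_kmeans(data[:, :2], data[:, 2], k=2, maxiter=100, seed=0)
print(centers)
```

Only the centroid update uses the weights here. The assignment step still uses unweighted Euclidean distance, which matches what the question asks for.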
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
