'Implementation details of K-means++ without sklearn
I am doing K-means using MINST dataset. However, I found difficulties in the implementation on initialization and some further steps.
For the initialization, I have to first pick one random data point to the first centroid. Then for the remaining centroids, we also pick data points randomly, but from a weighted probability distribution, until all the centroids are chosen
I am sticking in this step, how can I apply this distribution to choose? I mean, how to implement it? for the D_{k-1}(x), can I just use np.linalg.norm to compile and square it?
For my implementation, I now just initialized the first element
self.centroids = np.zeros((self.num_clusters, input_x.shape[1]))
ran_num = np.random.choice(input_x.shape[0])
self.centroids[0] = input_x[ran_num]
for k in range(1, self.num_clusters):
for the next step, do I need to find the next centroid by obtaining the largest distance between the previous centroid and all sample points?
Solution 1:[1]
You need to create a distribution where the probability to select an observation is the (normalized) distance between the observation and its closest cluster. Thus, to select a new cluster center, there is a high probability to select observations that are far from all already existing cluster centers. Similarly, there is a low probability to select observations that are close to already existing cluster centers.
This would look like this:
centers = []
centers.append(X[np.random.randint(X.shape[0])]) # inital center = one random sample
distance = np.full(X.shape[0], np.inf)
for j in range(1,self.n_clusters):
distance = np.minimum(np.linalg.norm(X - centers[-1], axis=1), distance)
p = np.square(distance) / np.sum(np.square(distance)) # probability vector [p1,...,pn]
sample = np.random.choice(X.shape[0], p = p)
centers.append(X[sample])
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | savoga |

