'Implementation details of K-means++ without sklearn

I am doing K-means using MINST dataset. However, I found difficulties in the implementation on initialization and some further steps.

For the initialization, I have to first pick one random data point to the first centroid. Then for the remaining centroids, we also pick data points randomly, but from a weighted probability distribution, until all the centroids are chosen

enter image description here

I am sticking in this step, how can I apply this distribution to choose? I mean, how to implement it? for the D_{k-1}(x), can I just use np.linalg.norm to compile and square it?

For my implementation, I now just initialized the first element

self.centroids = np.zeros((self.num_clusters, input_x.shape[1]))
ran_num = np.random.choice(input_x.shape[0])
self.centroids[0] = input_x[ran_num]

for k in range(1, self.num_clusters):

for the next step, do I need to find the next centroid by obtaining the largest distance between the previous centroid and all sample points?



Solution 1:[1]

You need to create a distribution where the probability to select an observation is the (normalized) distance between the observation and its closest cluster. Thus, to select a new cluster center, there is a high probability to select observations that are far from all already existing cluster centers. Similarly, there is a low probability to select observations that are close to already existing cluster centers.

This would look like this:

centers = []
centers.append(X[np.random.randint(X.shape[0])]) # inital center = one random sample
distance = np.full(X.shape[0], np.inf) 
for j in range(1,self.n_clusters):
    distance = np.minimum(np.linalg.norm(X - centers[-1], axis=1), distance) 
    p = np.square(distance) / np.sum(np.square(distance)) # probability vector [p1,...,pn]
    sample = np.random.choice(X.shape[0], p = p)
    centers.append(X[sample])

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 savoga