Visualization of K-Means Clustering of multiple columns

Dataset file : google drive link

Hello Community, I need help applying K-Means clustering to this use case.

I have a dataset with 27884 rows and 8933 columns.

Here's a little preview of the dataset:

user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11
1 1 7 2 3 8 0 4 0 6 0 5
2 7 8 1 2 4 6 5 9 10 3 0
3 0 0 0 0 1 5 2 3 4 0 6
4 1 7 2 3 8 0 5 0 6 0 4
5 0 4 7 0 6 1 5 3 0 0 2
6 1 0 2 3 0 5 4 0 0 6 7

Here the user_iD column represents students, and columns b1-b11 represent book chapters. Each value gives the position of that chapter in the order the student studied it (first, second, third, and so on). A 0 entry means the student did not study that chapter.

This is just a small preview of a big dataset. There are a total of 27884 users and 8932 chapters, named b1 to b8932.
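To make the layout concrete, here is a minimal sketch that loads the preview rows above into a DataFrame and separates the identifier from the chapter columns. The variable names (`preview`, `df`, `features`, `studied`) are illustrative assumptions, not part of the original dataset or code.

```python
import io

import pandas as pd

# The preview rows from above, parsed as whitespace-separated text.
preview = """user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11
1 1 7 2 3 8 0 4 0 6 0 5
2 7 8 1 2 4 6 5 9 10 3 0
3 0 0 0 0 1 5 2 3 4 0 6
4 1 7 2 3 8 0 5 0 6 0 4
5 0 4 7 0 6 1 5 3 0 0 2
6 1 0 2 3 0 5 4 0 0 6 7
"""
df = pd.read_csv(io.StringIO(preview), sep=r"\s+")

# user_iD is an identifier, not a feature, so it is set aside here.
features = df.drop(columns=["user_iD"])

# A non-zero entry means the chapter was studied; count them per student.
studied = (features > 0).sum(axis=1)
print(features.shape)
print(studied.tolist())
```

Whether to keep `user_iD` out of the clustering input is a modelling choice; it is dropped here only because it labels rows rather than describing study behaviour.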

Here's the complete dataset shape information:

[image: dataset shape output]

I'm applying K-Means clustering. How do I visualize all the clusters using all the columns?

As I stated, there are 27884 users and 8932 chapter columns. So far I have only managed a plot using the user_iD and b1 columns. How do I use all the columns at once?

What I have tried so far

# Build and train the model
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

model = KMeans(n_clusters=5)
model.fit(df3)

# See the predictions
model.labels_
model.cluster_centers_

# Plot the predictions against the original data set
fig = plt.figure(figsize=(6, 6))
# c= is required for cmap to take effect; color each point by its cluster label
plt.scatter(df3['user_iD'], df3['b1'], c=model.labels_, cmap='rainbow',
            linewidths=1, alpha=.7,
            edgecolor='k'
            )
plt.show()

This gives me clustering visualization based on a single column.



Solution 1:[1]

Well, you cannot do it directly if you have more than 3 columns. However, you can apply Principal Component Analysis (PCA) to reduce the data to 2 dimensions and visualize that instead.

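import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA

pca_num_components = 2

# Assumes `model` is the fitted KMeans from the question
df3['clusters'] = model.labels_
# Columns 1:12 are b1-b11 in the preview; widen the slice for the full dataset
reduced_data = PCA(n_components=pca_num_components).fit_transform(df3.iloc[:,1:12])
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])

sns.scatterplot(x="pca1", y="pca2", hue=df3['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()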
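For anyone who wants to try the idea without the original file, here is a self-contained sketch of the same approach on synthetic data. The array shape, the random values, the `random_state` settings, and the helper name `plot_clusters` are all assumptions made for illustration. It also projects the cluster centers with the same fitted PCA and reports how much variance the 2D view retains, which is a rough indication of how faithful the picture is.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the chapter-order table
# (assumption: 200 users, 50 chapter columns, values 0-11).
rng = np.random.default_rng(0)
X = rng.integers(0, 12, size=(200, 50)).astype(float)

# Cluster on all feature columns at once.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Reduce the same features to 2 components, for plotting only.
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X)

# Project the cluster centers into the same 2D space.
centers_2d = pca.transform(kmeans.cluster_centers_)

# Fraction of the total variance kept by the two components.
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained by 2 components: {retained:.2%}")


def plot_clusters(coords, labels, centers_2d):
    """Scatter the 2D projection, colored by cluster, with centers marked."""
    import matplotlib.pyplot as plt

    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="rainbow",
                alpha=0.7, edgecolor="k")
    plt.scatter(centers_2d[:, 0], centers_2d[:, 1], c="black",
                marker="x", s=100)
    plt.xlabel("pca1")
    plt.ylabel("pca2")
    plt.title("K-means clusters projected onto 2 PCA components")
    plt.show()
```

Calling `plot_clusters(coords, labels, centers_2d)` draws the figure. Note that the clustering itself still runs in the full feature space; PCA is used only to draw it, so clusters that overlap in the plot may still be separated in the original dimensions.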

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 SAL