How to plot clusters using k-means with a dataset of tweets?

I have a dataset containing tweets. After preprocessing the tweets, I tried clustering them:

# output the result to a text file.

clusters = df.groupby('cluster')    

for cluster in clusters.groups:
    # create one csv file per cluster; the with-block closes it automatically
    with open('cluster' + str(cluster) + '.csv', 'w') as f:
        data = clusters.get_group(cluster)[['id', 'Tweets']]  # get id and tweets columns
        f.write(data.to_csv(index_label='id'))  # label the index column 'id'


print("Cluster centroids: \n")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0

for i in range(k):
    print("Cluster %d:" % i)
    for j in order_centroids[i, :10]:  # print the 10 top-weighted terms of each cluster
        print(' %s' % terms[j])
    print('------------')

This groups my tweets into 6 clusters. How can I plot them in 2D as a whole?



Solution 1:[1]

First of all, if you're using the KMeans clustering algorithm, you always specify the number of clusters manually, so you could simply set it to 2.

But if you've somehow decided (elbow method, silhouette score or something else) that 6 clusters are better than two, you should apply some dimensionality reduction (sklearn's PCA, TSNE, etc.) to your features and then scatter-plot them, with point colors corresponding to the clusters. This will look like this:

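from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# X is assumed to be a dense feature matrix (call X.toarray() first if it is a sparse TF-IDF matrix)
pca = PCA(n_components=2, svd_solver='full')
X_decomposed = pca.fit_transform(X)

# clustering holds the cluster label of each row, e.g. model.labels_
plt.scatter(X_decomposed[:, 0], X_decomposed[:, 1], c=clustering)
plt.show()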
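
If your features are a sparse TF-IDF matrix, as is typical for tweets, one possible variation is to use TruncatedSVD instead of PCA, since it works on sparse input directly. The sketch below is only an illustration under assumptions: the `tweets` list is made-up stand-in data, `k = 2` is arbitrary, and `plot_clusters` is a hypothetical helper, not something from the question. It also projects the cluster centroids into the same 2D space so they can be drawn on top of the points.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in for the preprocessed tweets (assumption, not the asker's data).
tweets = [
    "the cat sat on the mat",
    "my cat chased another cat",
    "cats and kittens sleep all day",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
    "the market rallied and stock prices rose",
]

# TF-IDF features; fit_transform returns a sparse matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# Cluster in the full feature space (k is chosen arbitrarily here).
k = 2
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Reduce to 2 components for plotting only; TruncatedSVD accepts sparse input.
svd = TruncatedSVD(n_components=2, random_state=0)
points_2d = svd.fit_transform(X)

# Project the centroids with the same fitted transform.
centers_2d = svd.transform(model.cluster_centers_)


def plot_clusters(points, labels, centers):
    """Scatter-plot 2D points colored by cluster label, with centroids as crosses."""
    # Imported here so the reduction above can run without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10")
    ax.scatter(centers[:, 0], centers[:, 1], c="black", marker="x", s=100)
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
    return fig
```

Calling `plot_clusters(points_2d, model.labels_, centers_2d)` and then showing or saving the returned figure gives all clusters in one 2D plot. Keep in mind that the clustering happens in the full feature space and the projection is only for display, so clusters that look overlapping in 2D may still be well separated in the original space.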

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 K0mp0t