'Python Kmeans visualization (High Dimensions)

I have to cluster my customers whose have more than 15 dimensions with Python. Can you advice me, is it correct to visualize clusters after Kmeans with T-SNE method? I received very good plot even with outliers in my data-frame. This makes me doubt if I'm doing everything right. My colleagues who make clustering with High Dimensions on R don't use any method of dimensionality reduction like PCA or T-SNE and noticed that probably I'm not correct. It's my first experience in it. Thanks in advance for your help.

I wrote my code from example : https://www.kaggle.com/minc33/visualizing-high-dimensional-clusters

My code:

#import libraries
import numpy as np
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import pandas.io.sql as psql
import plotly.graph_objs as go
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE 
from sklearn.cluster import KMeans
from scipy import stats
import matplotlib.pyplot as plt
#download my dataframe
data = pd.read_csv ('C:\\Users\\Desktop\python\ex.csv')
#use data without customer id
d=data.iloc[:,1:20]
#Kmeans
X =d.copy()
scaler = MinMaxScaler()
numer = pd.DataFrame(scaler.fit_transform(X))
kmeans = KMeans(n_clusters=3)
kmeans.fit(numer)
clusters = kmeans.predict(numer)
numer["Cluster"] = clusters
#visualisation
plotX = pd.DataFrame(np.array(numer.sample(10000)))
plotX.columns =numer.columns
perplexity = 50
tsne_2d = TSNE(n_components=2, perplexity=perplexity)
TCs_2d = pd.DataFrame(tsne_2d.fit_transform(plotX.drop(["Cluster"], axis=1)))
TCs_2d.columns = ["TC1_2d","TC2_2d"]
plotX = pd.concat([plotX,TCs_2d], axis=1, join='inner')
cluster0 = plotX[plotX["Cluster"] == 0]
cluster1 = plotX[plotX["Cluster"] == 1]
cluster2 = plotX[plotX["Cluster"] == 2]
trace1 = go.Scatter(
                    x = cluster0["TC1_2d"],
                    y = cluster0["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["TC1_2d"],
                    y = cluster1["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)
#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["TC1_2d"],
                    y = cluster2["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
data = [trace1, trace2, trace3]
title = "Visualizing Clusters in Two Dimensions Using T-SNE (perplexity=" + str(perplexity) + ")"
layout = dict(title = title,
              xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'TC2',ticklen= 5,zeroline= False)
             )
fig = dict(data = data, layout = layout)
plot(fig)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source

'Python Kmeans visualization (High Dimensions)

Sources

Related Questions