Python K-means visualization (high dimensions)
I have to cluster my customers, who are described by more than 15 dimensions, with Python.
Can you advise me whether it is correct to visualize clusters after K-means with the t-SNE method? I got a very clean plot even though my data frame contains outliers, which makes me doubt whether I'm doing everything right. My colleagues, who cluster high-dimensional data in R, don't use any dimensionality reduction method such as PCA or t-SNE, and they suggested that my approach is probably incorrect.
This is my first experience with this. Thanks in advance for your help.
I based my code on this example: https://www.kaggle.com/minc33/visualizing-high-dimensional-clusters
My code:
# Import libraries (only the ones the script below actually uses)
import numpy as np
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import plot
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
# Load my data frame (raw string, so the backslashes are not treated as escapes)
data = pd.read_csv(r'C:\Users\Desktop\python\ex.csv')
# Use the data without the customer id column
d = data.iloc[:, 1:20]
# K-means on the min-max scaled features
X = d.copy()
scaler = MinMaxScaler()
numer = pd.DataFrame(scaler.fit_transform(X))
kmeans = KMeans(n_clusters=3)
# fit_predict fits the model and returns the cluster label of each row
clusters = kmeans.fit_predict(numer)
numer["Cluster"] = clusters
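# Visualisation: embed a random sample of 10,000 rows with t-SNE
# (going through np.array resets the index, which the concat below relies on)
plotX = pd.DataFrame(np.array(numer.sample(10000)))
plotX.columns = numer.columns
perplexity = 50
tsne_2d = TSNE(n_components=2, perplexity=perplexity)
# Drop the cluster labels so that they do not influence the embedding
TCs_2d = pd.DataFrame(tsne_2d.fit_transform(plotX.drop(["Cluster"], axis=1)))
TCs_2d.columns = ["TC1_2d", "TC2_2d"]
plotX = pd.concat([plotX, TCs_2d], axis=1, join='inner')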
cluster0 = plotX[plotX["Cluster"] == 0]
cluster1 = plotX[plotX["Cluster"] == 1]
cluster2 = plotX[plotX["Cluster"] == 2]
# trace1 is for 'Cluster 0'
trace1 = go.Scatter(
    x=cluster0["TC1_2d"],
    y=cluster0["TC2_2d"],
    mode="markers",
    name="Cluster 0",
    marker=dict(color='rgba(255, 128, 255, 0.8)'),
    text=None)
# trace2 is for 'Cluster 1'
trace2 = go.Scatter(
    x=cluster1["TC1_2d"],
    y=cluster1["TC2_2d"],
    mode="markers",
    name="Cluster 1",
    marker=dict(color='rgba(255, 128, 2, 0.8)'),
    text=None)
# trace3 is for 'Cluster 2'
trace3 = go.Scatter(
    x=cluster2["TC1_2d"],
    y=cluster2["TC2_2d"],
    mode="markers",
    name="Cluster 2",
    marker=dict(color='rgba(0, 255, 200, 0.8)'),
    text=None)
# Named "traces" so that it does not overwrite the data frame "data" loaded above
traces = [trace1, trace2, trace3]
title = "Visualizing Clusters in Two Dimensions Using T-SNE (perplexity=" + str(perplexity) + ")"
layout = dict(
    title=title,
    xaxis=dict(title='TC1', ticklen=5, zeroline=False),
    yaxis=dict(title='TC2', ticklen=5, zeroline=False),
)
fig = dict(data=traces, layout=layout)
plot(fig)
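Note that in the code above K-means is fitted on all scaled columns, and t-SNE is used only to draw the result. A clean t-SNE picture therefore says little by itself about how good the clusters are. One possible cross-check, sketched below, is to score the clustering in the original space with the silhouette coefficient and to look at a linear PCA projection as a second view. This is only an illustration under assumptions: the data is synthetic (`make_blobs` stands in for the customer table), and `n_init` and `random_state` are arbitrary choices that are not part of my script.

```python
# Sketch only: synthetic data stands in for the customer table (assumption).
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# 600 hypothetical "customers" with 15 numeric features, drawn from 3 groups
X, _ = make_blobs(n_samples=600, n_features=15, centers=3, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

# Cluster in the full 15-dimensional space, as in the script above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# The silhouette coefficient is computed in the original feature space,
# so it does not depend on any 2-D projection (range: -1 to 1, higher is better)
score = silhouette_score(X_scaled, labels)

# Linear 2-D projection as a second view of the same labels
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

print(f"silhouette: {score:.3f}")
print(f"variance kept by 2 PCA components: {explained:.1%}")
```

If the silhouette value is low while the t-SNE plot still looks well separated, the separation is more likely an artifact of the embedding than a property of the data. `coords` can be plotted in the same way as `TC1_2d` and `TC2_2d` above.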
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow