K-Means in R vs K-Means in Python: different cluster values generating different bar graphs

Below are two sets of code that do the same thing, one in Python and the other in R. Both plot the K-means clusters the same way with respect to the PCA, but when I draw the bar chart at the end using the cluster centers, the graphs are totally different. I believe something is wrong with the K-means cluster calculation in Python. The original code was provided in R. I am trying to work out why the bar chart in Python does not match the one in R; I believe the cause is the centers. Please review and provide some feedback.

Please use the link below to download the data set I used to generate these graphs.

https://www.dropbox.com/s/fhnxxrjl07y0h2c/TableStats2.csv?dl=0

R Code

## Retrieve libraries needed for script
library("ggplot2")
library("reshape2")
  

pcp <- read.csv(file='E:\\ProgramData\\R\\Code\\TableStats2.csv')

#Label each row with table Name to Plot names on chart.
data <- pcp
rownames(data) <- data[, 1]


#Gather all the data and leave out Table Names
data <- data[, -1]
data <- data[, -1]

#Create the PCA (Principal Component Analysis)
data <- scale(data)
pca <- prcomp(data)

plot.data <- data.frame(pca$x[, 1:2])

set.seed(2121)
clusters <- kmeans(data, 6)
plot.data$clusters <- factor(clusters$cluster)

g <- ggplot(plot.data, aes(x = PC1, y = PC2, colour = clusters)) +
  geom_point(size = 3.5) +
  geom_text(label = rownames(data), colour = "darkgrey", hjust = .7) +
  theme_bw()

behaviours <- data.frame(clusters$centers)
behaviours$cluster <- 1:6
behavious <- melt(behaviours, "cluster")

g2 <- ggplot(behavious, aes(x = variable, y = value)) +
  geom_bar(stat = "identity", position = 'identity', fill = "steelblue") +
  facet_wrap(~cluster) +
  theme_grey() +
  theme(axis.text.x = element_text(angle = 90)) 

Python Code

import pandas as pd
import seaborn as sns  # used for the scatter plots further down
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
from plotnine import ggplot, aes, geom_line, geom_bar, facet_wrap, theme_grey, theme, element_text

TableStats = pd.read_csv(r'E:\ProgramData\R\Code\TableStats2.csv')

sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables

features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values

x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]
plt.figure(figsize=(20,11))
plot = plt.scatter(x1,y1, c=y.index.tolist())
for i, label in enumerate(y):
  #print(label)
  plt.annotate(label,(x1[i], y1[i]))
plt.show()

df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']) 

clusters = KMeans(n_clusters=6,init='k-means++', random_state=2121).fit(df)

df['Cluster'] = clusters.labels_
df['Cluster Centroid D1'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0])
df['Cluster Centroid D2'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1])
df['tables'] = tables

#print Table Names
plt.figure(figsize=(20, 11))
ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000, palette='coolwarm', legend=False, alpha=0.1)
for line in range(0,df.shape[0]):
    ax.text(x1[line]+0.05, y1[line],TableStats['Object Name'][line], horizontalalignment='left',size='medium', color='black',weight='semibold')

plt.legend(loc='upper right', title='Cluster')
ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
plt.show()

# Here is where the R and Python graphs differ, because the cluster centers don't match
behaviours = pd.DataFrame(clusters.cluster_centers_)
behaviours.columns = clusters.feature_names_in_
behaviours['cluster'] = [1,2,3,4,5,6]

b2 = pd.melt(behaviours, id_vars = "cluster",value_name="value")

(ggplot(b2, aes(x = 'variable', y = 'value')) + 
geom_bar(stat = "identity", position = 'identity', fill = "steelblue") + 
facet_wrap('~cluster') + 
theme_grey() + 
theme(axis_text_x = element_text(rotation = 90, hjust=1), figure_size=(20,8)) 
)


Solution 1:[1]

Update: I now have this working in both R and Python.

Looking at this specific problem, check the outputs of the PCA: they're different, so the k-means results won't be the same. The reason is in your R code: you repeat the line data <- data[, -1], so it drops the table names and then also drops the first column of the data. Remove the extra line, and the clusters look the same.
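One further thing that may be worth checking, based only on reading the two listings in the question and not confirmed by the answer above: the R script calls kmeans on the scaled data, while the Python script fits KMeans on the PCA scores (dpca) and then labels those columns with the original feature names. A full PCA is a centering plus a rotation, so the grouping can agree, but the centers come out in principal-component coordinates, and a bar chart of them would not match R's. The sketch below uses synthetic data from make_blobs (an assumption, not the TableStats2.csv data) and a hand-picked initialisation to show that pca.inverse_transform maps the PCA-space centers back onto the feature-space ones.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the table statistics (assumption: 7 features, 6 groups).
X_raw, y_true = make_blobs(n_samples=120, n_features=7, centers=6,
                           cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# PCA with all components kept: centering followed by a rotation.
pca = PCA()
Z = pca.fit_transform(X)

# Start both runs from the same six points so the comparison is deterministic.
init_idx = [int(np.flatnonzero(y_true == k)[0]) for k in range(6)]

km_features = KMeans(n_clusters=6, init=X[init_idx], n_init=1).fit(X)
km_pca = KMeans(n_clusters=6, init=Z[init_idx], n_init=1).fit(Z)

# The grouping is the same in both spaces ...
same_labels = np.array_equal(km_features.labels_, km_pca.labels_)

# ... but the centers only agree after rotating the PCA-space ones back.
centers_back = pca.inverse_transform(km_pca.cluster_centers_)
centers_match = np.allclose(centers_back, km_features.cluster_centers_, atol=1e-6)

print("same labels:", same_labels, "| centers match after inverse_transform:", centers_match)
```

If this applies to your data, either fit KMeans on the scaled features directly or pass the centers through pca.inverse_transform before building the bar chart.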


General comments on R and Python implementation of kmeans

In general, R and Python use different algorithms by default. R's kmeans uses "Hartigan-Wong" by default, while scikit-learn's default depends on the version: older releases select "elkan" under the default algorithm='auto', and releases from 1.1 onward default to "lloyd". Set algorithm='Lloyd' in R and, in Python, algorithm='full' on older scikit-learn or algorithm='lloyd' on 1.1 and later (both names refer to Lloyd's algorithm) to ensure they're at least attempting the same thing.

You also have different initialisation methods - R is random and for Python you are using 'k-means++'. Set init='random' in Python to make these match.

They also have different default maximum numbers of iterations: R's iter.max defaults to 10, Python's max_iter to 300. Set these equal as well.

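Finally, setting random_state in the Python KMeans call fixes the result, so you won't see any random variation between runs of your Python script; set.seed does the same in R. The same seed value does not produce the same random numbers in R and Python, so remove both (or vary them) when you want to compare how much the results vary from run to run.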

Once you've done this, try running both multiple times, and compare the distributions of values. Hopefully you'll see overlap between the two implementations.
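The comparison suggested above can be sketched in Python as follows: run k-means from several random starts and look at the spread of the total within-cluster sum of squares (inertia_ in scikit-learn; the corresponding value in R's kmeans result is tot.withinss). The synthetic data and the choice of 20 seeds are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled data (assumption: 7 features, 6 groups).
X, _ = make_blobs(n_samples=200, n_features=7, centers=6, random_state=0)

inertias = []
for seed in range(20):
    # One random start per seed, capped at 10 iterations as in R's default.
    km = KMeans(n_clusters=6, init="random", n_init=1,
                max_iter=10, random_state=seed).fit(X)
    inertias.append(km.inertia_)

inertias = np.sort(np.array(inertias))
print("min / median / max inertia:", inertias[0], np.median(inertias), inertias[-1])
```

Collecting the same summary from repeated R runs gives two distributions that can be compared directly, which is more informative than comparing a single run from each side.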

Check out the docs for the R implementation and the scikit-learn implementation.

And a final point here - kmeans is unsupervised. The class labels have no absolute meaning. Run the code multiple times, and class 0 will not always be assigned to the same data points, even if data points are grouped identically.

Here's a reproducible example of this:

import pandas as pd
from sklearn import cluster, datasets

from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

X, y = datasets.make_blobs(100,2,centers=6)
df = pd.DataFrame(X)

random_states = list(range(0,60,10))
fig, ax = plt.subplots(3,2, figsize=(20,16))
for i, r in enumerate(random_states):

    clusters = KMeans(n_clusters=6,init='k-means++', random_state=r).fit(X)

    df = (df
      .assign(**{
          'Cluster': clusters.labels_,
          'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
          'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
      })
     )
    
    row = i//2
    col = i - row*2
    sns.scatterplot(data=df, x=0, y=1, hue='Cluster', s=200, palette='coolwarm', legend=True, ax=ax[row,col])
    sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000, 
                    palette='coolwarm', legend=False, alpha=0.1, ax=ax[row,col])   
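Because the label numbers are arbitrary, comparing two runs label by label can be misleading. One option, sketched here as a suggestion rather than part of the original answer, is scikit-learn's adjusted_rand_score, which ignores how the clusters are numbered. The renumbering below is applied by hand (a hypothetical permutation) to imitate a second run, or an R run, that found the same groups under different numbers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=100, n_features=2, centers=6, random_state=0)
labels_a = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# Hypothetical renumbering: cluster 0 becomes 1, 1 becomes 2, ..., 5 becomes 0.
renumber = np.array([1, 2, 3, 4, 5, 0])
labels_b = renumber[labels_a]

# Fraction of points whose label number is identical in both labelings.
elementwise_agreement = np.mean(labels_a == labels_b)
# Agreement of the groupings themselves, independent of the numbering.
ari = adjusted_rand_score(labels_a, labels_b)

print("element-wise agreement:", elementwise_agreement)
print("adjusted Rand index:", ari)
```

Here no point keeps its label number, yet the adjusted Rand index reports identical groupings. The same call works on labels exported from R, as long as the rows are in the same order.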

Here's a version with your data:

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

TableStats = pd.read_csv('TableStats2.csv')

sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables

features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
            'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']

# Separating out the features
x = TableStats.loc[:, features].values

x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]

random_states = [1,2,3,4,5,6]
for r in random_states:
    df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
                                      'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']) 
    clusters = KMeans(n_clusters=6,init='k-means++', random_state=r).fit(df)

    df = (df
          .assign(**{
              'Cluster': clusters.labels_,
              'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
              'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
          })
         )
    
    plt.figure(figsize=(20, 11))
    ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
    ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', 
                         s=1000, palette='coolwarm', legend=False, alpha=0.1)    

    plt.legend(loc='upper right', title='Cluster')
    ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
    plt.show()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1