Determining which K value performed better

I have two datasets, each representing a k-means clustering result. I am trying to determine which K value did better at producing clusters that contain a similar (or the same) number of each asset.

I have results for 120 assets using K = 3 and K = 6. It appears to me that K = 3 did better than K = 6 at producing clusters with a similar (or the same) number of each asset, but I would like to verify this observation. I have thought about using t.test, but I am not sure it is the correct approach.

R with t.test

Values <- matrix(c(9,   4,  2,  1,  7,  6,
                   1,   1,  2,  2,  1,  1,
                   1,   3,  3,  6,  1,  1,
                   1,   3,  3,  1,  2,  2), nrow = 4, ncol = 6, byrow = TRUE)

Values2 <- matrix(c(2,  9,  9,  2,  7,  9,
                   4,   2,  3,  4,  3,  2,
                   2,   1,  2,  1,  1,  1,
                   3,   2,  1,  3,  4,  1,
                   6,   6,  3,  7,  5,  7,
                   3,   1,  2,  4,  1,  1), nrow = 6, ncol = 6, byrow = TRUE)

t.test(Values, Values2, paired = FALSE)

Welch Two Sample t-test

data:  Values and Values2
t = -1.2633, df = 53.308, p-value = 0.212
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.012466  0.456910
sample estimates:
mean of x mean of y 
 2.666667  3.444444

My observation is based on plotting both results as bar charts:

[Bar chart: K = 6]

[Bar chart: K = 3]

r statistics cluster-analysis k-means

Solution 1:[1]

library(tidyverse)

Values <- matrix(c(
  9, 4, 2, 1, 7, 6,
  1, 1, 2, 2, 1, 1,
  1, 3, 3, 6, 1, 1,
  1, 3, 3, 1, 2, 2
), nrow = 4, ncol = 6, byrow = TRUE)

# get number of clusters k for which the cluster sizes are most similar
tibble(
  k = seq(2, 3),
  cluster_size_var = k %>% map_dbl(~ Values %>%
    kmeans(.x) %>%
    pluck("cluster") %>%
    table() %>%
    var())
) %>%
  arrange(cluster_size_var) %>%
  head(1)
#> # A tibble: 1 × 2
#>       k cluster_size_var
#>   <int>            <dbl>
#> 1     3            0.333

Created on 2022-03-15 by the reprex package (v2.0.0)
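The answer above ranks k by the variance of the cluster sizes. Two caveats may be worth keeping in mind: kmeans() starts from random centers, so the result can change between runs, and with a fixed number of observations the average cluster size shrinks as k grows, so raw variances at different k may not be directly comparable. The following is a minimal base-R sketch, not part of the original answer, that fixes a seed and also reports the coefficient of variation (standard deviation divided by mean). The helper name size_balance, the seed value, and nstart = 10 are assumptions chosen for illustration.

```r
# Hypothetical helper (not from the original answer): summarise how evenly
# observations are spread across clusters, given the cluster sizes.
size_balance <- function(sizes) {
  sizes <- as.numeric(sizes)
  c(
    var = var(sizes),               # the measure used in the answer above
    cv  = sd(sizes) / mean(sizes)   # scale-free: 0 means perfectly even sizes
  )
}

Values <- matrix(c(
  9, 4, 2, 1, 7, 6,
  1, 1, 2, 2, 1, 1,
  1, 3, 3, 6, 1, 1,
  1, 3, 3, 1, 2, 2
), nrow = 4, ncol = 6, byrow = TRUE)

set.seed(1)  # arbitrary seed, assumed here only to make kmeans() repeatable

ks <- 2:3
balance <- t(sapply(ks, function(k) {
  fit <- kmeans(Values, centers = k, nstart = 10)  # several random starts
  size_balance(fit$size)                           # fit$size = points per cluster
}))

# One row per k, most balanced (lowest cv) first
result <- data.frame(k = ks, balance)
result[order(result$cv), ]
```

As a quick check of the helper on made-up sizes, size_balance(c(40, 40, 40)) gives 0 for both measures, which is what perfectly even clusters of 120 assets at K = 3 would look like. Sorting on cv instead of var is a design choice under the assumption that results for different k should be compared on a common scale.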

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	danlooo
