Determining which K value performed better
I have two datasets, each representing a K-means clustering result. I am trying to find a way to tell which value of K did better at creating clusters that contain a similar (or the same) number of each asset.
I have results for 120 assets using K = 3 and K = 6. It appears to me that K = 3 did better than K = 6 at producing clusters with a similar (or the same) number of each asset, but I would like to verify this observation. I have thought about using t.test, but I am not sure whether this is the correct approach.
R with t.test
Values <- matrix(c(9, 4, 2, 1, 7, 6,
                   1, 1, 2, 2, 1, 1,
                   1, 3, 3, 6, 1, 1,
                   1, 3, 3, 1, 2, 2), nrow = 4, ncol = 6, byrow = TRUE)
Values2 <- matrix(c(2, 9, 9, 2, 7, 9,
                    4, 2, 3, 4, 3, 2,
                    2, 1, 2, 1, 1, 1,
                    3, 2, 1, 3, 4, 1,
                    6, 6, 3, 7, 5, 7,
                    3, 1, 2, 4, 1, 1), nrow = 6, ncol = 6, byrow = TRUE)
t.test(Values, Values2, paired = FALSE)
Welch Two Sample t-test
data: Values and Values2
t = -1.2633, df = 53.308, p-value = 0.212
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.012466 0.456910
sample estimates:
mean of x mean of y
2.666667 3.444444
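As a hedged aside on what this test measures: t.test treats each matrix as a plain vector of cell counts, so the "sample estimates" above are simply the overall means of the two matrices. The sketch below reproduces those means and, as one possible (assumed, not from the original post) way to look at spread rather than average, computes the standard deviation relative to the mean for each matrix. The variable names are illustrative.

```r
Values <- matrix(c(9, 4, 2, 1, 7, 6,
                   1, 1, 2, 2, 1, 1,
                   1, 3, 3, 6, 1, 1,
                   1, 3, 3, 1, 2, 2), nrow = 4, ncol = 6, byrow = TRUE)
Values2 <- matrix(c(2, 9, 9, 2, 7, 9,
                    4, 2, 3, 4, 3, 2,
                    2, 1, 2, 1, 1, 1,
                    3, 2, 1, 3, 4, 1,
                    6, 6, 3, 7, 5, 7,
                    3, 1, 2, 4, 1, 1), nrow = 6, ncol = 6, byrow = TRUE)

# Overall mean of all cells: this is what the Welch test compares.
mean_values  <- mean(Values)
mean_values2 <- mean(Values2)

# Assumption for illustration: relative spread (sd / mean) of all cells,
# which speaks to evenness of the counts rather than their average level.
rel_spread_values  <- sd(Values) / mean_values
rel_spread_values2 <- sd(Values2) / mean_values2

print(c(mean_values = mean_values, mean_values2 = mean_values2))
print(c(rel_spread_values = rel_spread_values,
        rel_spread_values2 = rel_spread_values2))
```

A non-significant difference in means therefore says little about whether one result is more evenly balanced than the other; a spread-based summary is closer to the stated goal.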
My observation is based on plotting the results as bar charts:
[Bar chart: K = 6]
vs.
[Bar chart: K = 3]
Solution 1:[1]
library(tidyverse)
Values <- matrix(c(
9, 4, 2, 1, 7, 6,
1, 1, 2, 2, 1, 1,
1, 3, 3, 6, 1, 1,
1, 3, 3, 1, 2, 2
), nrow = 4, ncol = 6, byrow = TRUE)
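# For each candidate k, run k-means and compute the variance of the cluster
# sizes; the k with the smallest variance has the most similar cluster sizes.
# Note: kmeans() uses random starting centres, so results can vary between runs.
tibble(
  k = seq(2, 3),
  cluster_size_var = k %>% map_dbl(~ Values %>%
    kmeans(.x) %>%
    pluck("cluster") %>%
    table() %>%
    var())
) %>%
  arrange(cluster_size_var) %>%
  head(1)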
#> # A tibble: 1 × 2
#>       k cluster_size_var
#>   <int>            <dbl>
#> 1     3            0.333
Created on 2022-03-15 by the reprex package (v2.0.0)
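The idea in the answer (smaller variance of cluster sizes means more evenly sized clusters) can also be sketched in base R for label vectors you already have. This is a minimal sketch under stated assumptions: `cluster_balance` is a hypothetical helper, not part of the original answer, and the label vectors are made-up data for 120 assets. Because the average cluster size differs between K = 3 and K = 6, the sketch also reports the coefficient of variation (sd / mean), which is one way to make the two results comparable.

```r
# Hypothetical helper (not from the original answer): summarises how evenly
# a vector of integer cluster labels spreads observations across k clusters.
cluster_balance <- function(labels, k) {
  sizes <- tabulate(labels, nbins = k)  # counts per cluster, empty ones included
  c(size_var = var(sizes), size_cv = sd(sizes) / mean(sizes))
}

# Made-up label vectors for 120 assets, for illustration only.
labels_k3 <- rep(1:3, times = c(40, 41, 39))
labels_k6 <- rep(1:6, times = c(35, 5, 30, 10, 25, 15))

balance_k3 <- cluster_balance(labels_k3, k = 3)
balance_k6 <- cluster_balance(labels_k6, k = 6)

# One row per clustering result; lower values indicate more balanced clusters.
print(rbind(k3 = balance_k3, k6 = balance_k6))
```

With real results, the `cluster` component returned by kmeans() could be passed as `labels`.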
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | danlooo |


