How to enumerate clusters in sorted order by average cluster values?
I have created a function for clustering that is applied on various dataframes and columns. I would like to have a consistent cluster ordering, ordered by mean values.
For example, the ordering of the following clusters (index: mean value) is OK:
0: 1.5
1: 2.0
2: 2.5
and the ordering of the following clusters is not OK:
0: 2.0
1: 1.6
2: 1.2
It should be
0: 1.2
1: 1.6
2: 2.0
The function that groups the selected column to clusters is the following:
from sklearn.cluster import KMeans
from sklearn.preprocessing import robust_scale

def get_clustered_values(df, column_value, column_clustered, clusters_count):
    # df - pandas dataframe
    # column_value - name of the column for which the values are going to be split to clusters
    # column_clustered - name of the column of cluster indices
    value_scaled = robust_scale(df[column_value])
    kmeans = KMeans(n_clusters=clusters_count)
    clusters = kmeans.fit(value_scaled.reshape(-1, 1))
    df[column_clustered] = clusters.labels_
Does anyone know how to assign cluster indices such that the cluster with the lowest mean value has index 0 and the cluster with the highest mean value has index clusters_count - 1?
Solution 1:[1]
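Use transform after groupby to broadcast each cluster's mean to its rows, then sort_values before extracting the index. Select the value column before calling transform so that the result is a Series that sort_values can order directly. This returns the rows of df ordered by their cluster's mean value:
df.loc[df.groupby(column_clustered)[column_value].transform('mean').sort_values().index]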
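If the goal is to renumber the cluster labels themselves rather than reorder the rows, one possible approach is to rank the per-cluster means and map the old labels onto those ranks. The following is a minimal sketch, not part of the original answer: the helper name relabel_clusters_by_mean and the sample data are hypothetical, and it assumes the label column has already been filled in (for example by get_clustered_values).

```python
import pandas as pd

def relabel_clusters_by_mean(df, column_value, column_clustered):
    # Hypothetical helper: renumber existing cluster labels by ascending mean.
    # Mean of the value column within each existing cluster label.
    cluster_means = df.groupby(column_clustered)[column_value].mean()
    # Rank the means so the lowest mean gets 0, the next gets 1, and so on.
    # method="first" breaks ties by order of appearance, keeping labels unique.
    new_labels = cluster_means.rank(method="first").astype(int) - 1
    # Replace each row's old label with its mean-ordered label.
    df[column_clustered] = df[column_clustered].map(new_labels)
    return df

# Made-up sample data: labels 0, 1, 2 have means in descending order.
df = pd.DataFrame({
    "value": [2.1, 1.9, 1.6, 1.6, 1.2, 1.2],
    "cluster": [0, 0, 1, 1, 2, 2],
})
relabel_clusters_by_mean(df, "value", "cluster")
print(df.groupby("cluster")["value"].mean())
```

Because the ranking uses only the label column and the original values, the same helper can be called at the end of the clustering function regardless of how the values were scaled before fitting.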
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Corralien |
