How to enumerate clusters in sorted order by average cluster values?

I have created a function for clustering that is applied on various dataframes and columns. I would like to have a consistent cluster ordering, ordered by mean values.

For example, the following cluster ordering is OK:

0: 1.5
1: 2.0
2: 2.5

and the following cluster ordering is not OK:

0: 2.0
1: 1.6
2: 1.2

It should be

0: 1.2
1: 1.6
2: 2.0

The function that groups the selected column to clusters is the following:

from sklearn.cluster import KMeans
from sklearn.preprocessing import robust_scale

def get_clustered_values(df, column_value, column_clustered, clusters_count):
    # df - pandas dataframe
    # column_value - name of the column for which the values are going to be split to clusters
    # column_clustered - name of the column of cluster indices
    value_scaled = robust_scale(df[column_value])
    kmeans = KMeans(n_clusters=clusters_count)
    clusters = kmeans.fit(value_scaled.reshape(-1,1))
    df[column_clustered] = clusters.labels_

Does anyone know how to assign cluster indices such that the cluster with the lowest mean value has index 0 and the cluster with the highest mean value has index clusters_count - 1?



Solution 1:[1]

Use transform after groupby to broadcast each cluster's mean of the value column to its rows, then sort_values and use the resulting index to reorder the rows:

df.loc[df.groupby(column_clustered)[column_value].transform('mean').sort_values().index]
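The line above reorders the rows by cluster mean but keeps the original label values. If the labels themselves should run from 0 (lowest mean) to clusters_count - 1 (highest mean), one possible approach is to rank the raw labels by their mean and remap them. The following is only a sketch of the question's function extended that way, not part of the original answer. The n_init and random_state arguments, the example dataframe, the two-dimensional column selection df[[column_value]], and the return value are assumptions added to make it reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import robust_scale


def get_clustered_values(df, column_value, column_clustered, clusters_count):
    # Select the column as a 2-D frame so no reshape is needed afterwards.
    value_scaled = robust_scale(df[[column_value]])
    # n_init and random_state are assumptions, fixed here for repeatability.
    kmeans = KMeans(n_clusters=clusters_count, n_init=10, random_state=0)
    labels = kmeans.fit(value_scaled).labels_

    # Mean of the original values for each raw label, lowest mean first.
    means = df[column_value].groupby(labels).mean().sort_values()

    # rank[raw_label] is the position of that label among the sorted means.
    rank = np.empty(clusters_count, dtype=int)
    rank[means.index.to_numpy()] = np.arange(len(means))

    # Replace every raw label with its rank.
    df[column_clustered] = rank[labels]
    return df


# Hypothetical example data, loosely based on the values in the question.
example = pd.DataFrame({"value": [2.0, 2.1, 1.2, 1.1, 1.6, 1.7]})
get_clustered_values(example, "value", "cluster", 3)
```

Computing the means from the original column rather than from the scaled values keeps the ordering tied to the numbers you actually report. Grouping the result by the new cluster column and taking the mean should then give values that increase with the label.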

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Corralien