Balance different groups of values in a dataframe in an equally spaced manner
import pandas as pd

# Simplified time series: blocks of 'r' interleaved with rarer values
ll = [['r']*5, ['sq']*3, ['r']*5, ['pu']*2, ['r']*5, ['dsp']*3, ['r']*5, ['l']*5, ['r']*5, ['dr']*3, ['r']*5, ['si']*4, ['r']*5,
      ['te']*2, ['r']*5, ['bc']*3, ['r']*5, ['lsr']*2, ['r']*5, ['jj'], ['r']*5]
l = [item for sublist in ll for item in sublist]  # flatten into a single list
df_l = pd.DataFrame(l)
The list ll is a simplified version of a time series list (time stamps are omitted here for simplicity). It contains 11 unique elements. I want to balance the elements, meaning that all elements should appear more or less equally often.
Example: As you can see, 'jj' appears only once. Therefore I want to reduce the other elements so that they appear only once, too. (In my actual application, no element appears only once. The least frequent element 'jj' appears 2000 times, while 'r' appears 170000 times. This is just a representative simplification.)
Since I am dealing with time series data, I cannot simply delete random rows of the more frequent elements until all element groups are balanced, since this could destroy time series patterns. Instead, I want the over-represented entries to be removed in an equally spaced manner (e.g. if 'te' appears twice as often as 'jj', I want to delete every second row of 'te'). This ensures that only the "resolution" of the time series patterns is reduced, while the patterns themselves remain. How can I do this efficiently?
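Just to make the intended behaviour concrete, here is a minimal sketch for a single value (the factor of 2 and the name df_reduced are only illustrative, not an attempt at a general solution):
# Keep every second 'te' row (equally spaced) and drop the rest,
# leaving all other values untouched
te_rows = df_l[df_l[0] == 'te']          # all rows whose value is 'te'
keep = te_rows.iloc[::2].index           # every second of them
df_reduced = df_l.drop(te_rows.index.difference(keep))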
I would present my attempts, but I have no idea how to approach this for all groups at once.
Solution 1:[1]
What about:
s = df_l[0]  # the column holding the values
# Group the rows into runs of consecutive duplicates
groups = [g for _, g in df_l.groupby((s != s.shift()).cumsum())]
# Find the size of the smallest run of consecutive values
n = min(len(g) for g in groups)
# Trim each run to its first n rows and stack them into a new DataFrame
df2 = pd.concat((g[:n] for g in groups), ignore_index=True)
For your data, the result is:
      0
0     r
1    sq
2     r
3    pu
4     r
5   dsp
6     r
7     l
8     r
9    dr
10    r
11   si
12    r
13   te
14    r
15   bc
16    r
17  lsr
18    r
19   jj
20    r
Sidenote: While it seems like it might do what you're asking, I just want to point out that the representation/relative frequency of the distinct values is completely lost.
A better solution might be to reduce each group by the same factor rather than to the same absolute size. So if the smallest group has 3 rows, instead of cutting every group down to 3 rows, you would cut every group down to a third of its size, which keeps the relative frequencies intact. Or maybe the first behaviour is what you're looking for...
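A rough, untested sketch of that idea, reusing the groups from above (the helper name thin_group and the factor of 1/3 are only illustrative): instead of taking the first n rows of each group, pick an equally spaced subset from it.
import numpy as np

def thin_group(g, target):
    # Pick `target` rows from the group at (roughly) equal spacing
    if len(g) <= target:
        return g
    pos = np.linspace(0, len(g) - 1, target).round().astype(int)
    return g.iloc[pos]

# Reduce every group by the same factor (here to a third of its size),
# so the relative frequencies of the values are preserved
df3 = pd.concat((thin_group(g, max(1, len(g) // 3)) for g in groups), ignore_index=True)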
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
