Assign remainders to specific bins in pandas.qcut()
I am trying to replicate a specific method of attributing records into deciles, and the pandas.qcut() function does a good job. My only concern is that there doesn't seem to be a way to assign the leftover records (when the count doesn't divide evenly) to specific bins, as required by the method I am trying to replicate.
This is my example:
import numpy as np
import pandas as pd

num = np.random.rand(153, 1)
my_list = [x[0] for x in num]  # a list, since map() is lazy in Python 3
ser = pd.Series(my_list)
bins = pd.qcut(ser, 10, labels=False)
bins.value_counts()
Which outputs:
9 16
4 16
0 16
8 15
7 15
6 15
5 15
3 15
2 15
1 15
There are 7 bins with 15 records and 3 with 16. What I would like to do is specify which bins receive 16 records (marked with < here):
9 16 <
4 16
0 16
8 15
7 15
6 15
5 15 <
3 15
2 15 <
1 15
Is this possible using pd.qcut?
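To make the target concrete, here is a small sketch of the bin sizes I am after, assuming the extra records go to the bins marked with `<` (9, 5 and 2). The names `bins_for_extras` and `target_sizes` are only illustrative and are not part of pandas.

```python
n_records, n_bins = 153, 10
bins_for_extras = [9, 5, 2]  # assumed: the bins that should get one extra record

# 153 = 10 * 15 + 3, so every bin gets 15 and three bins get one more.
base, remainder = divmod(n_records, n_bins)
chosen = bins_for_extras[:remainder]

target_sizes = {b: base + (1 if b in chosen else 0) for b in range(n_bins)}
print(target_sizes)
# {0: 15, 1: 15, 2: 16, 3: 15, 4: 15, 5: 16, 6: 15, 7: 15, 8: 15, 9: 16}
```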
Solution 1:[1]
As there was no answer, and after asking a few people it didn't seem possible with pd.qcut alone, I have cobbled together a function that does this:
import pandas as pd

def defined_qcut(df, value_series, number_of_bins, bins_for_extras, labels=False):
    # Valid bin numbers run from 0 to number_of_bins - 1.
    if max(bins_for_extras) >= number_of_bins or any(x < 0 for x in bins_for_extras):
        raise ValueError("Attempted to allocate to a bin that doesn't exist")
    base_number, number_of_values_to_allocate = divmod(df[value_series].count(), number_of_bins)
    bins_for_extras = bins_for_extras[:number_of_values_to_allocate]
    if number_of_values_to_allocate == 0:
        # Nothing is left over, so plain qcut is enough.
        df['bins'] = pd.qcut(df[value_series], number_of_bins, labels=labels)
        return df
    elif number_of_values_to_allocate > len(bins_for_extras):
        raise ValueError('There are more values to allocate than the list provided, please select more bins')
    # Number of values each bin should hold.
    bins = {}
    for i in range(number_of_bins):
        number_of_values_in_bin = base_number
        if i in bins_for_extras:
            number_of_values_in_bin += 1
        bins[i] = number_of_values_in_bin
    # Sort by rank so that each bin is a contiguous block of rows.
    df1 = df.copy()
    df1['rank'] = df1[value_series].rank()
    df1 = df1.sort_values(by=['rank'])
    # Label each block of rows with its bin number.
    pieces = []
    row_to_start_allocate = 0
    for bin_number, number_in_bin in bins.items():
        row_to_end_allocate = row_to_start_allocate + number_in_bin
        pieces.append(pd.Series(bin_number, index=df1.index[row_to_start_allocate:row_to_end_allocate]))
        row_to_start_allocate = row_to_end_allocate
    # pd.concat replaces Series.append, which was removed in pandas 2.0.
    df1['bins'] = pd.concat(pieces)
    df1 = df1.reset_index()
    return df1
It ain't pretty, but it does the job. You pass in the dataframe, the name of the column with the values, and an ordered list of the bins where any extra values should be allocated. I'd happily take some advice on how to improve this code.
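One possible tidy-up, offered as a sketch rather than a drop-in replacement: the same idea (sort the values, then hand out contiguous blocks of the desired sizes) can be written with np.repeat. The helper name `assign_bins` is hypothetical, and the sketch assumes there are no NaN values and that `bins_for_extras` contains distinct, valid bin numbers.

```python
import numpy as np
import pandas as pd


def assign_bins(values, number_of_bins, bins_for_extras):
    """Hypothetical helper: equal-count bins, remainder sent to chosen bins."""
    values = pd.Series(values)
    base, remainder = divmod(len(values), number_of_bins)
    extras = list(bins_for_extras)[:remainder]
    if len(extras) < remainder:
        raise ValueError("not enough bins listed to absorb the remainder")
    # Target size of each bin: base everywhere, plus one in the chosen bins.
    sizes = np.full(number_of_bins, base)
    sizes[extras] += 1
    # Bin label for each position in sorted order: 0, 0, ..., 1, 1, ...
    labels_sorted = np.repeat(np.arange(number_of_bins), sizes)
    # Positions of the values in ascending order (stable for ties).
    order = np.argsort(values.to_numpy(), kind="stable")
    out = np.empty(len(values), dtype=int)
    out[order] = labels_sorted
    return pd.Series(out, index=values.index)


rng = np.random.default_rng(0)
ser = pd.Series(rng.random(153))
bins = assign_bins(ser, 10, [9, 5, 2])
counts = bins.value_counts().sort_index()
print(counts)
```

Unlike the function above, this returns only the bin labels (aligned to the input index) instead of a modified dataframe, so the result can be assigned to a column directly.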
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | RustyBrain |