Assign remainders to specific bins in pandas.qcut()

I am trying to replicate a specific method of assigning records to deciles, and the pandas.qcut() function does a good job of this. My only concern is that there doesn't seem to be a way to assign the leftover records to specific bins when the number of records doesn't divide evenly, as the method I am trying to replicate requires.

This is my example:

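import numpy as np
import pandas as pd

num = np.random.rand(153, 1)
# A list comprehension instead of map(), which returns a lazy iterator in Python 3
my_list = [x[0] for x in num]
ser = pd.Series(my_list)
bins = pd.qcut(ser, 10, labels=False)
bins.value_counts()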

Which outputs:

9    16
4    16
0    16
8    15
7    15
6    15
5    15
3    15
2    15
1    15

There are 7 bins with 15 records and 3 bins with 16. What I would like to do is specify which bins receive 16 records (marked with < below):

9    16 <
4    16
0    16
8    15
7    15
6    15
5    15 <
3    15
2    15 <
1    15

Is this possible using pd.qcut?



Solution 1:[1]

As there was no answer, and the few people I asked didn't think it was possible, I have cobbled together a function that does this:

import numpy as np
import pandas as pd

def defined_qcut(df, value_series, number_of_bins, bins_for_extras, labels=False):
    # Bins are numbered 0 .. number_of_bins - 1
    if any(x < 0 or x >= number_of_bins for x in bins_for_extras):
        raise ValueError('Attempted to allocate to a bin that does not exist')
    base_number, number_of_values_to_allocate = divmod(df[value_series].count(), number_of_bins)
    # Only the first entries of the ordered list are needed, one per leftover value
    bins_for_extras = bins_for_extras[:number_of_values_to_allocate]
    df1 = df.copy()  # leave the caller's dataframe untouched
    if number_of_values_to_allocate == 0:
        df1['bins'] = pd.qcut(df1[value_series], number_of_bins, labels=labels)
        return df1
    elif number_of_values_to_allocate > len(set(bins_for_extras)):
        raise ValueError('There are more values to allocate than distinct bins in the list provided, please select more bins')
    # Number of values each bin receives
    bin_sizes = [base_number + (1 if i in bins_for_extras else 0) for i in range(number_of_bins)]
    df1['rank'] = df1[value_series].rank()
    df1 = df1.sort_values(by=['rank']).reset_index()
    # Label rows by position in sorted order; rows with missing values sort last and get NaN
    df1['bins'] = pd.Series(np.repeat(np.arange(number_of_bins), bin_sizes))
    return df1

It ain't pretty, but it does the job. You pass in the dataframe, the name of the column with the values, and an ordered list of the bins where any extra values should be allocated. I'd happily take some advice on how to improve this code.
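For comparison, here is a more compact sketch of the same idea: rank the values, then hand out bin labels by sorted position. It is only an illustration, not a drop-in replacement. The helper name qcut_with_extras is hypothetical (it is not part of pandas), it works on a Series rather than a dataframe, and it assumes there are no missing values. Ties are broken by order of appearance via rank(method="first"), which is my choice and may not match the method being replicated.

```python
import numpy as np
import pandas as pd


def qcut_with_extras(values, number_of_bins, bins_for_extras):
    """Label each value with a bin 0..number_of_bins-1, giving one extra
    value to each listed bin when the count does not divide evenly.

    Hypothetical helper, not part of pandas. Assumes no NaN values.
    """
    values = pd.Series(values)
    base, remainder = divmod(len(values), number_of_bins)
    # Only the first `remainder` entries of the ordered list are used
    extras = list(bins_for_extras)[:remainder]
    if len(set(extras)) < remainder:
        raise ValueError("Need %d distinct bins for the extra values" % remainder)
    if any(b < 0 or b >= number_of_bins for b in extras):
        raise ValueError("Bin labels must lie in 0..number_of_bins-1")
    sizes = [base + (1 if b in extras else 0) for b in range(number_of_bins)]
    # Bin label for each position in sorted order: 0, 0, ..., 1, 1, ...
    labels_by_position = np.repeat(np.arange(number_of_bins), sizes)
    # method="first" breaks ties by order of appearance, so ranks are 1..n
    positions = values.rank(method="first").astype(int) - 1
    return pd.Series(labels_by_position[positions.to_numpy()], index=values.index)


# Usage with the numbers from the question: 153 values, extras go to bins 9, 5 and 2
rng = np.random.default_rng(0)
ser = pd.Series(rng.random(153))
bins = qcut_with_extras(ser, 10, bins_for_extras=[9, 5, 2])
counts = bins.value_counts().sort_index()
print(counts)
```

With 153 values and 10 bins, bins 2, 5 and 9 end up with 16 values each and the other seven with 15. The result keeps the original index, so it can be assigned straight back to a dataframe column.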

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 RustyBrain