Generate underlying distribution from bins in Python

I found a PDF document describing the income distribution in the US in 1978. For each income range, I have the percentage of the population that falls in that range. I'd like to generate the underlying distribution in Python. The data looks something like this:

under $3000: 6.2%
$3000-$4999: 8.5%
$5000-$6999: 7.6%

etc

See the screenshot for a more detailed description.

[Image: Income bins]

I've found the function scipy.stats.rv_histogram that generates a distribution given a histogram, but I'm not sure how to create this initial histogram.



Solution 1:[1]

Refer to the documentation for scipy.stats.rv_histogram, which states:

Parameters: histogram: tuple of array_like
Tuple containing two array_like objects. The first containing the content of n bins, the second containing the (n+1) bin boundaries. In particular, the return value of np.histogram is accepted.

You can create the histogram like this:

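import numpy as np
import scipy.stats

# Percentage of the population in each income bin (sums to 100)
prob = np.array([6.2, 8.5, 7.6, 10.8, 7.1, 9.6, 15.3, 12.2, 22.7])
total_number = 77330
# Convert percentages to counts; rv_histogram normalizes, so only the ratios matter
data = prob / 100 * total_number
# n+1 bin edges; the top bin is open-ended, so 1e7 is an arbitrary upper cap
bin_boundary = np.array([0, 3000, 5000, 7000, 10000, 12000, 15000, 20000, 25000, 1e7])

# rv_histogram expects (bin contents, bin boundaries)
hist = (data, bin_boundary)
hist_dist = scipy.stats.rv_histogram(hist)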

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Mankind_008