Sampling from a Pandas DataFrame to match a continuous distribution of another DataFrame
I have 41 dataframes of varying sizes (from fewer than 1,000 rows up to about 125k rows). Each dataframe has a column (which for illustrative purposes I'll call Count) that records the number of elements in an adjacent column - something like this:
```
DATAFRAME 1            DATAFRAME 2
----------------       ------------------
letters | Count        letters    | Count
----------------       ------------------
ABC     | 3            ABCDE      | 5
DEFG    | 4            WXYZAB     | 6
AB      | 2            AB         | 2
YZ      | 2            ABCDEFGHIJ | 10
----------------       ------------------
```
However, the distribution of Count is not the same in every dataframe (ignoring the values in the example above), and this is skewing performance in a downstream task. What I want to do is take advantage of the size disparity between the dataframes and sample from the larger ones so that their distribution of Count approximates that of the target (smaller) dataframe. How do I go about this?
Something I had been thinking of is detailed here. Basically, we bin the target distribution (dataframe1['Count']) and use those bins to create a stratified subset via train_test_split. That seems like it would work, but it feels like a workaround, given that I do not actually intend to create a train/test split of the data (although we could just be arguing over semantics here).
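The binning idea can be done directly with pandas, without going through `train_test_split`: bin the Count column in both frames with the same edges, compute the target's per-bin proportions, and draw that many rows from each bin of the larger frame. A minimal sketch; the function name, column name, and bin count are illustrative, not from the original post:

```python
import numpy as np
import pandas as pd

def sample_to_match(source: pd.DataFrame, target: pd.DataFrame,
                    col: str = "Count", n_bins: int = 10,
                    seed: int = 0) -> pd.DataFrame:
    """Subsample `source` so the binned distribution of `col`
    approximates that of `target` (sketch, names are illustrative)."""
    # Shared bin edges over the pooled range so both frames are comparable.
    edges = np.histogram_bin_edges(
        np.concatenate([source[col], target[col]]), bins=n_bins)
    src_bins = pd.cut(source[col], edges, include_lowest=True)
    tgt_bins = pd.cut(target[col], edges, include_lowest=True)

    # Target proportion per bin -> desired number of draws per bin.
    tgt_props = tgt_bins.value_counts(normalize=True)
    n_total = len(target)  # sample down to the target's size

    rng = np.random.default_rng(seed)
    pieces = []
    for b, frac in tgt_props.items():
        pool = source[src_bins == b]
        k = min(len(pool), round(frac * n_total))
        if k > 0:
            pieces.append(pool.sample(k, random_state=rng))
    return pd.concat(pieces)
```

Sampling without replacement caps each bin at whatever the source actually holds, so sparsely populated bins may come up short of their target proportion.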
Is there a proper name, process, or package that tackles this problem explicitly? I've also been trying to understand kernel density estimators as a potential solution - would that be an alternative way to approach this?
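A KDE-based route would be importance weighting: fit a density to both Count columns, weight each source row by the ratio of target density to source density, and sample with those weights. A sketch under the assumption that SciPy's `gaussian_kde` is acceptable; the function name and column name are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def kde_weighted_sample(source: pd.DataFrame, target: pd.DataFrame,
                        col: str = "Count", n=None,
                        seed: int = 0) -> pd.DataFrame:
    """Resample `source` with weights target_kde(x) / source_kde(x)
    so the sample's `col` distribution leans toward the target's."""
    src_kde = gaussian_kde(source[col])
    tgt_kde = gaussian_kde(target[col])
    x = source[col].to_numpy(dtype=float)
    # Density ratio as importance weight; floor avoids division by ~0.
    weights = tgt_kde(x) / np.maximum(src_kde(x), 1e-12)
    n = len(target) if n is None else n
    return source.sample(n=n, weights=weights, random_state=seed)
```

This avoids choosing bin edges, at the cost of a bandwidth choice hidden inside the KDE; for a discrete-looking Count column the binned approach may be the more natural fit.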
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
