Pandas sample with weights
I have a DataFrame df and I'd like to sample from it according to the distribution of some variable. Let's say df['type'].value_counts(normalize=True) returns:
A 0.3
B 0.5
C 0.2
I'd like to do something like sampledf = df.sample(weights=df['type'].value_counts(normalize=True)) so that sampledf['type'].value_counts(normalize=True) returns almost the same distribution. How do I pass a dict of frequencies here?
Solution 1:[1]
The weights argument needs one weight per row (a Series aligned with the original df, or the name of a column), so the simplest approach is to add the weights as a column:
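# one weight per row: the number of rows sharing that row's type
df['freq'] = df.groupby('type')['type'].transform('count')
# with neither n nor frac given, sample() returns a single row
sampledf = df.sample(weights=df['freq'])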
Or without adding the column:
sampledf = df.sample(weights = df.groupby('type')['type'].transform('count'))
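The question asks how to pass a dict of frequencies. A minimal sketch, assuming a hypothetical toy frame and a hand-written `freq` dict: mapping the dict onto the column gives one weight per row, which is the shape `weights` expects. The helper `implied_group_probs` is a name introduced here for illustration. It computes the distribution that a single weighted draw would follow. Because every row of a type carries that type's frequency, the probability of drawing a type scales with its frequency squared under these weights, while a plain unweighted `df.sample(...)` reproduces the original proportions in expectation.

```python
import pandas as pd

# Hypothetical toy data matching the question: 30% A, 50% B, 20% C.
df = pd.DataFrame({"type": ["A"] * 30 + ["B"] * 50 + ["C"] * 20})

# Assumed frequency dict, as in the question.
freq = {"A": 0.3, "B": 0.5, "C": 0.2}

# Turn the dict into one weight per row.
row_weights = df["type"].map(freq)


def implied_group_probs(frame, weights, col="type"):
    """Probability that a single weighted draw lands in each group."""
    normalized = weights / weights.sum()  # sample() also normalizes weights to sum to 1
    return normalized.groupby(frame[col]).sum()


uniform = implied_group_probs(df, pd.Series(1.0, index=df.index))
by_dict = implied_group_probs(df, row_weights)
by_count = implied_group_probs(
    df, df.groupby("type")["type"].transform("count")
)

# A weighted sample using the dict-derived weights.
sampledf = df.sample(n=20, weights=row_weights, random_state=0)
```

Comparing `uniform`, `by_dict`, and `by_count` is a quick way to see which weighting matches the distribution you actually want before sampling.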
Solution 2:[2]
In addition to the answer above, note that if you want each type to be sampled with equal probability, you should invert the weights:
df['freq'] = 1./df.groupby('type')['type'].transform('count')
sampledf = df.sample(weights = df.freq)
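That covers the case of two classes. If you have more than two classes, you can generalize the weight calculation as follows, where n_samples is the total number of rows, n_classes is the number of distinct types, and n_samples_j is the number of rows of type j:
w_j = n_samples / (n_classes * n_samples_j)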
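The formula above can be sketched in pandas as follows, assuming a hypothetical toy frame with three types. Since `DataFrame.sample` normalizes weights so they sum to 1, these weights differ from the plain `1./count` weights only by the constant factor `n_samples / n_classes`, so in this sketch both give each type the same total probability.

```python
import pandas as pd

# Hypothetical toy data: 30 rows of A, 50 of B, 20 of C.
df = pd.DataFrame({"type": ["A"] * 30 + ["B"] * 50 + ["C"] * 20})

n_samples = len(df)                     # total number of rows
n_classes = df["type"].nunique()        # number of distinct types
n_samples_j = df.groupby("type")["type"].transform("count")  # rows per type, per row

# w_j = n_samples / (n_classes * n_samples_j), one weight per row
balanced = n_samples / (n_classes * n_samples_j)

# Total probability of each type for a single weighted draw.
group_probs = (balanced / balanced.sum()).groupby(df["type"]).sum()

sampledf = df.sample(n=30, weights=balanced, random_state=0)
```

Here `group_probs` comes out equal for every type, which is the point of the inverse-frequency weighting.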
Solution 3:[3]
There is no need to create "a series of the same length as the original df". Instead, you can sample from each group directly, using the scaled output of value_counts as the per-group sample size:
col = 'type'
sample_factor = .3
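# sample size per group; astype(int) truncates toward zero
weights = (df[col].value_counts() * sample_factor).astype(int)
# g.name is the group's key, used to look up that group's sample size
sampledf = df.groupby(col).apply(lambda g: g.sample(n=weights[g.name]))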
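A more compact variant of the same per-group idea, offered as a sketch rather than as this answer's method: assuming pandas 1.1 or later, `DataFrameGroupBy.sample` accepts `frac` and draws that fraction from every group, which keeps the type proportions close to the original. The toy frame and the value of `sample_factor` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: 30 rows of A, 50 of B, 20 of C.
df = pd.DataFrame({"type": ["A"] * 30 + ["B"] * 50 + ["C"] * 20})

sample_factor = 0.5  # fraction of each group to keep (assumed value)

# Draw the same fraction from every group (stratified sampling).
sampledf = df.groupby("type").sample(frac=sample_factor, random_state=0)

# Proportions of each type in the sample.
proportions = sampledf["type"].value_counts(normalize=True).sort_index()
```

With group sizes that divide evenly, as here, the sampled proportions match the original ones exactly; otherwise they differ only by rounding of the per-group sample sizes.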
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Richard |
| Solution 3 | |
