Groupby and random selection without replacement by a column
Here is the example:
import pandas as pd

df = pd.DataFrame(data={"Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
                        "Group_id": [1, 1, 2, 2, 3, 3],
                        "tax_id": [100, 50, 200, 200, 300, 300]})
df.head()
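For reference, df.head() prints the first five rows of this frame:

  Name  Group_id  tax_id
0  AAA         1     100
1  ABC         1      50
2  CCC         2     200
3  XYZ         2     200
4  DEF         3     300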
I want to group by Group_id and select at most n random rows for each group, where tax_id is unique within each group.
Here is my solution, which is not complete: it selects at most n random rows for every group, but tax_id is not unique. For n = 3:
n = 3
df.groupby("Group_id").apply(lambda x: x if len(x) < n else x.sample(n, random_state=123))
Any suggestions on how to finish this?
Solution 1:[1]
Here is one solution: we can drop the duplicate (Group_id, tax_id) pairs before sampling:
df = df.drop_duplicates(["tax_id", "Group_id"])
df.reset_index(inplace=True, drop=True)
df.groupby("Group_id").apply(lambda x: x if len(x) else x.sample(3, random_state=123))
Solution 2:[2]
If your sampling strategy is based on the distribution of the most common combinations of Group_id and tax_id, dropping the duplicates loses a piece of information. If you want to preserve that distribution, here is one solution where n is a parameter that determines the number of sampling rounds, i.e. at most n rows per group:
sliced_data = df.copy()
n = 3
samples = pd.DataFrame()

for _ in range(n):
    # Draw one row per group from the rows still available
    grouped = sliced_data.groupby("Group_id")
    sample = grouped.sample(1, random_state=13)
    # Remove every remaining row with the sampled (Group_id, tax_id) pair,
    # so the same tax_id cannot be drawn again for that group
    for _, s in sample.iterrows():
        x = sliced_data.loc[(sliced_data.Group_id == s["Group_id"])
                            & (sliced_data.tax_id == s["tax_id"])]
        sliced_data = sliced_data.drop(x.index)
    samples = pd.concat([samples, sample])
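After the loop, samples holds at most n rows per Group_id, each with a distinct tax_id within its group. Assuming the df from the question, a quick check of the result could be:

print(samples.sort_values("Group_id"))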
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | LearnToGrow |
| Solution 2 | pieffe |

