Groupby and randomly select without replacement by a column

Here is the example:

import pandas as pd

df = pd.DataFrame(data={"Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
                        "Group_id": [1, 1, 2, 2, 3, 3],
                        "tax_id": [100, 50, 200, 200, 300, 300]})
df.head()

I want to group by Group_id and select at most n random rows for each group, such that tax_id is unique within each group.

Here is my attempt, which is incomplete: it selects at most n rows for every group, but tax_id is not unique within a group.

n = 3
df.groupby("Group_id").apply(lambda x: x if len(x) < n else x.sample(n, random_state=123))

(With this toy df every group has only 2 rows, so each group is returned whole; tax_id 200 then appears twice in group 2, since both CCC and XYZ share it.)

Any suggestions on how to finish this?



Solution 1:[1]

Here is one solution. We can drop the duplicate (Group_id, tax_id) pairs first:

df = df.drop_duplicates(["tax_id", "Group_id"])
df.reset_index(inplace=True, drop=True)

df.groupby("Group_id").apply(lambda x: x if len(x) < 3 else x.sample(3, random_state=123))
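For reference, here is a minimal end-to-end sketch of this approach with n as a parameter; the helper name sample_unique_tax is hypothetical, not from the original answer, and pandas is assumed to be imported as pd:

import pandas as pd

def sample_unique_tax(df, n, seed=123):
    # Hypothetical helper, shown only to tie the two steps together.
    # Keep one row per (Group_id, tax_id) pair so tax_id is unique within each group.
    deduped = df.drop_duplicates(["tax_id", "Group_id"]).reset_index(drop=True)
    # Then sample at most n rows from each group.
    return (deduped.groupby("Group_id", group_keys=False)
                   .apply(lambda x: x if len(x) < n else x.sample(n, random_state=seed)))

print(sample_unique_tax(df, n=3))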

Solution 2:[2]

If your sampling strategy depends on the distribution of the most common (Group_id, tax_id) combinations, then by dropping the duplicates you lose that piece of information. If you want to preserve that distribution, here is one solution, where n determines the number of sampling rounds:

sliced_data = df.copy()
n = 3
samples = pd.DataFrame()
for _ in range(n):
    grouped = sliced_data.groupby("Group_id")
    # Draw one random row from each remaining group.
    sample = grouped.sample(1, random_state=13)
    # Drop every row sharing the sampled (Group_id, tax_id) pair,
    # so the same tax_id cannot be drawn again for that group.
    for _, s in sample.iterrows():
        x = sliced_data.loc[(sliced_data.Group_id == s["Group_id"]) &
                            (sliced_data.tax_id == s["tax_id"])]
        sliced_data = sliced_data.drop(x.index)
    samples = pd.concat([samples, sample])
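As a quick sanity check on the example df above (assuming the loop has just run): group 1 has two distinct tax_ids (100 and 50), while groups 2 and 3 each have a single distinct tax_id, so the expected row counts per group are 2, 1, and 1:

print(samples.groupby("Group_id").size())
# Group_id
# 1    2
# 2    1
# 3    1
# dtype: int64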

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: LearnToGrow
[2] Solution 2: pieffe