Sample 70%, 20% and 10% of the unique values in a column

I have data in a pandas DataFrame as follows:

import pandas as pd

d = {'col1': [1,1,1,2,2,2,3,3,3], 'col2':[1,2,3,1,2,3,1,2,3]}
df = pd.DataFrame(data=d)

I want to take 70%/20%/10% splits of the unique values of col1 (e.g., all the rows where col1 is 1 or 2 but not 3; yes, I know that's 66% in this example) and put each split in its own new dataframe. When I use something like df.sample(frac=.7), I get 70% of all the rows, but with a mix of the possible values in col1.

I know I could take the unique values of col1, pick 70% of them, copy the matching rows into a new dataframe, drop that 70% from my list of unique values, and then repeat the whole exercise for the 20% and 10%, but that seems inefficient. Is there some built-in functionality to split a grouped dataframe like this?



Solution 1:[1]

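I solved this as follows. Note that groupby("col1").sample draws rows within each col1 group, so every value of col1 can appear in each split, and the index filters below assume the index labels are unique.

# Sample 70% of the rows within each col1 group.
train_df = df.groupby("col1").sample(frac=.7)
# Rows not selected for training, matched by index label.
remain = df[~df.index.isin(train_df.index)]
# About two thirds of the remaining 30% is roughly 20% of the total.
test_df = remain.groupby("col1").sample(frac=.66)
# Whatever is left becomes the validation set.
validation_df = remain[~remain.index.isin(test_df.index)]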
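
The question asks for whole values of col1 to stay together, whereas groupby("col1").sample(...) draws rows inside each group. As a hedged alternative, here is a minimal sketch that shuffles the unique values and cuts them at the cumulative fractions. The helper name split_by_unique_values and its parameters are made up for illustration and are not part of pandas. With few unique values the split sizes are only approximate.

```python
import numpy as np
import pandas as pd

def split_by_unique_values(df, column, fractions=(0.7, 0.2, 0.1), seed=None):
    """Hypothetical helper: each unique value of `column` lands in exactly one part."""
    rng = np.random.default_rng(seed)
    # Shuffle the unique values so the assignment to splits is random.
    values = rng.permutation(df[column].unique())
    # Cut points at the cumulative fractions (70% and 90% by default),
    # rounded to whole numbers of unique values.
    cuts = (np.cumsum(fractions)[:-1] * len(values)).round().astype(int)
    groups = np.split(values, cuts)
    # Keep every row whose value belongs to the given group.
    return [df[df[column].isin(group)] for group in groups]

d = {'col1': [1,1,1,2,2,2,3,3,3], 'col2': [1,2,3,1,2,3,1,2,3]}
df = pd.DataFrame(data=d)
train_df, test_df, validation_df = split_by_unique_values(df, "col1", seed=0)
```

With only three unique values, the rounded cut points put two values in train_df, one in test_df, and none in validation_df; a column with more unique values would get closer to 70/20/10.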

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 knobby