Custom train-test split using two stratified classes
I am trying to use NLP techniques to predict the mental status of patients at the time the doctor's notes were taken. I have two label columns: suffix (multilabel) and mental_status (binary: Lucid or Delirium). I am trying to build train and test sets from a reduced list that meets the following specifications:
- The same set of unique suffixes appears in both train and test
- For each set and each suffix, at least one row each for Lucid and Delirium
I will correct any imbalances with sample weights (I am only trying to predict mental_status, but I also need the data to be balanced on suffix). I just need at least one row for every combination in the full Cartesian product of suffixes and mental statuses.
For example, in the dataset below, only the DWN Communication rows would make it into the train-test split.
suffix mental_status
DWN Communication Lucid
DWN Communication Delirium
DWN Communication Lucid
DWN Communication Delirium
DWN Communication Lucid
DWN Psychiatry Lucid
DWN Psychiatry Delirium
DWN Psychiatry Delirium
DWN Cardio stuff Lucid
DWN Cardio stuff Delirium
DWN Blood ..... Lucid
DWN Blood ..... Lucid
DWN Blood ..... Lucid
The output I would want for the example above for train would be
DWN Communication Lucid
DWN Communication Delirium
DWN Communication Lucid
and test
DWN Communication Lucid
DWN Communication Delirium
I've tried
df.groupby(['suffix','mental_status']).filter(lambda x: len(x) > 1)
but it doesn't ensure there is at least one Delirium and one Lucid row for each suffix.
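One way to enforce the both-classes requirement directly (a sketch, not from the original post) is to count the distinct mental_status values per suffix with `transform('nunique')` and keep only suffixes where that count is 2:

```python
import io
import pandas as pd

# Toy data: only 'blood' has both classes
data = """suffix,mental_status
blood,Delirium
blood,Lucid
blood,Lucid
psych,Delirium
stool,Lucid"""
df = pd.read_csv(io.StringIO(data))

# A suffix is usable only if both classes appear under it:
# the number of distinct mental_status values per suffix must be 2
both = df.groupby('suffix')['mental_status'].transform('nunique') == 2
reduced = df[both]
print(sorted(reduced['suffix'].unique()))  # ['blood']
```

This only performs the reduction step; the stratified split still has to run on `reduced` afterwards.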
As another example,
suf ms
11 blood delirium
15 blood delirium
0 blood lucid
1 blood lucid
5 blood lucid
6 blood lucid
10 blood lucid
16 blood lucid
3 psych delirium
19 psych delirium
4 psych lucid
8 psych lucid
9 psych lucid
13 psych lucid
14 psych lucid
18 psych lucid
7 stool delirium
2 stool lucid
12 stool lucid
17 stool lucid
would be split into
suf ms
11 blood delirium
0 blood lucid
1 blood lucid
5 blood lucid
6 blood lucid
3 psych delirium
4 psych lucid
8 psych lucid
9 psych lucid
13 psych lucid
and
suf ms
15 blood delirium
10 blood lucid
16 blood lucid
19 psych delirium
14 psych lucid
18 psych lucid
Edit:
I think I arrived at a solution:
from sklearn.model_selection import train_test_split

def custom_split(df, test_size):
    # Keep suffixes with enough rows of each class to land in both splits
    f = lambda ms: (
        df[df.mental_status == ms]
        .groupby('suffix')
        .filter(lambda x: len(x) >= 1 / min(test_size, 1 - test_size))
        .suffix
    )
    delirium = f('Delirium')
    lucid = f('Lucid')
    valid_suffixes = set(delirium).intersection(lucid)
    df = df[df.suffix.isin(valid_suffixes)]
    # test_size must be passed by keyword; positionally it would be
    # treated as a second array to split
    df_train, df_test = train_test_split(df, test_size=test_size, random_state=9,
                                         stratify=df.mental_status + df.suffix)
    return df_train, df_test
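As a self-contained sanity check (the data below is reconstructed here from the second example, using the Delirium/Lucid capitalization the function expects, and passing test_size as the keyword argument train_test_split requires): with test_size=0.5, stool is dropped because its single Delirium row cannot appear in both splits, and every surviving (suffix, mental_status) pair lands in both halves:

```python
import io
import pandas as pd
from sklearn.model_selection import train_test_split

def custom_split(df, test_size):
    # Keep suffixes with enough rows of each class to survive the split
    f = lambda ms: (
        df[df.mental_status == ms]
        .groupby('suffix')
        .filter(lambda x: len(x) >= 1 / min(test_size, 1 - test_size))
        .suffix
    )
    valid_suffixes = set(f('Delirium')).intersection(f('Lucid'))
    df = df[df.suffix.isin(valid_suffixes)]
    return train_test_split(df, test_size=test_size, random_state=9,
                            stratify=df.mental_status + df.suffix)

data = """suffix,mental_status
blood,Delirium
blood,Delirium
blood,Lucid
blood,Lucid
blood,Lucid
blood,Lucid
blood,Lucid
blood,Lucid
psych,Delirium
psych,Delirium
psych,Lucid
psych,Lucid
psych,Lucid
psych,Lucid
psych,Lucid
psych,Lucid
stool,Delirium
stool,Lucid
stool,Lucid
stool,Lucid"""
df = pd.read_csv(io.StringIO(data))

train, test = custom_split(df, test_size=0.5)
combos = lambda d: set(zip(d.suffix, d.mental_status))
print(combos(train) == combos(test))  # both halves contain all four valid pairs
```

With test_size=0.5 the filter threshold is 2 rows per (suffix, class), so stool's lone Delirium row eliminates stool entirely, and the stratified split then divides each remaining stratum evenly.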
Solution 1:[1]
What about using the index and duplicated(keep=False)?
test = df[~df.duplicated(keep=False)]
output = df.drop_duplicates()
output = output[~output.index.isin(test.index)]
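To see what this snippet does on a toy frame (a constructed example): rows whose (suffix, mental_status) combination occurs only once go to test, while the first occurrence of each repeated combination goes to output:

```python
import pandas as pd

df = pd.DataFrame({'suffix': ['blood', 'blood', 'blood', 'psych'],
                   'mental_status': ['Lucid', 'Lucid', 'Delirium', 'Delirium']})

# Rows whose full-row combination appears exactly once
test = df[~df.duplicated(keep=False)]
# First occurrence of every combination, minus the singletons above
output = df.drop_duplicates()
output = output[~output.index.isin(test.index)]

print(output.values.tolist())  # [['blood', 'Lucid']]
print(test.values.tolist())    # [['blood', 'Delirium'], ['psych', 'Delirium']]
```

Note that singleton combinations end up only in test, so this alone cannot satisfy the requirement that every combination appear in both sets.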
Solution 2:[2]
Does it work if you just do df.drop_duplicates(inplace=True)? It looks like that’s all there is to it because you just want the unique values based on the combination of both columns.
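For illustration (a constructed example): drop_duplicates with no arguments deduplicates on all columns, leaving one row per distinct (suffix, mental_status) pair:

```python
import pandas as pd

df = pd.DataFrame({'suffix': ['blood', 'blood', 'blood'],
                   'mental_status': ['Lucid', 'Lucid', 'Delirium']})
unique = df.drop_duplicates()  # one row per distinct (suffix, mental_status)
print(unique.values.tolist())  # [['blood', 'Lucid'], ['blood', 'Delirium']]
```

This reduces the frame to unique combinations, but it does not by itself split them into train and test.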
Solution 3:[3]
Use a mask filter:
mask = df['suffix'].str.contains("DWN")
df = df[mask]
df.dropna(inplace=True)
Then use groupby:
data="""index,suf,ms
11,blood,delirium
15,blood,delirium
0,blood,lucid
1,blood,lucid
5,blood,lucid
6,blood,lucid
10,blood,lucid
16,blood,lucid
3,psych,delirium
19,psych,delirium
4,psych,lucid
8,psych,lucid
9,psych,lucid
13,psych,lucid
14,psych,lucid
18,psych,lucid
7,stool,delirium
2,stool,lucid
12,stool,lucid
17,stool,lucid"""
import io
import pandas as pd

df = pd.read_csv(io.StringIO(data), sep=',', index_col='index')
#print(df)
lst = []
grouped = df.groupby(['suf', 'ms'])
for group in grouped.groups:
    lst.append(group)
print(lst)
output:
[('blood', 'delirium'), ('blood', 'lucid'), ('psych', 'delirium'), ('psych', 'lucid'), ('stool', 'delirium'), ('stool', 'lucid')]
you can now use the tuples as filters
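For instance (a sketch reusing a subset of the data above), get_group pulls the rows for a single (suf, ms) tuple:

```python
import io
import pandas as pd

data = """index,suf,ms
11,blood,delirium
15,blood,delirium
0,blood,lucid
3,psych,delirium"""
df = pd.read_csv(io.StringIO(data), index_col='index')

grouped = df.groupby(['suf', 'ms'])
# Retrieve all rows for one (suf, ms) combination
blood_delirium = grouped.get_group(('blood', 'delirium'))
print(list(blood_delirium.index))  # [11, 15]
```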
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | user1889297 |
| Solution 2 | anarchy |
| Solution 3 | |
