Custom train-test split using two stratified classes
I am trying to use NLP techniques to predict the mental status of patients at the time the doctor's notes were taken. I have two label columns: suffix (multilabel) and mental_status (binary: Lucid or Delirium). I am trying to build train and test sets from a reduced list that meets the following specifications:
- The same set of unique suffixes appears in both train and test
- For each set and each suffix, at least one row each for Lucid and Delirium
I will correct any imbalances with sample weights (I am only trying to predict mental_status, but I also need the data to be balanced on suffix). I just need at least one row for every combination in the full Cartesian product of suffixes and mental statuses.
For example, in the dataset below, only the DWN Communication rows would make it into the train-test split.
suffix mental_status
DWN Communication Lucid
DWN Communication Delirium
DWN Communication Lucid
DWN Communication Delirium
DWN Communication Lucid
DWN Psychiatry Lucid
DWN Psychiatry Delirium
DWN Psychiatry Delirium
DWN Cardio stuff Lucid
DWN Cardio stuff Delirium
DWN Blood ..... Lucid
DWN Blood ..... Lucid
DWN Blood ..... Lucid
The output I would want for the example above for train would be
DWN Communication Lucid
DWN Communication Delirium
DWN Communication Lucid
and test
DWN Communication Lucid
DWN Communication Delirium
I've tried
df.groupby(['suffix','mental_status']).filter(lambda x: len(x) > 1)
but it doesn't ensure there is at least one Delirium and one Lucid row for each suffix.
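One way to enforce the both-classes requirement directly (a sketch, not from the original post) is to count the distinct mental_status values per suffix with `transform('nunique')` and keep only suffixes where that count is 2:

```python
import io
import pandas as pd

# Toy data: only 'blood' has both classes
data = """suffix,mental_status
blood,Delirium
blood,Lucid
blood,Lucid
psych,Delirium
stool,Lucid"""
df = pd.read_csv(io.StringIO(data))

# A suffix is usable only if both classes appear under it:
# the number of distinct mental_status values per suffix must be 2
both = df.groupby('suffix')['mental_status'].transform('nunique') == 2
reduced = df[both]
print(sorted(reduced['suffix'].unique()))  # ['blood']
```

This only performs the reduction step; the stratified split still has to run on `reduced` afterwards.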
As another example,
suf ms
11 blood delirium
15 blood delirium
0 blood lucid
1 blood lucid
5 blood lucid
6 blood lucid
10 blood lucid
16 blood lucid
3 psych delirium
19 psych delirium
4 psych lucid
8 psych lucid
9 psych lucid
13 psych lucid
14 psych lucid
18 psych lucid
7 stool delirium
2 stool lucid
12 stool lucid
17 stool lucid
would be split into
suf ms
11 blood delirium
0 blood lucid
1 blood lucid
5 blood lucid
6 blood lucid
3 psych delirium
4 psych lucid
8 psych lucid
9 psych lucid
13 psych lucid
and
suf ms
15 blood delirium
10 blood lucid
16 blood lucid
19 psych delirium
14 psych lucid
18 psych lucid
Edit:
I think I arrived at a solution:
from sklearn.model_selection import train_test_split

def custom_split(df, test_size):
    # Keep suffixes with enough rows of each class to land in both splits
    f = lambda ms: (
        df[df.mental_status == ms]
        .groupby('suffix')
        .filter(lambda x: len(x) >= 1 / min(test_size, 1 - test_size))
        .suffix
    )
    delirium = f('Delirium')
    lucid = f('Lucid')
    valid_suffixes = set(delirium).intersection(lucid)
    df = df[df.suffix.isin(valid_suffixes)]
    # test_size must be passed by keyword; positionally it would be
    # treated as a second array to split
    df_train, df_test = train_test_split(df, test_size=test_size, random_state=9,
                                         stratify=df.mental_status + df.suffix)
    return df_train, df_test
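As a self-contained sanity check (the data below is reconstructed here from the second example, using the Delirium/Lucid capitalization the function expects, and passing test_size as the keyword argument train_test_split requires): with test_size=0.5, stool is dropped because its single Delirium row cannot appear in both splits, and every surviving (suffix, mental_status) pair lands in both halves:

```python
import io
import pandas as pd
from sklearn.model_selection import train_test_split

def custom_split(df, test_size):
    # Keep suffixes with enough rows of each class to survive the split
    f = lambda ms: (
        df[df.mental_status == ms]
        .groupby('suffix')
        .filter(lambda x: len(x) >= 1 / min(test_size, 1 - test_size))
        .suffix
    )
    valid_suffixes = set(f('Delirium')).intersection(f('Lucid'))
    df = df[df.suffix.isin(valid_suffixes)]
    return train_test_split(df, test_size=test_size, random_state=9,
                            stratify=df.mental_status + df.suffix)

data = """suffix,mental_status
blood,Delirium
blood,Delirium
blood,Lucid
blood,Lucid
blood,Lucid
blood,Lucid
blood,Lucid
blood,Lucid
psych,Delirium
psych,Delirium
psych,Lucid
psych,Lucid
psych,Lucid
psych,Lucid
psych,Lucid
psych,Lucid
stool,Delirium
stool,Lucid
stool,Lucid
stool,Lucid"""
df = pd.read_csv(io.StringIO(data))

train, test = custom_split(df, test_size=0.5)
combos = lambda d: set(zip(d.suffix, d.mental_status))
print(combos(train) == combos(test))  # both halves contain all four valid pairs
```

With test_size=0.5 the filter threshold is 2 rows per (suffix, class), so stool's lone Delirium row eliminates stool entirely, and the stratified split then divides each remaining stratum evenly.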
Solution 1:[1]
What about using the index and duplicated(keep=False)?
test = df[~df.duplicated(keep=False)]
output = df.drop_duplicates()
output = output[~output.index.isin(test.index)]
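To see what this snippet does on a toy frame (a constructed example): rows whose (suffix, mental_status) combination occurs only once go to test, while the first occurrence of each repeated combination goes to output:

```python
import pandas as pd

df = pd.DataFrame({'suffix': ['blood', 'blood', 'blood', 'psych'],
                   'mental_status': ['Lucid', 'Lucid', 'Delirium', 'Delirium']})

# Rows whose full-row combination appears exactly once
test = df[~df.duplicated(keep=False)]
# First occurrence of every combination, minus the singletons above
output = df.drop_duplicates()
output = output[~output.index.isin(test.index)]

print(output.values.tolist())  # [['blood', 'Lucid']]
print(test.values.tolist())    # [['blood', 'Delirium'], ['psych', 'Delirium']]
```

Note that singleton combinations end up only in test, so this alone cannot satisfy the requirement that every combination appear in both sets.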
Solution 2:[2]
Does it work if you just do df.drop_duplicates(inplace=True)? It looks like that’s all there is to it because you just want the unique values based on the combination of both columns.
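For illustration (a constructed example): drop_duplicates with no arguments deduplicates on all columns, leaving one row per distinct (suffix, mental_status) pair:

```python
import pandas as pd

df = pd.DataFrame({'suffix': ['blood', 'blood', 'blood'],
                   'mental_status': ['Lucid', 'Lucid', 'Delirium']})
unique = df.drop_duplicates()  # one row per distinct (suffix, mental_status)
print(unique.values.tolist())  # [['blood', 'Lucid'], ['blood', 'Delirium']]
```

This reduces the frame to unique combinations, but it does not by itself split them into train and test.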
Solution 3:[3]
Use a mask filter:
mask = df['suffix'].str.contains("DWN")
df = df[mask]
df.dropna(inplace=True)
Then use groupby:
data="""index,suf,ms
11,blood,delirium
15,blood,delirium
0,blood,lucid
1,blood,lucid
5,blood,lucid
6,blood,lucid
10,blood,lucid
16,blood,lucid
3,psych,delirium
19,psych,delirium
4,psych,lucid
8,psych,lucid
9,psych,lucid
13,psych,lucid
14,psych,lucid
18,psych,lucid
7,stool,delirium
2,stool,lucid
12,stool,lucid
17,stool,lucid"""
import io
import pandas as pd

df = pd.read_csv(io.StringIO(data), sep=',', index_col='index')
#print(df)
lst = []
grouped = df.groupby(['suf', 'ms'])
for group in grouped.groups:
    lst.append(group)
print(lst)
output:
[('blood', 'delirium'), ('blood', 'lucid'), ('psych', 'delirium'), ('psych', 'lucid'), ('stool', 'delirium'), ('stool', 'lucid')]
you can now use the tuples as filters
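For instance (a sketch reusing a subset of the data above), get_group pulls the rows for a single (suf, ms) tuple:

```python
import io
import pandas as pd

data = """index,suf,ms
11,blood,delirium
15,blood,delirium
0,blood,lucid
3,psych,delirium"""
df = pd.read_csv(io.StringIO(data), index_col='index')

grouped = df.groupby(['suf', 'ms'])
# Retrieve all rows for one (suf, ms) combination
blood_delirium = grouped.get_group(('blood', 'delirium'))
print(list(blood_delirium.index))  # [11, 15]
```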
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | user1889297 |
| Solution 2 | anarchy |
| Solution 3 | |
