Calculate the pairwise correlation between distinct class pairs over two feature columns and the target variable?
Most similar questions compute a single correlation value per feature column, showing how each feature in a dataset correlates with the target variable.
I'd like to do this instead for each distinct pairing of values across two feature columns, rather than once per column.
e.g.,
This is an example of the dataset before the string labels are encoded numerically:
PrimaryProcedure | SecondaryProcedure | LengthOfStay |
---|---|---|
pre_op | brain_surgery | 30 |
pre_op | spinal_implant | 14 |
pre_op | spinal_implant | 10 |
check_up | NULL | 1 |
I'd like a table that shows how strongly each distinct class pair within the two procedure feature columns correlates with a patient's length of stay.
e.g.,
This is an example of the dataset I'd like to produce:
DistinctPairwiseProcedures | Correlation |
---|---|
(pre_op, brain_surgery) | 0.7 |
(pre_op, spinal_implant) | 0.4 |
(check_up) | 0.9 |
In summary, I want a dataframe containing the distinct pairs of procedures and how strongly each is correlated with the target variable, LengthOfStay. I could then sort this dataframe to see which combinations of procedures are worth feeding into a regression model.
The code below gives me a list of the distinct pairwise procedures, but I'm not sure how to use this list as an index for calculating each pair's correlation with LengthOfStay.
from itertools import product
print(list(product(dataframe['PrimaryProcedure'].unique(), dataframe['SecondaryProcedure'].unique())))
DistinctPairwiseProcedures |
---|
(pre_op, brain_surgery) |
(pre_op, spinal_implant) |
(check_up) |
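One possible reading of "correlation for a pair" is the Pearson correlation between a 0/1 indicator column for that pair and LengthOfStay (the point-biserial correlation). The sketch below is a minimal illustration under that assumption, not a definitive answer. The helper name `pair_correlations` and the toy data are hypothetical; only the column names come from the question. Note that `itertools.product` over the unique values enumerates every combination, including ones that never occur together in the data, so this sketch labels each row with the pair it actually contains instead.

```python
import pandas as pd

# Toy data mirroring the example table; None stands in for NULL.
dataframe = pd.DataFrame({
    "PrimaryProcedure": ["pre_op", "pre_op", "pre_op", "check_up"],
    "SecondaryProcedure": ["brain_surgery", "spinal_implant", "spinal_implant", None],
    "LengthOfStay": [30, 14, 10, 1],
})


def pair_correlations(df, primary="PrimaryProcedure",
                      secondary="SecondaryProcedure", target="LengthOfStay"):
    """Hypothetical helper: correlate a 0/1 indicator per observed pair with the target."""

    def label(row):
        # Build a label such as "(pre_op, brain_surgery)"; a missing
        # secondary procedure is simply left out, giving "(check_up)".
        parts = [row[primary], row[secondary]]
        return "(" + ", ".join(p for p in parts if pd.notna(p)) + ")"

    pairs = df.apply(label, axis=1)

    # One indicator column per pair that actually occurs in the data.
    indicators = pd.get_dummies(pairs).astype(float)

    # Pearson correlation of each indicator column with the target.
    corr = indicators.corrwith(df[target].astype(float))

    return (
        corr.rename("Correlation")
        .rename_axis("DistinctPairwiseProcedures")
        .reset_index()
        .sort_values("Correlation", ascending=False, ignore_index=True)
    )


result = pair_correlations(dataframe)
print(result)
```

With only four rows the values are not meaningful, and a pair that appears in every row would give an undefined (NaN) correlation because its indicator has zero variance. If the goal is feature selection for a regression model, comparing per-pair mean LengthOfStay with `groupby` may be an equally reasonable starting point.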
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow