How can I apply class representation requirements to the nodes of an rpart model in R?

Consider the following data:

dat <- structure(list(
  Success = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L,
              0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L),
  Var1 = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "A", "B", "B", "B", "A",
           "B", "A", "B", "B", "B", "B", "B", "A", "A", "B", "B", "B", "B", "B"),
  Score = 1:29
), class = "data.frame", row.names = c(NA, -29L))

In this data, "Success" is a binary flag, with 1 representing success and 0 representing failure. Score is a numeric score on some measure we want to use to predict Success. Var1 is a class variable, used to identify subjects belonging to a particular categorical grouping.

I am interested in making a model with rpart, using the "Score" variable to predict success, but with minimum and/or maximum class representations required.

A tree in rpart is constructed simply enough with:

library(rpart)

mod <- rpart(Success ~ Score, data = dat)

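Looking at the output of:

library(dplyr)

# Attach the fitted value for each row under an explicit column name
dat_out <- cbind(dat, Pred = predict(mod))

# Count the Var1 groups within each predicted value (one per leaf)
dat_out_sum <- dat_out %>%
  group_by(Pred) %>%
  count(Var1)

we see that 7 of the 10 samples with the lower predicted value belong to the group with a Var1 value of "A".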

Say, for our purposes, this proportion is unacceptable. We would like the best tree possible, but with the additional requirement that at most 60% of the records in this lower node have a Var1 value of "A".
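To make this requirement concrete, it can help to compute the share of a given Var1 level in each leaf of a fitted tree. The following is only a minimal sketch: leaf_share() is a hypothetical helper name (not part of rpart), and it relies on the where component of the fitted object, which records the leaf each training row falls into. The data frame is a compact restatement of the data shown earlier.

```r
library(rpart)

# Compact restatement of the question's data (same values as `dat` above)
dat <- data.frame(
  Success = c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1,
              0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1),
  Var1    = strsplit("ABAAABAAABABBBABABBBBBAABBBBB", "")[[1]],
  Score   = 1:29
)

mod <- rpart(Success ~ Score, data = dat)

# Hypothetical helper: share of `level` among the training rows in each leaf.
# `fit$where` holds, for every training row, the index of its leaf in fit$frame.
leaf_share <- function(fit, group, level) {
  tapply(group == level, fit$where, mean)
}

share_A  <- leaf_share(mod, dat$Var1, "A")
violates <- any(share_A > 0.6)  # TRUE if any leaf exceeds the 60% cap
```

The same helper can be called once per level of Var1 if the constraint involves more than two classes.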

Is there an argument to rpart(), or a way to modify its code, that finds the best-fitting model satisfying this constraint?

I've scoured the documentation here: https://cran.r-project.org/web/packages/rpart/rpart.pdf and have a couple of ideas for how this might be done, but I haven't managed to get the code working.

Approach 1 - Manual identification of acceptable split points.

With a little manipulation, I can arrange my Score variable from smallest to largest, add columns that display the running proportions of the different classes within Var1, and flag every place where a split would meet the proportion requirement for Var1. It is probably not the best code, but it adds a column flagging all acceptable split points.

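library(dplyr)
library(tidyr)

dat2 <- dat %>%
  arrange(Score) %>%                                  # make the ordering by Score explicit
  mutate(one = 1) %>%
  group_by(Var1) %>%
  mutate(Members = cumsum(one)) %>%                   # running count within each Var1 group
  ungroup() %>%
  mutate(Total_People = cumsum(one)) %>%              # running count over all rows
  mutate(Proportion = Members / Total_People) %>%     # running share of the row's own group
  pivot_wider(names_from = Var1, values_from = Proportion) %>%
  rowwise() %>%
  mutate(A = ifelse(is.na(A), 1 - sum(A, B, na.rm = TRUE), A)) %>%
  mutate(B = ifelse(is.na(B), 1 - sum(A, B, na.rm = TRUE), B)) %>%
  ungroup() %>%
  mutate(PotentialSplit = ifelse(A <= 0.6, 1, 0))     # "at most 60%" includes exactly 60%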

What I have yet to figure out is how to tell rpart() that only split points where PotentialSplit == 1 are acceptable.
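As a point of comparison for Approach 1, the constrained search can be written out by hand for a single split. The sketch below does not use rpart at all: best_constrained_split() is a hypothetical helper that tries every threshold on Score, skips those whose lower-Score node exceeds the cap on a given Var1 level, and keeps the one with the smallest residual sum of squares (the criterion rpart's "anova" method uses). The min_bucket default of 7 mirrors rpart's default minbucket, and constraining only the lower-Score side is an assumption taken from the example in the question.

```r
# Compact restatement of the question's data (same values as `dat` above)
dat <- data.frame(
  Success = c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1,
              0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1),
  Var1    = strsplit("ABAAABAAABABBBABABBBBBAABBBBB", "")[[1]],
  Score   = 1:29
)

# Hypothetical helper (not an rpart feature): best single split on `score`
# such that at most `max_share` of the lower node belongs to `level`.
best_constrained_split <- function(y, score, group, level,
                                   max_share = 0.6, min_bucket = 7) {
  ord      <- order(score)
  y        <- y[ord]
  score    <- score[ord]
  is_level <- group[ord] == level
  n        <- length(y)
  if (n < 2 * min_bucket) return(NULL)       # no split can satisfy min_bucket

  best <- NULL
  for (k in min_bucket:(n - min_bucket)) {   # k = size of the lower node
    if (score[k] == score[k + 1]) next       # cannot split between tied scores
    lower <- seq_len(k)
    if (sum(is_level[lower]) > max_share * k) next  # representation cap violated
    sse <- sum((y[lower]  - mean(y[lower]))^2) +
           sum((y[-lower] - mean(y[-lower]))^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(threshold = (score[k] + score[k + 1]) / 2,
                   sse = sse, n_lower = k, share = mean(is_level[lower]))
    }
  }
  best                                       # NULL if no threshold is feasible
}

res <- best_constrained_split(dat$Success, dat$Score, dat$Var1, "A")
```

On this data the sketch selects a threshold of 16.5, where 9 of the 16 records in the lower node have a Var1 value of "A". Extending it to a full tree would mean applying the same search recursively to each child node.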

Approach 2 - Rework the minsplit argument in rpart.control

When building models with rpart(), we can set the minimum number of observations a node must contain before a split is attempted with something like:

modelalt<-rpart(Success~Score, data=dat, control=rpart.control(minsplit=5))

I hoped that I could add some qualifiers to minsplit. Note that the following attempt crashes my R session when I run it (remove the # to try it):

#modalt<-rpart(Success~Score,data=dat,control=rpart.control(minsplit=5[dat$Var1=="B"]))
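It may help to look at what the expression passed to minsplit evaluates to on its own, without calling rpart(). The sketch below only uses base R subsetting; bad_minsplit is an illustrative name, and the Var1 vector restates the values from the data above.

```r
# Var1 values from the question's data
Var1 <- strsplit("ABAAABAAABABBBABABBBBBAABBBBB", "")[[1]]

# Indexing the length-one vector 5 with a 29-element logical vector:
# every TRUE position beyond the first element is out of range and yields NA,
# and position 1 is FALSE here because the first row belongs to group "A".
bad_minsplit <- 5[Var1 == "B"]

length(bad_minsplit)       # one element per "B" row
all(is.na(bad_minsplit))   # every element is NA
```

So minsplit receives a vector of missing values rather than a single integer, which is presumably not something the fitting routine is prepared for. The argument is a single number per call and has no notion of group membership.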

So far, I've had no success with either approach. Ideally, I'm hoping for a solution that allows flexibility in the number of classes in the constraining categorical variable, and also in the relationship between the predictor(s) and the target variable. Any help would be greatly appreciated.
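One indirect way to stay within the existing rpart.control arguments is to fit a tree for each of several minbucket values, discard the fits that break the representation cap, and keep the best of the rest. This is a rough sketch under stated assumptions: the cap is applied to every leaf rather than only the lower one, the candidate range 7:14 is arbitrary, and minsplit is set explicitly because rpart.control() otherwise derives it from minbucket. Since it only searches the splits that rpart happens to find, it is not guaranteed to return the best constrained tree.

```r
library(rpart)

# Compact restatement of the question's data (same values as `dat` above)
dat <- data.frame(
  Success = c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1,
              0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1),
  Var1    = strsplit("ABAAABAAABABBBABABBBBBAABBBBB", "")[[1]],
  Score   = 1:29
)

candidates <- 7:14  # arbitrary grid of minbucket values to try

results <- do.call(rbind, lapply(candidates, function(mb) {
  fit <- rpart(Success ~ Score, data = dat,
               control = rpart.control(minsplit = 20, minbucket = mb, xval = 0))
  data.frame(
    minbucket   = mb,
    # largest share of group "A" found in any leaf of this fit
    max_share_A = max(tapply(dat$Var1 == "A", fit$where, mean)),
    # residual sum of squares on the training data
    sse         = sum((dat$Success - predict(fit))^2)
  )
}))

# Keep only fits where every leaf respects the 60% cap, then take the best fit
feasible <- results[results$max_share_A <= 0.6, ]
chosen   <- feasible[which.min(feasible$sse), ]
```

If feasible is empty, chosen has zero rows, which signals that no setting in the grid meets the requirement.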



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
