Understanding caret and oob for Random Forest
With a little help I've got RF tuning via caret running, and the more I think about it, the less clear I am.
So here is what I can't get my head around:
- If I want to tune e.g. `.mtry` via caret, caret will build n models (where n = `number` in `trainControl`), each with a different subset of my data.
- If I cannot afford a separate test data set, I can use `method="oob"` in `trainControl`. This results in whatever `metric` I have defined as part of my `train` command being evaluated on the OOB samples of a given subset, and caret then picks the tuning settings with the "best" `metric`. Correct? (The first snippet after the code below shows how I've been checking this.)
- Once I've got my final model with a tuned `.mtry`, I still want to get a feel for whether there is any indication of overtraining, and again I do not want to use a separate test data set; I was hoping I could somehow rely on the OOB error. Concretely, I was going to compare some metric of the in-bag samples vs. the OOB samples in each tree and somehow aggregate that across all trees (see the sketch right after this list). Does that make any sense if I'm willing to accept the known limitations of the OOB error, or is there a better strategy if I cannot afford a test data set? This sounds like a really stupid question to me, but I cannot find anything using this approach, so a vague feeling tells me the idea is maybe not one of the better ones...
- I also have to admit I don't fully understand the outputs, even after studying the documentation. The code below delivers a Random Forest model `rf` which, after careful inspection, provides predicted values from `rf[["finalModel"]][["predicted"]]`, but it has all these additional rows from upsampling. Any ideas how to get rid of them, returning as many rows as in `input.clean` and in the same order? (My current workaround is the second snippet after the code.)
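To make the third point more concrete, here is a minimal sketch of the in-bag vs. OOB comparison I have in mind. It uses `iris` as a stand-in data set and a plain `randomForest` with `keep.inbag=TRUE` rather than my real data or the caret-tuned model, so treat it purely as an illustration of the idea:

```r
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 100, keep.inbag = TRUE)

# per-tree predictions for every row: an n x ntree matrix of class labels
all.preds <- predict(fit, iris, predict.all = TRUE)$individual

# for each tree, accuracy on its in-bag rows vs. its OOB rows
# (in-bag accuracy will be near-perfect by construction for fully grown trees)
per.tree <- sapply(seq_len(fit$ntree), function(t) {
  inbag <- fit$inbag[, t] > 0
  c(inbag.acc = mean(all.preds[inbag, t] == iris$Species[inbag]),
    oob.acc   = mean(all.preds[!inbag, t] == iris$Species[!inbag]))
})

# aggregate across trees: a large gap would be my indication of overtraining
rowMeans(per.tree)
```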
Thanks in advance!
```r
library(randomForest)
library(caret)
library(dplyr)

# data set for debugging in RStudio
data("imports85")
input <- imports85

# settings
type <- "classification" # either "classification" or "regression"
if (type == "classification") { # only when using the imports85 data set for debugging
  dependent <- "make"
} else if (type == "regression") {
  dependent <- "curbWeight"
}
impute <- "no"
ntree <- 500

# clean up input data and impute if requested
input.labelled <- input[complete.cases(input[, dependent]), ] # drop rows w/o dependent
if (impute == "no") {
  input.clean <- input.labelled[complete.cases(input.labelled), ] # drop cases w/ missing variables
} else if (impute == "yes") {
  # impute missing variables and remove the duplicate of the dependent column,
  # which rfImpute() adds as the first column
  input.clean <- rfImpute(input.labelled[, dependent] ~ ., input.labelled)[, -1]
}
if (type == "classification") {
  input.clean[, dependent] <- droplevels(input.clean[, dependent])
}

# define dependent variable Y and input variables x
Y <- input.clean[, names(input.clean) == dependent]
x <- input.clean[, names(input.clean) != dependent]

# tune RF model
if (type == "classification") {
  cntrl <- trainControl(method = "oob", number = 5, sampling = "up",
                        search = "random", verboseIter = TRUE, savePredictions = TRUE)
  # candidate values for no. of variables per split, skipping the lowest number
  mtry <- var_seq(ncol(input.clean) - 1, classification = TRUE, len = 5)[2:5]
} else if (type == "regression") {
  cntrl <- trainControl(method = "oob", number = 5, search = "random",
                        verboseIter = TRUE, savePredictions = TRUE)
  # candidate values for no. of variables per split, skipping the lowest number
  mtry <- var_seq(ncol(input.clean) - 1, classification = FALSE, len = 5)[2:5]
}
tunegrid <- expand.grid(.mtry = mtry)

if (type == "classification") {
  metric <- "Accuracy"
  maximize <- TRUE
} else if (type == "regression") {
  metric <- "RMSE"
  maximize <- FALSE
}

rf <- train(x, Y, method = "rf", metric = metric, maximize = maximize,
            ntree = ntree, trControl = cntrl, tuneGrid = tunegrid)
```
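Regarding the second bullet: if `method="oob"` works the way I think it does, the OOB-evaluated metric for each candidate `.mtry` should show up directly in the train object. This is how I have been inspecting it, using only the standard `train` accessors:

```r
# one row per candidate mtry; with method = "oob" the metric column
# should be computed from the OOB samples rather than from resampling folds
print(rf$results)
print(rf$bestTune) # the mtry value that won
```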
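And regarding the last bullet, this is my current workaround to get one prediction per row of `input.clean` in the original order. Note that `predict()` on the training rows gives ordinary in-sample predictions, not the OOB predictions from `finalModel$predicted`, so they will be optimistic; that is exactly why I would prefer to recover the OOB predictions instead:

```r
# ordinary predictions on the training rows, aligned with input.clean;
# NOT OOB predictions, so they will look better than the model really is
pred.aligned <- predict(rf, newdata = x)
length(pred.aligned) == nrow(input.clean) # TRUE, and in the same row order
```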
Sources

Source: Stack Overflow, licensed under CC BY-SA 3.0 in line with Stack Overflow's attribution requirements.